Text Search in MotherDuck

Text search is a fundamental operation in data analytics - whether you're finding records by name, searching documents for relevant content, or building question-answering systems. This guide covers search strategies available in MotherDuck, from simple pattern matching to advanced semantic search, and how to combine them for optimal results.

Quick Start: Common Search Patterns

Start here to identify the best search method for your use case. The right search approach depends on what you're searching, how you expect to use search, and what results you need. Most use cases fall into one of three patterns, each linking to detailed implementation guidance below:

Keyword Search Over Identifiers: When searching for specific items like company names, product codes, or customer names, use Exact Match for precise and low-latency lookups. If you need typo tolerance (e.g., "MotheDuck" → "MotherDuck"), use Fuzzy Search.

Keyword Search Over Documents: When searching longer text like articles, product descriptions, or documentation, use Full-Text Search. This ranks documents by keyword relevance, and handles cases where users provide a few keywords that should appear in the content.

Semantic Search: When searching by meaning and similarity rather than exact keywords, use Embedding-based Search. This covers:

  • Understanding synonyms (e.g., matching "data warehouse" with "analytics platform")
  • Understanding natural language queries (e.g., "wireless headphones with good battery life")
  • Finding similar content (e.g., support tickets describing similar customer issues)

For answering natural language questions about structured data (e.g., "How many customers do we have in California?"), see Analytics Agents.

Refining Your Search Strategy

If the patterns above don't fully match your use case, use these four questions to navigate to the right method. Each question links to specific sections with implementation details:

  1. What is the search corpus? Consider what you're searching through:

  2. What is the user input? Think about how users express their search:

  3. What is the desired output? Clarify what you're returning:

    • Ranked list (retrieval of documents/records) → Covered by this guide
    • Generated text answers (RAG-style Q&A, chatbots, summarization) → Use retrieval methods from this guide in combination with the prompt() function.
  4. What is the desired search behavior? Think about what search qualities matter:

Search Methods

Exact Match

Use exact match search for specific identifiers, codes, or when you need guaranteed matches. This is the fastest search method.

Using LIKE

For substring matching, use LIKE (or ILIKE for case-insensitive). In patterns, % matches any sequence of characters and _ matches exactly one character.

-- Find places with 'Starbucks' in their name
SELECT name, locality, region
FROM foursquare.main.fsq_os_places
WHERE name LIKE '%Starbucks%'
LIMIT 10;
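
ILIKE and the single-character wildcard _ work the same way. A small sketch (the pattern below is purely illustrative - it matches e.g. 'Wal-Mart' or 'Wal Mart', but not 'Walmart'):

-- Case-insensitive match; '_' stands for exactly one character
SELECT name, locality, region
FROM foursquare.main.fsq_os_places
WHERE name ILIKE 'wal_mart%'
LIMIT 10;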

See also: Pattern Matching in DuckDB documentation

Using Regular Expressions

For more complex pattern matching or matching multiple keywords, use regexp_matches() with (?i) for case-insensitive searches:

-- Find Hacker News posts with 'python', 'javascript', or 'rust' in text
SELECT title, "by", score
FROM sample_data.hn.hacker_news
WHERE regexp_matches(text, '(?i)(python|javascript|rust)')
LIMIT 10;

See also: Regular Expressions in DuckDB documentation

Fuzzy Search (Text Similarity)

Fuzzy search handles typos and spelling variations in entity names like companies, people, or products. Use jaro_winkler_similarity() for most fuzzy matching scenarios - it offers a better balance of accuracy and performance than damerau_levenshtein() or levenshtein().

-- Find places similar to 'McDonalds' (handles typo 'McDonalsd')
SELECT
name,
locality,
region,
jaro_winkler_similarity('McDonalsd', name) AS similarity
FROM foursquare.main.fsq_os_places
ORDER BY similarity DESC
LIMIT 10;
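
Because jaro_winkler_similarity() always returns a score, the query above returns ten rows even when nothing is a close match. To keep only strong candidates, add a similarity threshold - the 0.9 cutoff below is an assumption to tune on your own data:

-- Keep only strong matches; the 0.9 threshold is a starting point to tune
SELECT
    name,
    locality,
    region,
    jaro_winkler_similarity('McDonalsd', name) AS similarity
FROM foursquare.main.fsq_os_places
WHERE jaro_winkler_similarity('McDonalsd', name) > 0.9
ORDER BY similarity DESC
LIMIT 10;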

See also: Text Similarity Functions in DuckDB documentation

Full-Text Search (FTS)

Full-Text Search ranks documents by keyword relevance using BM25 scoring, which considers both how often terms appear in a document and how rare they are across all documents. Use this for articles, descriptions, or longer text where you need relevance ranking. FTS automatically handles word stemming (e.g., "running" matches "run") and removes common stopwords (like "the", "and", "or"), but requires exact word matches - it won't handle typos in search queries.

Basic FTS Setup

FTS requires write access to the table. Since we're using a read-only example database, we first create a copy of the table in a read-write database we own:

CREATE TABLE hn_stories AS 
SELECT id, title, text, "by", score, type
FROM sample_data.hn.hacker_news
WHERE type = 'story'
AND LENGTH(text) > 100
LIMIT 10000;

Build the FTS index on the text column. This creates a new schema called fts_{schema}_{table_name} (in this case fts_main_hn_stories):

PRAGMA create_fts_index(
'hn_stories', -- table name
'id', -- document ID column
'text' -- text column to index
);

Search the index using the match_bm25 function from the newly created schema:

SELECT 
id,
title,
text,
fts_main_hn_stories.match_bm25(id, 'database analytics') AS score
FROM hn_stories
ORDER BY score DESC
LIMIT 10;
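
The create_fts_index pragma also accepts optional named parameters that control how text is analyzed - for example the stemmer and the stopword list mentioned above. A sketch (the values shown are illustrative; see the DuckDB FTS documentation for the full list and defaults):

PRAGMA create_fts_index(
    'hn_stories', 'id', 'text',
    stemmer := 'porter',     -- stemming algorithm to apply
    stopwords := 'english',  -- stopword list to remove before indexing
    overwrite := 1           -- replace the index created above
);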

Index Maintenance

FTS indexes need to be updated when the underlying data changes. Rebuild the index using the overwrite parameter:

PRAGMA create_fts_index('hn_stories', 'id', 'text', overwrite := 1);
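
To remove an index entirely (for example before re-indexing different columns), use the FTS extension's drop pragma:

PRAGMA drop_fts_index('hn_stories');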

See also: Full-Text Search Guide and Full-Text Search Extension in DuckDB documentation

Embedding-Based Search

Embedding-based search finds conceptually similar text by meaning rather than keywords. Use it for natural language queries, for handling synonyms, or when users search with questions - embeddings handle synonyms and typos naturally, without manual configuration.

note

Embedding generation and lookups are priced in AI Units. Business and Lite plans have a default soft limit of 10 AI Units per user/day (sufficient to embed around 600,000 rows) to help prevent unexpected costs. If you'd like to adjust these limits, just ask!

info

The DuckDB VSS extension for approximate vector search (HNSW) is currently experimental, and not supported in MotherDuck's cloud service (Server-Side). Learn more about MotherDuck's support for DuckDB extensions.

Basic Embedding-Based Search Setup

Generate embeddings for your text data, then search using exact vector similarity. For search queries phrased as questions (like "What are the best practices for...?"), see Hypothetical Document Embeddings.

-- Reusing the hn_stories table from the FTS section, add embeddings
ALTER TABLE hn_stories ADD COLUMN text_embedding FLOAT[512];
UPDATE hn_stories SET text_embedding = embedding(text);

-- Semantic search - this will also match texts with related concepts like 'neural networks', 'deep learning', etc.
SELECT
title,
text,
array_cosine_similarity(
embedding('machine learning and artificial intelligence'),
text_embedding
) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 10;

See also: MotherDuck Embedding Function, and array_cosine_similarity in DuckDB documentation

Document Chunking

When documents are longer than ~2000 characters, consider breaking them into smaller chunks to improve retrieval precision and focus results. For production pipelines with PDFs or Word docs, you can use the MotherDuck integration for Unstructured.io. Otherwise, you can also do document chunking in the database - here are some helpful macros:

-- Fixed-size chunking with configurable overlap
CREATE MACRO chunk_fixed_size(text_col, chunk_size, overlap) AS TABLE (
    SELECT
        gs.generate_series AS chunk_number,
        substring(text_col, (gs.generate_series - 1) * (chunk_size - overlap) + 1, chunk_size) AS chunk_text
    FROM generate_series(1, CAST(CEIL(LENGTH(text_col) / (chunk_size - overlap * 1.0)) AS INTEGER)) gs
    WHERE LENGTH(substring(text_col, (gs.generate_series - 1) * (chunk_size - overlap) + 1, chunk_size)) > 50
);

-- Paragraph-based chunking (splits on double newlines)
CREATE MACRO chunk_paragraphs(text_col) AS TABLE (
    WITH chunks AS (SELECT string_split(text_col, e'\n\n') AS arr)
    SELECT
        UNNEST(generate_series(1, array_length(arr))) AS chunk_number,
        UNNEST(arr) AS chunk_text
    FROM chunks
);

-- Sentence-based chunking (splits on sentence boundaries)
CREATE MACRO chunk_sentences(text_col) AS TABLE (
    WITH chunks AS (SELECT string_split_regex(text_col, '[.!?]+\s+') AS arr)
    SELECT
        UNNEST(generate_series(1, array_length(arr))) AS chunk_number,
        UNNEST(arr) AS chunk_text
    FROM chunks
);

Use one of the macros to create chunks from your documents. Fixed-size chunks (300-600 chars with 10-20% overlap) work well for most use cases:

CREATE OR REPLACE TABLE hn_text_chunks AS
SELECT
id AS post_id,
title,
chunks.chunk_number,
chunks.chunk_text
FROM hn_stories
CROSS JOIN LATERAL chunk_fixed_size(text, 500, 100) chunks;
-- Alternative: CROSS JOIN LATERAL chunk_paragraphs(text) chunks;
-- Alternative: CROSS JOIN LATERAL chunk_sentences(text) chunks;

Generate embeddings for the chunks:

ALTER TABLE hn_text_chunks ADD COLUMN chunk_embedding FLOAT[512];
UPDATE hn_text_chunks SET chunk_embedding = embedding(chunk_text);

Once you have chunks with embeddings, search them the same way as full documents using array_cosine_similarity() - the chunk-level results often provide more precise matches than searching entire documents.
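
As a sketch, a chunk-level query over the table created above might look like this; keeping only each post's best-matching chunk is a design choice, not a requirement:

-- Search chunks, then keep the best-matching chunk per post
SELECT
    post_id,
    title,
    max_by(chunk_text, similarity) AS best_chunk,
    MAX(similarity) AS similarity
FROM (
    SELECT
        post_id,
        title,
        chunk_text,
        array_cosine_similarity(
            embedding('machine learning and artificial intelligence'),
            chunk_embedding
        ) AS similarity
    FROM hn_text_chunks
)
GROUP BY post_id, title
ORDER BY similarity DESC
LIMIT 10;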

Performance Guide

Search performance depends on several factors: the chosen search method, cold vs. warm reads, Duckling sizing, and the tenancy model.

The first search query against your data (a cold read) may have higher latency than subsequent queries (warm reads). For production search workloads, ideally dedicate a service account's Duckling primarily to search so that other queries don't compete with search queries. Also account for Duckling cooldown periods - the first search query after a cooldown may again see higher latency.

The DuckDB analytics engine divides data into chunks and processes them in parallel across threads. More data means more chunks to process in parallel, so larger datasets don't necessarily take proportionally longer to search - they just use more threads simultaneously.

Duckling sizing: Optimal latency requires warm reads and enough threads to process your data in parallel. With a Duckling size matched to your dataset, keyword search over identifiers (exact match, fuzzy match) typically achieves latencies in the range of a few hundred milliseconds, while document search (regex, Full-Text Search, embedding search) typically achieves 0.5-3 second latency. Our team is happy to advise on the right resource allocation for your specific workload and latency targets - get in touch to discuss how we can meet your needs.

Handling Concurrent Requests: For handling multiple simultaneous search requests effectively, consider using read scaling to distribute load across multiple read scaling Ducklings. Alternatively, consider per-user tenancy, providing isolated compute resources for each user.

To optimize further, see the strategies below. For questions or requirements beyond this guide, please get in touch.

Search Optimization Strategies

When optimizing search performance, consider the following options.

Pre-filtering

Reduce the search space using structured metadata (e.g. location, categories, date ranges) that can be inferred from the user's context, before running similarity searches:

-- Create a local copy with embeddings for place names (using a subset)
CREATE TABLE places AS
SELECT fsq_place_id, name, locality, region, fsq_category_labels
FROM foursquare.main.fsq_os_places
WHERE name IS NOT NULL
LIMIT 10000;

-- Add embeddings for semantic search
ALTER TABLE places ADD COLUMN name_embedding FLOAT[512];
UPDATE places SET name_embedding = embedding(name);

-- Pre-filter by location before semantic search
WITH filtered_candidates AS (
    SELECT fsq_place_id, name, locality, fsq_category_labels, name_embedding
    FROM places
    WHERE locality = 'New York'  -- Filter by location and region
      AND region = 'NY'
)
SELECT
    name,
    locality,
    fsq_category_labels,
    array_cosine_similarity(
        embedding('italian restaurant'),
        name_embedding
    ) AS similarity
FROM filtered_candidates
ORDER BY similarity DESC
LIMIT 20;

Reducing Embedding Dimensionality

Halving embedding dimensions roughly halves compute time. OpenAI embeddings can be truncated at specific dimensions (256 for text-embedding-3-small, 256 or 512 for text-embedding-3-large). Use lower dimensions for initial pre-filtering, then rerank with full embeddings:

-- Setup: Create normalization macro
CREATE MACRO normalize(v) AS (
    CASE
        WHEN len(v) = 0 THEN NULL
        WHEN sqrt(list_dot_product(v, v)) = 0 THEN NULL
        ELSE list_transform(v, element -> element / sqrt(list_dot_product(v, v)))
    END
);

-- Add lower-dimensional column (e.g., 256 dims instead of 512)
ALTER TABLE hn_stories ADD COLUMN text_embedding_short FLOAT[256];
UPDATE hn_stories SET text_embedding_short = normalize(text_embedding[1:256]);

Then use a two-stage search:

-- Stage 1: Fast pre-filter with short embeddings
SET VARIABLE query_emb = embedding('machine learning algorithms');
SET VARIABLE query_emb_short = normalize(getvariable('query_emb')[1:256])::FLOAT[256];

WITH candidates AS (
    SELECT id,
           array_cosine_similarity(getvariable('query_emb_short'), text_embedding_short) AS similarity
    FROM hn_stories
    ORDER BY similarity DESC
    LIMIT 500  -- Get more candidates if needed
)
-- Stage 2: Rerank with full embeddings
SELECT p.title, p.text,
       array_cosine_similarity(getvariable('query_emb'), p.text_embedding) AS final_similarity
FROM hn_stories p
WHERE p.id IN (SELECT id FROM candidates)
ORDER BY final_similarity DESC
LIMIT 10;

Hybrid Search

FTS typically has lower latency than embedding search, making it effective as a pre-filter to reduce similarity comparisons. Use a large LIMIT in the FTS stage to ensure good recall:

-- FTS pre-filter with large limit, then semantic rerank
SET VARIABLE search_query = 'artificial intelligence neural networks';

WITH fts_candidates AS (
    SELECT id,
           fts_main_hn_stories.match_bm25(id, getvariable('search_query')) AS fts_score
    FROM hn_stories
    ORDER BY fts_score DESC
    LIMIT 10000  -- Large limit to ensure recall
)
SELECT h.id, h.title, h.text,
       array_cosine_similarity(
           embedding(getvariable('search_query')),
           h.text_embedding
       ) AS similarity
FROM hn_stories h
INNER JOIN fts_candidates f ON h.id = f.id
ORDER BY similarity DESC
LIMIT 10;

See also: Search Using DuckDB Part 3 (Hybrid Search)

Advanced Methods

This section covers additional techniques to customize and improve your search. The methods below demonstrate common approaches - many other variants are possible.

note

Some methods in this section make use of the prompt() function, which is priced in AI Units. Business and Lite plans have a default soft limit of 10 AI Units per user/day (sufficient to process around 80,000 rows) to help prevent unexpected costs. If you'd like to adjust these limits, just ask!

LLM-Enhanced Keyword Expansion

Generate synonyms with an LLM, then use them in pattern matching:

-- Generate synonyms using LLM with structured output
SET VARIABLE search_term = 'programming';

WITH synonyms AS (
    SELECT prompt(
        'Give me 5 synonyms for ''' || getvariable('search_term') || '''',
        struct := {'synonyms': 'VARCHAR[]'}
    ).synonyms AS synonym_list
)
-- Search with expanded terms
SELECT
    title,
    text
FROM sample_data.hn.hacker_news, synonyms
WHERE regexp_matches(text, getvariable('search_term') || '|' || array_to_string(synonym_list, '|'))
LIMIT 10;

See also: MotherDuck prompt() Function

Hypothetical Document Embeddings (HyDE)

HyDE improves question-based retrieval by generating a hypothetical answer first, then searching with that answer's embedding. This works because questions and answers have different linguistic patterns - the hypothetical answer better matches actual document content. Use with semantic search or the semantic component of hybrid search.

-- HyDE: Generate hypothetical answer, then search with it
WITH hypothetical_answer AS (
    SELECT prompt(
        'Answer this question in 2-3 sentences:
"What are the key challenges in building scalable distributed systems?"

Focus on typical technical challenges and solutions.'
    ) AS answer
)
-- Search using the hypothetical answer's embedding
SELECT
    title,
    text,
    array_cosine_similarity(
        (SELECT embedding(answer) FROM hypothetical_answer),
        text_embedding
    ) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 10;

See also: Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE paper)

Reranking

Reranking typically happens in two stages: initial retrieval to get top candidates (100-500 results), then precise reranking of that smaller set.

Rule-Based Reranking with Metadata

Refine results based on business rules and metadata like score, category, or freshness:

-- Find similar posts with metadata-based reranking
WITH initial_similarity AS (
    -- Step 1: Fast vector similarity for top candidates
    SELECT
        title,
        text,
        score AS post_score,
        array_cosine_similarity(
            embedding('artificial intelligence and machine learning applications'),
            text_embedding
        ) AS emb_similarity
    FROM hn_stories
    ORDER BY emb_similarity DESC
    LIMIT 100
),
reranked_scores AS (
    -- Step 2: Rerank with metadata (post score)
    SELECT
        title,
        text,
        post_score,
        emb_similarity,
        -- Normalize post score to a 0-1 range based on the actual data
        (post_score / MAX(post_score) OVER ()) AS post_score_norm,
        -- Combined final score: 60% semantic + 40% post score
        (emb_similarity * 0.6 + post_score_norm * 0.4) AS reranked_score
    FROM initial_similarity
)
SELECT
    title,
    text,
    post_score,
    ROUND(emb_similarity, 3) AS semantic_score,
    ROUND(post_score_norm, 3) AS post_score_normalized,
    ROUND(reranked_score, 3) AS final_score
FROM reranked_scores
ORDER BY reranked_score DESC
LIMIT 10;

LLM-Based Reranking

For complex relevance criteria that are hard to express as rules, use an LLM to judge and score results. The prompt() function is optimized for batch processing and processes requests in parallel - so reranking 50 results typically adds only a few hundred milliseconds.

-- LLM reranking for top search results
SET VARIABLE search_query = 'best practices for code review and software quality';

WITH top_candidates AS (
    -- Initial retrieval (e.g., via semantic search)
    SELECT
        id,
        title,
        text,
        array_cosine_similarity(
            embedding(getvariable('search_query')),
            text_embedding
        ) AS initial_score
    FROM hn_stories
    ORDER BY initial_score DESC
    LIMIT 20
),
llm_reranked AS (
    SELECT
        *,
        prompt(
            format(
                'Rate from 1 to 10 how well this post matches the query ''{}''.
Post: {} - {}',
                getvariable('search_query'), title, text
            ),
            struct := {'rating': 'INTEGER'}
        ).rating AS llm_score
    FROM top_candidates
)
SELECT
    title,
    text,
    ROUND(initial_score, 3) AS initial_score,
    llm_score,
    ROUND((0.6 * initial_score + 0.4 * llm_score / 10.0), 3) AS final_score
FROM llm_reranked
ORDER BY final_score DESC
LIMIT 10;

Next Steps