Text Search in MotherDuck
Text search is a fundamental operation in data analytics - whether you're finding records by name, searching documents for relevant content, or building question-answering systems. This guide covers search strategies available in MotherDuck, from simple pattern matching to advanced semantic search, and how to combine them for optimal results.
Quick Start: Common Search Patterns
Start here to identify the best search method for your use case. The right search approach depends on what you're searching, how you expect to use search, and what results you need. Most use cases fall into one of three patterns, each linking to detailed implementation guidance below:
Keyword Search Over Identifiers: When searching for specific items like company names, product codes, or customer names, use Exact Match for precise and low-latency lookups. If you need typo tolerance (e.g., "MotheDuck" → "MotherDuck"), use Fuzzy Search.
Keyword Search Over Documents: When searching longer text like articles, product descriptions, or documentation, use Full-Text Search. This ranks documents by keyword relevance, and handles cases where users provide a few keywords that should appear in the content.
Semantic Search: When searching by meaning and similarity rather than exact keywords, use Embedding-based Search. This covers:
- Understanding synonyms (e.g., matching "data warehouse" with "analytics platform")
- Understanding natural language queries (e.g., "wireless headphones with good battery life")
- Finding similar content (e.g., support tickets describing similar customer issues)
For answering natural language questions about structured data (e.g., "How many customers do we have in California?"), see Analytics Agents.
Refining Your Search Strategy
If the patterns above don't fully match your use case, use these four questions to navigate to the right method. Each question links to specific sections with implementation details:
1. What is the search corpus? Consider what you're searching through:
- Identifiers like company names, product IDs, or person names → Exact Match or Fuzzy Search
- Documents like articles, descriptions, or reports → Keyword search (regex) or Full-Text Search (FTS) or Embedding-Based Search or Hybrid (combining FTS + embeddings)
- Structured (numerical) data → Analytics Agents that convert natural language questions to SQL
2. What is the user input? Think about how users express their search:
- Single terms like "MotherDuck" → Exact Match or Fuzzy Search
- Keyword phrases like "data warehouse analytics" → Keyword search (regex) or Full-Text Search or Embedding-based search
- Questions like "What companies offer cloud analytics?" → Embedding-based search with HyDE
- Example documents (finding similar content) → Embedding-based search
3. What is the desired output? Clarify what you're returning:
- Ranked list (retrieval of documents/records) → Covered by this guide
- Generated text answers (RAG-style Q&A, chatbots, summarization) → Use retrieval methods from this guide in combination with the prompt() function.
4. What is the desired search behavior? Think about what search qualities matter:
- Exact match for specific words (IDs and codes) → Exact Match or Keyword search (regex)
- Typo resilience to handle misspellings like "MotheDuck" → "MotherDuck" → Fuzzy search
- Synonym resilience to match "data warehouse" with "analytics platform" → Embedding-based search
- Customizable ranking → See Reranking in the Advanced Methods section
- Latency and concurrency → See Performance Guide
Search Methods
Exact Match
Use exact match search for specific identifiers, codes, or when you need guaranteed matches. This is the fastest search method.
Using LIKE
For substring matching, use LIKE (or ILIKE for case-insensitive). In patterns, % matches any sequence of characters and _ matches exactly one character.
-- Find places with 'Starbucks' in their name
SELECT name, locality, region
FROM foursquare.main.fsq_os_places
WHERE name LIKE '%Starbucks%'
LIMIT 10;
See also: Pattern Matching in DuckDB documentation
Using Regular Expressions
For more complex pattern matching or matching multiple keywords, use regexp_matches() with (?i) for case-insensitive searches:
-- Find Hacker News posts with 'python', 'javascript', or 'rust' in text
SELECT title, "by", score
FROM sample_data.hn.hacker_news
WHERE regexp_matches(text, '(?i)(python|javascript|rust)')
LIMIT 10;
See also: Regular Expressions in DuckDB documentation
Fuzzy Search (Text Similarity)
Fuzzy search handles typos and spelling variations in entity names like companies, people, or products. Use jaro_winkler_similarity() for most fuzzy matching scenarios - it offers the best balance of accuracy and performance compared to damerau_levenshtein() or levenshtein().
-- Find places similar to 'McDonalds' (handles typo 'McDonalsd')
SELECT
name,
locality,
region,
jaro_winkler_similarity('McDonalsd', name) AS similarity
FROM foursquare.main.fsq_os_places
ORDER BY similarity DESC
LIMIT 10;
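jaro_winkler_similarity() returns a score between 0 (no similarity) and 1 (identical strings). For programmatic lookups, you will often want to discard weak matches rather than always returning the top 10. A sketch of this pattern - the 0.9 threshold is an assumption you should tune against your own data:

```sql
-- Keep only close matches by applying a similarity threshold
SELECT
    name,
    locality,
    region,
    jaro_winkler_similarity('McDonalsd', name) AS similarity
FROM foursquare.main.fsq_os_places
WHERE jaro_winkler_similarity('McDonalsd', name) > 0.9
ORDER BY similarity DESC
LIMIT 10;
```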
See also: Text Similarity Functions in DuckDB documentation
Full-Text Search (FTS)
Full-Text Search ranks documents by keyword relevance using BM25 scoring, which considers both how often terms appear in a document and how rare they are across all documents. Use this for articles, descriptions, or longer text where you need relevance ranking. FTS automatically handles word stemming (e.g., "running" matches "run") and removes common stopwords (like "the", "and", "or"), but requires exact word matches - it won't handle typos in search queries.
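For reference, the BM25 score of a document D for a query Q takes the following common form (DuckDB's match_bm25 exposes the parameters as k := 1.2 and b := 0.75 by default):

```latex
\text{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
\frac{f(t, D)\,(k + 1)}{f(t, D) + k\left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}
```

Here f(t, D) is the frequency of term t in D, |D| is the document's length, avgdl is the average document length in the corpus, and IDF(t) grows as t becomes rarer across all documents - which is why rare, discriminative terms dominate the ranking.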
Basic FTS Setup
FTS requires write access to the table. Since we're using a read-only example database, we first create a copy of the table in a read-write database we own:
CREATE TABLE hn_stories AS
SELECT id, title, text, "by", score, type
FROM sample_data.hn.hacker_news
WHERE type = 'story'
AND LENGTH(text) > 100
LIMIT 10000;
Build the FTS index on the text column. This creates a new schema called fts_{schema}_{table_name} (in this case fts_main_hn_stories):
PRAGMA create_fts_index(
'hn_stories', -- table name
'id', -- document ID column
'text' -- text column to index
);
Search the index using the match_bm25 function from the newly created schema:
SELECT
id,
title,
text,
fts_main_hn_stories.match_bm25(id, 'database analytics') AS score
FROM hn_stories
ORDER BY score DESC
LIMIT 10;
Index Maintenance
FTS indexes need to be updated when the underlying data changes. Rebuild the index using the overwrite parameter:
PRAGMA create_fts_index('hn_stories', 'id', 'text', overwrite := 1);
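To remove an index entirely (for example, before recreating it with different stemmer or stopword settings), use drop_fts_index:

```sql
PRAGMA drop_fts_index('hn_stories');
```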
See also: Full-Text Search Guide and Full-Text Search Extension in DuckDB documentation
Embedding-Based Search
Embedding-based search finds conceptually similar text by meaning, not keywords. Use this for natural language queries, handling synonyms, or when users search with questions. Embeddings handle synonyms and typos naturally without manual configuration.
The DuckDB VSS extension for approximate vector search (HNSW) is currently experimental and not supported in MotherDuck's cloud service (server-side). Learn more about MotherDuck's support for DuckDB extensions.
Basic Embedding-Based Search Setup
Generate embeddings for your text data, then search using exact vector similarity. For search queries phrased as questions (like "What are the best practices for...?"), see Hypothetical Document Embeddings.
-- Reusing the hn_stories table from the FTS section, add embeddings
ALTER TABLE hn_stories ADD COLUMN text_embedding FLOAT[512];
UPDATE hn_stories SET text_embedding = embedding(text);
-- Semantic search - this will also match texts with related concepts like 'neural networks', 'deep learning', etc.
SELECT
title,
text,
array_cosine_similarity(
embedding('machine learning and artificial intelligence'),
text_embedding
) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 10;
See also: MotherDuck Embedding Function, and array_cosine_similarity in DuckDB documentation
Document Chunking for Embedding-Based Search
When documents are longer than ~2000 characters, consider breaking them into smaller chunks to improve retrieval precision and focus results. For production pipelines with PDFs or Word docs, you can use the MotherDuck integration for Unstructured.io. Alternatively, you can chunk documents directly in the database - here are some helpful macros:
-- Fixed-size chunking with configurable overlap
CREATE MACRO chunk_fixed_size(text_col, chunk_size, overlap) AS TABLE (
SELECT
gs.generate_series as chunk_number,
substring(text_col, (gs.generate_series - 1) * (chunk_size - overlap) + 1, chunk_size) AS chunk_text
FROM generate_series(1, CAST(CEIL(LENGTH(text_col) / (chunk_size - overlap * 1.0)) AS INTEGER)) gs
WHERE LENGTH(substring(text_col, (gs.generate_series - 1) * (chunk_size - overlap) + 1, chunk_size)) > 50
);
-- Paragraph-based chunking (splits on double newlines)
CREATE MACRO chunk_paragraphs(text_col) AS TABLE (
WITH chunks AS (SELECT string_split(text_col, '\n\n') as arr)
SELECT
UNNEST(generate_series(1, array_length(arr))) as chunk_number,
UNNEST(arr) as chunk_text
FROM chunks
);
-- Sentence-based chunking (splits on sentence boundaries)
CREATE MACRO chunk_sentences(text_col) AS TABLE (
WITH chunks AS (SELECT string_split_regex(text_col, '[.!?]+\s+') as arr)
SELECT
UNNEST(generate_series(1, array_length(arr))) as chunk_number,
UNNEST(arr) as chunk_text
FROM chunks
);
Use one of the macros to create chunks from your documents. Fixed-size chunks (300-600 chars with 10-20% overlap) work well for most use cases:
CREATE OR REPLACE TABLE hn_text_chunks AS
SELECT
id AS post_id,
title,
chunks.chunk_number,
chunks.chunk_text
FROM hn_stories
CROSS JOIN LATERAL chunk_fixed_size(text, 500, 100) chunks;
-- Alternative: CROSS JOIN LATERAL chunk_paragraphs(text) chunks;
-- Alternative: CROSS JOIN LATERAL chunk_sentences(text) chunks;
Generate embeddings for the chunks:
ALTER TABLE hn_text_chunks ADD COLUMN chunk_embedding FLOAT[512];
UPDATE hn_text_chunks SET chunk_embedding = embedding(chunk_text);
Once you have chunks with embeddings, search them the same way as full documents using array_cosine_similarity() - the chunk-level results often provide more precise matches than searching entire documents.
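For example, reusing the hn_text_chunks table built above, a chunk-level query follows the same pattern as the document-level search earlier:

```sql
-- Search chunks; each hit points back to its source post via post_id
SELECT
    post_id,
    title,
    chunk_text,
    array_cosine_similarity(
        embedding('machine learning and artificial intelligence'),
        chunk_embedding
    ) AS similarity
FROM hn_text_chunks
ORDER BY similarity DESC
LIMIT 10;
```

Note that a single post can appear multiple times if several of its chunks match; to return each post at most once, group by post_id and keep the maximum chunk similarity per post.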
Performance Guide
Search performance depends on several factors: the chosen search method, cold vs. warm reads, Duckling sizing, and the tenancy model.
When running a search query against your data for the first time (a cold read), latency may be higher than for subsequent queries (warm reads). For production search workloads, ideally dedicate a service account's Duckling primarily to search, so other queries don't compete with search queries. Also account for Duckling cooldown periods - the first search query after a cooldown may experience higher latency.
The DuckDB analytics engine divides data into chunks and processes them in parallel across threads. Because more data means more chunks that can be processed simultaneously, larger datasets don't necessarily take proportionally longer to search, as long as enough threads are available to keep up with the additional chunks.
Duckling sizing: Optimal latency requires warm reads and enough threads to process your data in parallel. With the ideal Duckling sizing configuration matched to your dataset size, keyword search over identifiers (exact match, fuzzy match) typically achieves latencies in the range of a few hundred milliseconds, while document search (regex, Full-Text Search, embedding search) typically achieves 0.5-3 second latency. Our team is happy to help advise on the right resource allocation for your specific workload and latency targets - get in touch to discuss how we can meet your needs.
Handling Concurrent Requests: For handling multiple simultaneous search requests effectively, consider using read scaling to distribute load across multiple read scaling Ducklings. Alternatively, consider per-user tenancy, providing isolated compute resources for each user.
To optimize further, see the strategies below. For questions or requirements beyond this guide, please get in touch.
Search Optimization Strategies
When optimizing search performance, consider the following options.
Pre-filtering
Reduce the search space using structured metadata (e.g. location, categories, date ranges) that can be inferred from the user's context, before running similarity searches:
-- Create a local copy with embeddings for place names (using a subset)
CREATE TABLE places AS
SELECT fsq_place_id, name, locality, region, fsq_category_labels
FROM foursquare.main.fsq_os_places
WHERE name IS NOT NULL
LIMIT 10000;
-- Add embeddings for semantic search
ALTER TABLE places ADD COLUMN name_embedding FLOAT[512];
UPDATE places SET name_embedding = embedding(name);
-- Pre-filter by location before semantic search
WITH filtered_candidates AS (
SELECT fsq_place_id, name, locality, fsq_category_labels, name_embedding
FROM places
WHERE locality = 'New York' -- Filter by location and region
AND region = 'NY'
)
SELECT
name,
locality,
fsq_category_labels,
array_cosine_similarity(
embedding('italian restaurant'),
name_embedding
) AS similarity
FROM filtered_candidates
ORDER BY similarity DESC
LIMIT 20;
Reducing Embedding Dimensionality
Halving embedding dimensions roughly halves compute time. OpenAI embeddings can be truncated at specific dimensions (256 for text-embedding-3-small, 256 or 512 for text-embedding-3-large). Use lower dimensions for initial pre-filtering, then rerank with full embeddings:
-- Setup: Create normalization macro
CREATE MACRO normalize(v) AS (
CASE
WHEN len(v) = 0 THEN NULL
WHEN sqrt(list_dot_product(v, v)) = 0 THEN NULL
ELSE list_transform(v, element -> element / sqrt(list_dot_product(v, v)))
END
);
-- Add lower-dimensional column (e.g., 256 dims instead of 512)
ALTER TABLE hn_stories ADD COLUMN text_embedding_short FLOAT[256];
UPDATE hn_stories SET text_embedding_short = normalize(text_embedding[1:256]);
Then use a two-stage search:
-- Stage 1: Fast pre-filter with short embeddings
SET VARIABLE query_emb = embedding('machine learning algorithms'); -- use the same model as the stored text_embedding column
SET VARIABLE query_emb_short = normalize(getvariable('query_emb')[1:256])::FLOAT[256];
WITH candidates AS (
SELECT id,
array_cosine_similarity(getvariable('query_emb_short'), text_embedding_short) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 500 -- Get more candidates if needed
)
-- Stage 2: Rerank with full embeddings
SELECT p.title, p.text,
array_cosine_similarity(getvariable('query_emb'), p.text_embedding) AS final_similarity
FROM hn_stories p
WHERE p.id IN (SELECT id FROM candidates)
ORDER BY final_similarity DESC
LIMIT 10;
FTS Pre-filtering (Hybrid Search)
FTS typically has lower latency than embedding search, making it effective as a pre-filter to reduce similarity comparisons. Use a large LIMIT in the FTS stage to ensure good recall:
-- FTS pre-filter with large limit, then semantic rerank
SET VARIABLE search_query = 'artificial intelligence neural networks';
WITH fts_candidates AS (
SELECT id,
fts_main_hn_stories.match_bm25(id, getvariable('search_query')) AS fts_score
FROM hn_stories
ORDER BY fts_score DESC
LIMIT 10000 -- Large limit to ensure recall
)
SELECT h.id, h.title, h.text,
array_cosine_similarity(
embedding(getvariable('search_query')),
h.text_embedding
) AS similarity
FROM hn_stories h
INNER JOIN fts_candidates f ON h.id = f.id
ORDER BY similarity DESC
LIMIT 10;
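A related hybrid pattern is Reciprocal Rank Fusion (RRF), which combines the two result lists by rank position instead of by raw scores, so you don't need to normalize BM25 scores against cosine similarities. A sketch reusing the hn_stories table and FTS index from above - the constant 60 is the conventional value from the original RRF formulation:

```sql
-- Reciprocal Rank Fusion: combine FTS and embedding rankings by rank position
SET VARIABLE search_query = 'artificial intelligence neural networks';
SET VARIABLE query_emb = embedding(getvariable('search_query'));
WITH fts_ranked AS (
    SELECT id, row_number() OVER (
        ORDER BY fts_main_hn_stories.match_bm25(id, getvariable('search_query')) DESC NULLS LAST
    ) AS fts_rank
    FROM hn_stories
),
emb_ranked AS (
    SELECT id, row_number() OVER (
        ORDER BY array_cosine_similarity(getvariable('query_emb'), text_embedding) DESC
    ) AS emb_rank
    FROM hn_stories
)
SELECT h.title, h.text,
    1.0 / (60 + f.fts_rank) + 1.0 / (60 + e.emb_rank) AS rrf_score
FROM hn_stories h
JOIN fts_ranked f ON h.id = f.id
JOIN emb_ranked e ON h.id = e.id
ORDER BY rrf_score DESC
LIMIT 10;
```

Because RRF only uses rank positions, it is robust to the very different score distributions of BM25 and cosine similarity, at the cost of discarding the magnitude of each score.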
See also: Search Using DuckDB Part 3 (Hybrid Search)
Advanced Methods
This section covers additional techniques to customize and improve your search. The methods below demonstrate common approaches - many other variants are possible.
LLM-Enhanced Keyword Expansion
Generate synonyms with an LLM, then use them in pattern matching:
-- Generate synonyms using LLM with structured output
SET VARIABLE search_term = 'programming';
WITH synonyms AS (
SELECT prompt(
'Give me 5 synonyms for ''' || getvariable('search_term') || '''',
struct := {'synonyms': 'VARCHAR[]'}
).synonyms AS synonym_list
)
-- Search with expanded terms
SELECT
title,
text
FROM sample_data.hn.hacker_news, synonyms
WHERE regexp_matches(text, getvariable('search_term') || '|' || array_to_string(synonym_list, '|'))
LIMIT 10;
See also: MotherDuck prompt() Function
Hypothetical Document Embeddings (HyDE)
HyDE improves question-based retrieval by generating a hypothetical answer first, then searching with that answer's embedding. This works because questions and answers have different linguistic patterns - the hypothetical answer better matches actual document content. Use with semantic search or the semantic component of hybrid search.
-- HyDE: Generate hypothetical answer, then search with it
WITH hypothetical_answer AS (
SELECT prompt(
'Answer this question in 2-3 sentences:
"What are the key challenges in building scalable distributed systems?"
Focus on typical technical challenges and solutions.'
) AS answer
)
-- Search using the hypothetical answer's embedding
SELECT
title,
text,
array_cosine_similarity(
(SELECT embedding(answer) FROM hypothetical_answer),
text_embedding
) AS similarity
FROM hn_stories
ORDER BY similarity DESC
LIMIT 10;
See also: Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE paper)
Reranking
Reranking typically happens in two stages: initial retrieval to get top candidates (100-500 results), then precise reranking of that smaller set.
Rule-Based Reranking with Metadata
Refine results based on business rules and metadata like score, category, or freshness:
-- Find similar posts with metadata-based reranking
WITH initial_similarity AS (
-- Step 1: Fast vector similarity for top candidates
SELECT
title,
text,
score as author_score,
array_cosine_similarity(
embedding('artificial intelligence and machine learning applications'),
text_embedding
) AS emb_similarity
FROM hn_stories
ORDER BY emb_similarity DESC
LIMIT 100
),
reranked_scores AS (
-- Step 2: Rerank with metadata (author score)
SELECT
title,
text,
author_score,
emb_similarity,
-- Score boost (normalize to 0-1 range based on actual data)
(author_score / MAX(author_score) OVER ()) AS author_score_norm,
-- Combined final score: 60% semantic + 40% author score
(emb_similarity * 0.6 + author_score_norm * 0.4) AS reranked_score
FROM initial_similarity
)
SELECT
title,
text,
author_score,
ROUND(emb_similarity, 3) as semantic_score,
ROUND(author_score_norm, 3) as author_score_normalized,
ROUND(reranked_score, 3) as final_score
FROM reranked_scores
ORDER BY reranked_score DESC
LIMIT 10;
LLM-Based Reranking
For complex relevance criteria that are hard to express as rules, use an LLM to judge and score results. The prompt() function is optimized for batch processing and processes requests in parallel - so reranking 50 results typically adds only a few hundred milliseconds.
-- LLM reranking for top search results
SET VARIABLE search_query = 'best practices for code review and software quality';
WITH top_candidates AS (
-- Initial retrieval (e.g., via semantic search)
SELECT
id,
title,
text,
array_cosine_similarity(
embedding(getvariable('search_query')),
text_embedding
) AS initial_score
FROM hn_stories
ORDER BY initial_score DESC
LIMIT 20
),
llm_reranked AS (
SELECT
*,
prompt(
format(
'Rate how well this post matches the query ''{}''.
Post: {} - {}',
getvariable('search_query'), title, text
),
struct := {'rating': 'INTEGER'}
).rating AS llm_score
FROM top_candidates
)
SELECT
title,
text,
ROUND(initial_score, 3) as initial_score,
llm_score,
ROUND((0.6 * initial_score + 0.4 * llm_score / 10.0), 3) AS final_score
FROM llm_reranked
ORDER BY final_score DESC
LIMIT 10;
Next Steps
- Check out the MotherDuck Embedding Function and Prompt Function
- Review the Full-Text Search Guide in DuckDB documentation
- Read the MotherDuck blog series: Search Using DuckDB Part 1, Part 2, Part 3
- Explore Building Analytics Agents with MotherDuck