


💡Preview Feature

This is a preview feature. Preview features may be operationally incomplete and may offer limited backward compatibility.

Embedding Function

The embedding function allows you to generate vector representations (embeddings) of text directly from SQL. These embeddings capture semantic meaning, enabling powerful similarity search and other natural language processing tasks.

The function uses OpenAI's models: text-embedding-3-small (default) with 512 dimensions or text-embedding-3-large with 1024 dimensions. Both models support single- and multi-row inputs, enabling batch processing.

The maximum input size is 2048 characters; larger inputs will be truncated.
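Because oversized inputs are truncated silently, it can help to trim text client-side before calling the function so the cut-off is explicit. A minimal Python sketch (the 2048-character limit is from above; the helper name is illustrative, not part of the API):

```python
MAX_EMBEDDING_INPUT = 2048  # character limit of the embedding function


def truncate_for_embedding(text: str) -> str:
    """Trim text to the embedding input limit so truncation happens explicitly."""
    return text[:MAX_EMBEDDING_INPUT]
```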

Consumption is measured in AI Units. One AI Unit equates to approximately:

  • 60,000 embedding rows with text-embedding-3-small
  • 12,000 embedding rows with text-embedding-3-large

These estimates assume an input size of 1,000 characters.
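As a back-of-envelope check, these rates can be turned into a per-workload estimate. A sketch in Python (the rates are the figures above; the function name is illustrative, and real consumption depends on actual input sizes):

```python
# Approximate rows embedded per AI Unit, assuming ~1,000-character inputs
ROWS_PER_AI_UNIT = {
    "text-embedding-3-small": 60_000,
    "text-embedding-3-large": 12_000,
}


def estimated_ai_units(rows: int, model: str = "text-embedding-3-small") -> float:
    """Rough AI Units consumed to embed `rows` rows with the given model."""
    return rows / ROWS_PER_AI_UNIT[model]
```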

Syntax

SELECT embedding(my_text_column) FROM my_table; -- returns FLOAT[512] column

Parameters

The embedding function accepts parameters using named parameter syntax with the := operator.

| Parameter | Required | Description |
| --- | --- | --- |
| text_input | Yes | The text to be converted into an embedding vector |
| model | No | Model type, either 'text-embedding-3-small' (default) or 'text-embedding-3-large' |

Return Types

The embedding function returns different array sizes depending on the model used:

  • With text-embedding-3-small: Returns FLOAT[512]
  • With text-embedding-3-large: Returns FLOAT[1024]

Examples

Basic Embedding Generation

-- Generate embeddings using the default model (text-embedding-3-small)
SELECT embedding('This is a sample text') AS text_embedding;

-- Generate embeddings using the larger model for higher dimensionality
SELECT embedding('This is a sample text', model:='text-embedding-3-large') AS text_embedding;

Batch Processing

-- Generate embeddings for multiple rows at once
SELECT
    title,
    embedding(overview) AS overview_embeddings
FROM kaggle.movies
LIMIT 10;

Use Cases

Creating an Embedding Database

This example uses the sample movies dataset from MotherDuck's sample data database.

-- Create a new table with embeddings for the first 100 overview entries
CREATE TABLE my_db.movies AS
SELECT
    title,
    overview,
    embedding(overview) AS overview_embeddings
FROM kaggle.movies
LIMIT 100;

If write access to the source table is available, the embedding column can also be added in place:

-- Update the existing table to add a new column for embeddings
ALTER TABLE my_db.movies ADD COLUMN overview_embeddings FLOAT[512];

-- Populate the column with embeddings
UPDATE my_db.movies
SET overview_embeddings = embedding(overview);

The movies table now contains a new column overview_embeddings with vector representations of each movie description:

SELECT * FROM my_db.movies;
| title | overview | overview_embeddings |
| --- | --- | --- |
| 'Toy Story 3' | 'Led by Woody, Andy's toys live happily in [...]' | [0.023089351132512093, -0.012809964828193188, ...] |
| 'Jumanji' | 'When siblings Judy and Peter discover an [...]' | [-0.005538413766771555, 0.0799209326505661, ...] |
| ... | ... | ... |

The array_cosine_similarity function can be used to compute similarities between embeddings. This enables semantic search: retrieving entries that are semantically similar to a query even when they share no keywords.

-- Find movies similar to "Toy Story" based on semantic similarity
SELECT
    title,
    overview,
    array_cosine_similarity(
        embedding('Led by Woody, Andy''s toys live happily [...]'),
        overview_embeddings
    ) AS similarity
FROM my_db.movies
WHERE title != 'Toy Story'
ORDER BY similarity DESC
LIMIT 5;
| title | overview | similarity |
| --- | --- | --- |
| 'Toy Story 3' | 'Woody, Buzz, and the rest of Andy's toys haven't [...]' | 0.7372807860374451 |
| 'Toy Story 2' | 'Andy heads off to Cowboy Camp, leaving his toys [...]' | 0.7222828269004822 |
| ... | ... | ... |
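For intuition, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes, yielding 1.0 for identical directions and 0.0 for orthogonal ones. A plain-Python sketch of the computation (for illustration only, not MotherDuck's implementation):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```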

Building a Recommendation System

Embeddings can be used to build content-based recommendation systems:

-- Create a macro to recommend similar movies
CREATE OR REPLACE MACRO recommend_similar_movies(movie_title) AS TABLE (
    WITH target_embedding AS (
        SELECT embedding(overview) AS emb
        FROM sample_data.kaggle.movies
        WHERE title = movie_title
        LIMIT 1
    )
    SELECT
        m.title AS recommended_title,
        m.overview,
        array_cosine_similarity(t.emb, m.overview_embeddings) AS similarity
    FROM
        sample_data.kaggle.movies m,
        target_embedding t
    WHERE
        m.title != movie_title
    ORDER BY
        similarity DESC
    LIMIT 5
);

-- Use the macro to get recommendations
SELECT * FROM recommend_similar_movies('The Matrix');

Retrieval-Augmented Generation (RAG)

Embeddings are a key component in building RAG systems and can be combined with the prompt function for powerful question-answering capabilities:

-- Create a reusable macro for question answering
CREATE OR REPLACE TEMP MACRO ask_question(question_text) AS TABLE (
    SELECT
        question_text AS question,
        prompt(
            'User asks the following question:\n' || question_text || '\n\n' ||
            'Here is some additional information:\n' ||
            STRING_AGG('Title: ' || title || '; Description: ' || overview, '\n') || '\n' ||
            'Please answer the question based only on the additional information provided.',
            model := 'gpt-4o'
        ) AS response
    FROM (
        SELECT title, overview
        FROM sample_data.kaggle.movies
        ORDER BY array_cosine_similarity(overview_embeddings, embedding(question_text)) DESC
        LIMIT 3
    )
);

-- Use the macro to answer questions
SELECT question, response
FROM ask_question('Can you recommend some good sci-fi movies about AI?');

Security Considerations​

When passing free-text arguments from external sources to the embedding function (e.g., user questions in a RAG application), always use prepared statements to prevent SQL injection.

# Using prepared statements in Python
import duckdb

con = duckdb.connect("md:")  # connect to MotherDuck
user_query = "Led by Woody, Andy's toys live happily [...]"
con.execute("""
    SELECT title, overview,
           array_cosine_similarity(embedding(?), overview_embeddings) AS similarity
    FROM kaggle.movies
    ORDER BY similarity DESC
    LIMIT 5""", [user_query])

Error Handling​

When usage limits have been reached or an unexpected error occurs while computing embeddings, the function does not fail the entire query; instead, it returns NULL for the affected rows.

To check if all embeddings were computed successfully:

-- Check for NULL values in embedding column
SELECT count(*)
FROM my_db.movies
WHERE overview_embeddings IS NULL AND overview IS NOT NULL;

Missing values can be filled in with a separate query:

-- Fill in missing embedding values
UPDATE my_db.movies
SET overview_embeddings = embedding(overview)
WHERE overview_embeddings IS NULL AND overview IS NOT NULL;

Performance Considerations​

  • Batch Processing: when processing multiple rows, consider using LIMIT to control the number of API calls.
  • Model Selection: use text-embedding-3-small for faster, less expensive embeddings when the highest precision isn't critical.
  • Caching: results are not cached between queries, so consider storing embeddings in tables for repeated use.
  • Dimensionality: higher dimensions (using text-embedding-3-large) provide more precise semantic representation but require more storage and computation time.
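To quantify the storage side of that trade-off: a DuckDB FLOAT is a 4-byte single-precision value, so the raw vector payload can be estimated directly. A sketch (the function name is illustrative; actual table size will be larger due to per-row and storage-format overhead):

```python
BYTES_PER_FLOAT = 4  # DuckDB FLOAT is single-precision


def vector_storage_mb(rows: int, dims: int) -> float:
    """Approximate raw storage in MB for `rows` embedding vectors of `dims` FLOATs."""
    return rows * dims * BYTES_PER_FLOAT / 1_000_000
```

For one million rows, FLOAT[512] embeddings need roughly 2 GB of raw vector data, while FLOAT[1024] doubles that.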

Notes​

These capabilities are provided by MotherDuck's integration with OpenAI; inputs to the embedding function will be processed by OpenAI.

For availability and usage limits, see MotherDuck's Pricing Model.

Usage limits are in place to safeguard your spend, not because of throughput limitations. MotherDuck has the capacity to handle high-volume embedding workloads and is always open to working with customers to support any type of workload and model requirements.

If higher usage limits are needed, please reach out directly via the Slack support channel or email support@motherduck.com; we're always happy to help!