Skip to main content

EMBEDDING

info

This is a preview feature. Preview features may be operationally incomplete and may offer limited backward compatibility.

Compute Embeddings

Text Embeddings for a column can be generated using the embedding scalar function.

Syntax

SELECT embedding(my_text_column) FROM my_table; -- returns FLOAT[512] column

Example: Compute Embeddings

This example uses the sample movies dataset from MotherDuck's sample data database.

--- create a new table with embedding for the first 100 overview entries
CREATE TABLE my_db.movies AS
SELECT title, overview, embedding(overview) AS overview_embeddings FROM kaggle.movies LIMIT 100

If write access to the source table is available, the embedding column can also be added in place.

--- update the existing table add new column for embeddings
ALTER TABLE my_db.movies ADD COLUMN overview_embeddings FLOAT[512];
--- populate the column with embeddings
UPDATE my_db.movies SET overview_embeddings = embedding(overview);

Output of the movies table overview embeddings.

SELECT * FROM my_db.movies;
titleoverviewoverview_embeddings
'Toy Story''Led by Woody, Andy's toys live happily in [...]'[0.023089351132512093, -0.012809964828193188, ...]
'Jumanji''When siblings Judy and Peter discover an [...]'[-0.005538413766771555, 0.0799209326505661, ...]
'Toy Story''Led by Woody, Andy's toys live happily in [...]'[-0.006752926856279373, 0.09995235502719879, ...]
.........

array_cosine_similarity can be used to compute similarities between embeddings. This can be used for example in the context of similarity search, to retrieve entries that are most semantically similar to a given query text.

Below is an example where most similar movies to Toy Story are found based on it's description.

SELECT title, overview, array_cosine_similarity(embedding('Led by Woody, Andy''s toys live happily [...]'), overview_embeddings) as similarity 
FROM kaggle.movies
WHERE title != 'Toy Story'
ORDER BY similarity DESC
LIMIT 1
titleoverviewsimilarity
'Toy Story 2''Andy heads off to Cowboy Camp, leaving his toy [...]'0.8114261627197266
warning

When passing free-text arguments from external sources to the embedding function (e.g. user question in a RAG application), always use prepared statements.

Doing an embedding lookup with a prepared statement in Python:

user_query = "Led by Woody, Andy''s toys live happily [...]"
con.execute("""
SELECT title, overview, array_cosine_similarity(embedding(?), overview_embeddings) as similarity
FROM kaggle.movies
ORDER BY similarity DESC
LIMIT 1""", [user_query])

Doing the same embedding lookup with a prepared statement in MotherDuck-WASM:

const user_query = 'Led by Woody, Andy\'s toys live happily [...]';
const result = await connection.evaluatePreparedStatement(`
SELECT title, overview, array_cosine_similarity(embedding(?), overview_embeddings) as similarity
FROM kaggle.movies
ORDER BY similarity DESC
LIMIT 1`, [user_query]);

Error handling

When usage limits have been reached while the query is running or an unexpected error occurs while computing the embeddings, the embedding function does not fail but will return null values for the affected rows.

To check if all embeddings were computed successfully, check if any of the values in the emb column are null.

-- check for null values in embedding column
SELECT count(*) FROM my_db.movies WHERE emb IS NULL AND "text" IS NOT NULL;

Missing values can be filled in like this:

-- fill in missing embedding values
UPDATE my_db.movies SET emb = embedding("text") WHERE emb IS NULL AND "text" IS NOT NULL;

Notes

The embedding function uses OpenAI's text-embedding-3-small model with 512 output dimensions. The maximum input size is limited to 2048 characters - larger inputs will be truncated.

These capabilities are provided by MotherDuck's integration with OpenAI and inputs to the embedding function will be processed by OpenAI. For availability and pricing, see MotherDuck's Pricing Model. If higher usage limits are needed, please reach out directly to Slack support channel or support@motherduck.com.