EMBEDDING
This is a preview feature. Preview features may be operationally incomplete and may offer limited backward compatibility.
Embedding Function​
Text Embeddings can be generated using the embedding
scalar function.
The embedding function uses OpenAI's text-embedding-3-small
model with 512 output dimensions or text-embedding-3-large
with 1024 output dimensions.
The maximum input size is limited to 2048 characters - larger inputs will be truncated.
Consumption is measured in compute units (CU). One CU hr equates to approx. 25k embedding rows with text-embedding-3-small
or 4k embedding rows with text-embedding-3-large
, assuming an input size of 1000 characters.
Syntax​
SELECT embedding(my_text_column) FROM my_table; -- returns FLOAT[512] column
Model Parameter​
By default, the function computes embeddings using text-embedding-3-small
. The model can be changed using the model:=
parameter.
Supported models:
family | name | output dimensions |
---|---|---|
OpenAI | text-embedding-3-small | 512 |
OpenAI | text-embedding-3-large | 1024 |
Example:
SELECT embedding(my_text_column, model:='text-embedding-3-large') FROM my_table; -- returns FLOAT[1024] column
Example: Compute Embeddings​
This example uses the sample movies dataset from MotherDuck's sample data database.
--- create a new table with embedding for the first 100 overview entries
CREATE TABLE my_db.movies AS
SELECT title, overview, embedding(overview) AS overview_embeddings FROM kaggle.movies LIMIT 100
If write access to the source table is available, the embedding column can also be added in place.
--- update the existing table add new column for embeddings
ALTER TABLE my_db.movies ADD COLUMN overview_embeddings FLOAT[512];
--- populate the column with embeddings
UPDATE my_db.movies SET overview_embeddings = embedding(overview);
Output of the movies table overview
embeddings.
SELECT * FROM my_db.movies;
title | overview | overview_embeddings |
---|---|---|
'Toy Story' | 'Led by Woody, Andy's toys live happily in [...]' | [0.023089351132512093, -0.012809964828193188, ...] |
'Jumanji' | 'When siblings Judy and Peter discover an [...]' | [-0.005538413766771555, 0.0799209326505661, ...] |
'Toy Story' | 'Led by Woody, Andy's toys live happily in [...]' | [-0.006752926856279373, 0.09995235502719879, ...] |
... | ... | ... |
Example: Similarity Search​
array_cosine_similarity
can be used to compute similarities between embeddings.
This can be used for example in the context of similarity search, to retrieve entries that are most semantically similar to a given query text.
Below is an example where most similar movies to Toy Story
are found based on it's description.
SELECT title, overview, array_cosine_similarity(embedding('Led by Woody, Andy''s toys live happily [...]'), overview_embeddings) as similarity
FROM kaggle.movies
WHERE title != 'Toy Story'
ORDER BY similarity DESC
LIMIT 1
title | overview | similarity |
---|---|---|
'Toy Story 2' | 'Andy heads off to Cowboy Camp, leaving his toy [...]' | 0.8114261627197266 |
When passing free-text arguments from external sources to the embedding function (e.g. user question in a RAG application), always use prepared statements.
Doing an embedding lookup with a prepared statement in Python:
user_query = "Led by Woody, Andy''s toys live happily [...]"
con.execute("""
SELECT title, overview, array_cosine_similarity(embedding(?), overview_embeddings) as similarity
FROM kaggle.movies
ORDER BY similarity DESC
LIMIT 1""", [user_query])
Error handling​
When usage limits have been reached while the query is running or an unexpected error occurs while computing the embeddings, the embedding function does not fail but will return null values for the affected rows.
To check if all embeddings were computed successfully, check if any of the values in the emb
column are null.
-- check for null values in embedding column
SELECT count(*) FROM my_db.movies WHERE emb IS NULL AND "text" IS NOT NULL;
Missing values can be filled in like this:
-- fill in missing embedding values
UPDATE my_db.movies SET emb = embedding("text") WHERE emb IS NULL AND "text" IS NOT NULL;
Notes​
These capabilities are provided by MotherDuck's integration with OpenAI and inputs to the embedding function will be processed by OpenAI.
For availability and usage limits, see MotherDuck's Pricing Model. If higher usage limits are needed, please reach out directly to Slack support channel or support@motherduck.com.