


💡Preview Feature

This is a preview feature. Preview features may be operationally incomplete and may offer limited backward compatibility.

Embedding Function

The embedding function allows you to generate vector representations (embeddings) of text directly from SQL. These embeddings capture semantic meaning, enabling powerful similarity search and other natural language processing tasks.

The function uses OpenAI's models: text-embedding-3-small (default) with 512 dimensions or text-embedding-3-large with 1024 dimensions. Both models support single- and multi-row inputs, enabling batch processing.

The maximum input size is 2048 characters; larger inputs will be truncated.
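Because oversized inputs are truncated silently, it can help to trim text client-side before calling the function so the cut-off is explicit. A minimal Python sketch (the 2048-character limit is from above; the helper name is illustrative, not part of the API):

```python
MAX_EMBEDDING_INPUT = 2048  # character limit of the embedding function


def truncate_for_embedding(text: str) -> str:
    """Trim text to the embedding input limit so truncation happens explicitly."""
    return text[:MAX_EMBEDDING_INPUT]
```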

Consumption is measured in AI Units. One AI Unit equates to approximately:

  • 60,000 embedding rows with text-embedding-3-small
  • 12,000 embedding rows with text-embedding-3-large

These estimates assume an input size of 1,000 characters.
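As a back-of-envelope check, these rates can be turned into a per-workload estimate. A sketch in Python (the rates are the figures above; the function name is illustrative, and real consumption depends on actual input sizes):

```python
# Approximate rows embedded per AI Unit, assuming ~1,000-character inputs
ROWS_PER_AI_UNIT = {
    "text-embedding-3-small": 60_000,
    "text-embedding-3-large": 12_000,
}


def estimated_ai_units(rows: int, model: str = "text-embedding-3-small") -> float:
    """Rough AI Units consumed to embed `rows` rows with the given model."""
    return rows / ROWS_PER_AI_UNIT[model]
```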

Syntax

SELECT embedding(my_text_column) FROM my_table; -- returns FLOAT[512] column

Parameters

The embedding function accepts parameters using named parameter syntax with the := operator.

| Parameter | Required | Description |
| --- | --- | --- |
| text_input | Yes | The text to be converted into an embedding vector |
| model | No | Model type, either 'text-embedding-3-small' (default) or 'text-embedding-3-large' |

Return Types

The embedding function returns different array sizes depending on the model used:

  • With text-embedding-3-small: Returns FLOAT[512]
  • With text-embedding-3-large: Returns FLOAT[1024]

Examples

Basic Embedding Generation

-- Generate embeddings using the default model (text-embedding-3-small)
SELECT embedding('This is a sample text') AS text_embedding;

-- Generate embeddings using the larger model for higher dimensionality
SELECT embedding('This is a sample text', model:='text-embedding-3-large') AS text_embedding;

Batch Processing

-- Generate embeddings for multiple rows at once
SELECT
    title,
    embedding(overview) AS overview_embeddings
FROM kaggle.movies
LIMIT 10;

Use Cases

Creating an Embedding Database

This example uses the sample movies dataset from MotherDuck's sample data database.

-- Create a new table with embeddings for the first 100 overview entries
CREATE TABLE my_db.movies AS
SELECT
    title,
    overview,
    embedding(overview) AS overview_embeddings
FROM kaggle.movies
LIMIT 100;

If write access to the source table is available, the embedding column can also be added in place:

-- Update the existing table to add a new column for embeddings
ALTER TABLE my_db.movies ADD COLUMN overview_embeddings FLOAT[512];

-- Populate the column with embeddings
UPDATE my_db.movies
SET overview_embeddings = embedding(overview);

The movies table now contains a new column overview_embeddings with vector representations of each movie description:

SELECT * FROM my_db.movies;
| title | overview | overview_embeddings |
| --- | --- | --- |
| 'Toy Story 3' | 'Led by Woody, Andy's toys live happily in [...]' | [0.023089351132512093, -0.012809964828193188, ...] |
| 'Jumanji' | 'When siblings Judy and Peter discover an [...]' | [-0.005538413766771555, 0.0799209326505661, ...] |
| ... | ... | ... |

The array_cosine_similarity function can be used to compute similarities between embeddings. This enables semantic search: retrieving entries that are semantically similar to a query even when they share no keywords.

-- Find movies similar to "Toy Story" based on semantic similarity
SELECT
    title,
    overview,
    array_cosine_similarity(
        embedding('Led by Woody, Andy''s toys live happily [...]'),
        overview_embeddings
    ) AS similarity
FROM my_db.movies
WHERE title != 'Toy Story'
ORDER BY similarity DESC
LIMIT 5;
| title | overview | similarity |
| --- | --- | --- |
| 'Toy Story 3' | 'Woody, Buzz, and the rest of Andy's toys haven't [...]' | 0.7372807860374451 |
| 'Toy Story 2' | 'Andy heads off to Cowboy Camp, leaving his toys [...]' | 0.7222828269004822 |
| ... | ... | ... |
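For intuition, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes, yielding 1.0 for identical directions and 0.0 for orthogonal ones. A plain-Python sketch of the computation (for illustration only, not MotherDuck's implementation):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```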

Building a Recommendation System

Embeddings can be used to build content-based recommendation systems:

-- Create a macro to recommend similar movies
CREATE OR REPLACE MACRO recommend_similar_movies(movie_title) AS TABLE (
    WITH target_embedding AS (
        SELECT embedding(overview) AS emb
        FROM sample_data.kaggle.movies
        WHERE title = movie_title
        LIMIT 1
    )
    SELECT
        m.title AS recommended_title,
        m.overview,
        array_cosine_similarity(t.emb, m.overview_embeddings) AS similarity
    FROM
        sample_data.kaggle.movies m,
        target_embedding t
    WHERE
        m.title != movie_title
    ORDER BY
        similarity DESC
    LIMIT 5
);

-- Use the macro to get recommendations
SELECT * FROM recommend_similar_movies('The Matrix');

Retrieval-Augmented Generation (RAG)

Embeddings are a key component in building RAG systems and can be combined with the prompt function for powerful question-answering capabilities:

-- Create a reusable macro for question answering
CREATE OR REPLACE TEMP MACRO ask_question(question_text) AS TABLE (
    SELECT
        question_text AS question,
        prompt(
            'User asks the following question:\n' || question_text || '\n\n' ||
            'Here is some additional information:\n' ||
            STRING_AGG('Title: ' || title || '; Description: ' || overview, '\n') || '\n' ||
            'Please answer the question based only on the additional information provided.',
            model := 'gpt-4o'
        ) AS response
    FROM (
        SELECT title, overview
        FROM sample_data.kaggle.movies
        ORDER BY array_cosine_similarity(overview_embeddings, embedding(question_text)) DESC
        LIMIT 3
    )
);

-- Use the macro to answer questions
SELECT question, response
FROM ask_question('Can you recommend some good sci-fi movies about AI?');

Security Considerations​

When passing free-text arguments from external sources to the embedding function (e.g., user questions in a RAG application), always use prepared statements to prevent SQL injection.

# Using prepared statements in Python
import duckdb

con = duckdb.connect("md:")  # connect to MotherDuck
user_query = "Led by Woody, Andy's toys live happily [...]"
con.execute("""
    SELECT title, overview,
           array_cosine_similarity(embedding(?), overview_embeddings) AS similarity
    FROM kaggle.movies
    ORDER BY similarity DESC
    LIMIT 5""", [user_query])

Error Handling​

When usage limits have been reached or an unexpected error occurs while computing embeddings, the function does not fail the entire query; instead, it returns NULL for the affected rows.

To check if all embeddings were computed successfully:

-- Check for NULL values in embedding column
SELECT count(*)
FROM my_db.movies
WHERE overview_embeddings IS NULL AND overview IS NOT NULL;

Missing values can be filled in with a separate query:

-- Fill in missing embedding values
UPDATE my_db.movies
SET overview_embeddings = embedding(overview)
WHERE overview_embeddings IS NULL AND overview IS NOT NULL;

Performance Considerations​

  • Batch Processing: when processing multiple rows, consider using LIMIT to control the number of API calls.
  • Model Selection: use text-embedding-3-small for faster, less expensive embeddings when the highest precision isn't critical.
  • Caching: results are not cached between queries, so consider storing embeddings in tables for repeated use.
  • Dimensionality: higher dimensions (using text-embedding-3-large) provide more precise semantic representation but require more storage and computation time.
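To quantify the storage side of that trade-off: a DuckDB FLOAT is a 4-byte single-precision value, so the raw vector payload can be estimated directly. A sketch (the function name is illustrative; actual table size will be larger due to per-row and storage-format overhead):

```python
BYTES_PER_FLOAT = 4  # DuckDB FLOAT is single-precision


def vector_storage_mb(rows: int, dims: int) -> float:
    """Approximate raw storage in MB for `rows` embedding vectors of `dims` FLOATs."""
    return rows * dims * BYTES_PER_FLOAT / 1_000_000
```

For one million rows, FLOAT[512] embeddings need roughly 2 GB of raw vector data, while FLOAT[1024] doubles that.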

Notes​

These capabilities are provided by MotherDuck's integration with OpenAI; inputs to the embedding function will be processed by OpenAI.

For availability and usage limits, see MotherDuck's Pricing Model.

Usage limits are in place to safeguard your spend, not because of throughput limitations. MotherDuck has the capacity to handle high-volume embedding workloads and is always open to working with customers to support any type of workload and model requirements.

If higher usage limits are needed, please reach out directly via the Slack support channel or email support@motherduck.com; we're always happy to help!