Introducing the embedding() function: Semantic search made easy with SQL!

2024/08/14 - 6 min read

BY

While vectors and vector databases are gaining adoption, they require time-consuming, upfront prep work to "bring your own embeddings" to the database. Embeddings are numeric representations of the semantic meaning between words. To operationalize embeddings, you usually need to call another API to translate your data into an opaque vector before you can even put it into a vector database.

Today, we're taking the first step to make it a lot easier to do semantic search and Retrieval-Augmented Generation (RAG) with your data in MotherDuck. We are excited to announce that the embedding() function is now available in Preview on MotherDuck.

In this blog, we’ll walk through an example of how to use this new function - it’s as easy as:

Copy code

SELECT embedding('Ducks are known for their distinctive quacking sound and webbed feet, which make them excellent swimmers.');

By enabling the creation of embeddings with SQL, we open a new set of possibilities for simplifying how we build RAG applications. Using LLMs in your database is now possible without building extensive AI infrastructure to bring them in during any ETL process. You can even create embeddings in your dbt models!

Making it easier to do vector search follows MotherDuck and DuckDB’s philosophy and commitment to making databases easier to use. Finally, data engineers don’t need to leave the context of the database or familiar SQL to translate data into a vector to prep it for vector search.

What are Text Embeddings?

Text embeddings are a way of representing words in a numerical format to capture their semantic meaning. These embeddings can be used for various applications, including similarity search, clustering, classification, and more. By converting text into a high-dimensional vector, embeddings allow you to perform complex NLP tasks with greater efficiency and accuracy.

Semantic Search vs Full-Text Search (FTS)

Full text search scans the entire text for specific word matches. While this is computationally efficient and a great way to ensure that search terms actually appear in the result, FTS can only identify exact textual matches, which means it is unable to parse the semantic meaning of words.

On the other hand, semantic search based on Text Embeddings allows for more flexible search results where related concepts are recognized, even when the exact words used differ. For example, a search for “robots” may return a relatively high similarity score for the term “AI.”

In a previous blog post, we talked about how to combine Full-Text Search with Semantic Search to get the best of both worlds.

Embedding Function Overview

The embedding() function is designed to work seamlessly within your existing SQL workflows. There is no need for external tools or libraries or setting up your own infrastructure: Simply use the function within your SQL queries to compute embeddings on the fly, and incorporate them into any ETL process, including your dbt models.

We use OpenAI’s text-embedding-3-small model with 512 embedding dimensions because it provides the best value for performance, balancing high throughput with high quality embeddings, and are considering adding support for additional models in the future.

Note: This model outperforms OpenAI’s previous ada v2 model on the MTEB benchmark with scores of 62.3 versus 61.0.

How to Use the Embedding Function

Using the embedding() function is straightforward. Let’s take a look at a simple example:

Copy code

SELECT embedding('Ducks are known for their distinctive quacking sound and webbed feet, which make them excellent swimmers.') AS text_embedding;

The above query computes an embedding for the given text and returns the resulting vector. You can also use the function in more complex queries, such as filtering results based on embedding similarity.

Note: Since the embedding() function is relatively compute intensive, using CTAS or UPDATE operations is recommended so that you do not need to recompute embeddings for every comparison operation.

Copy code

ALTER TABLE my_table ADD COLUMN my_embedding FLOAT[512]; UPDATE my_table SET my_embedding = embedding(my_text);

Example Use Case

Let’s dive into an example using embeddings for similarity search, a common and powerful application of text embeddings.

In the following example, we're performing a similarity search to find movies which titles are most similar to a given piece of text, "artificial intelligence." This query uses embeddings to measure the similarity between given search terms and the movies' titles.

Copy code

SELECT title, overview, array_cosine_similarity( embedding('artificial intelligence'), title_embeddings) as similarity FROM kaggle.movies ORDER BY similarity DESC LIMIT 3

Here's a breakdown of what's happening in our query:

  • Embedding Generation: The kaggle.movies sample dataset contains a 'title_embeddings' column with embeddings of each movie title. This column was populated in advance, using the embedding function (see example code in our docs). In order to find the most similar movies to our search terms "artificial intelligence", we generate an embedding for it on the fly, using the embedding("artificial intelligence") expression.
  • Similarity Calculation: Using the array_cosine_similarity function, we then compare our embedding to each movie’s 'title_embedding'. Cosine similarity measures the cosine of the angle between two vectors (in this case, our embeddings), which effectively provides a measure of how similar the documents are in terms of their contents. Finally, we order the results by their cosine similarity and limit the output to the top 3 movies.
  • Results: The query returns the top 3 movies with the highest similarity to the given search terms . In this case, the results might look something like:
    • A.I. Artificial Intelligence with a similarity score of approximately 0.80
    • I, Robot with a similarity score of approximately 0.46
    • Almost Human with a similarity score of approximately 0.45

This type of similarity search is a powerful tool for applications like recommendations, search engines, and retrieval-augmented generation (RAG) because it is able to capture semantic meaning. In a Full Text Search of our dataset, the movies “I, Robot” and “Almost Human” would not have appeared in the results at all due to their textual differences from our search terms.

Start Building

The embedding() function is now available in Preview for MotherDuck users on a Free Trial or the Standard Plan. To get started, check out our documentation to try it out.

Running the embedding() function over a large table may use large amounts of compute, which is why we have decided to set the following plan limits - refer to our pricing page in the docs for a full breakdown:

  • Free Trial users: Up to 25K embedding rows per day
  • Standard Plan users: Up to 1M embedding rows per day (though this can be raised upon request)

We believe the embedding() function will be an enabler for many of our users by providing access to advanced NLP functionality directly within SQL.

Let us know how you’re using the embedding() function and share your success stories and feedback with us on Slack. If you’d like to discuss your use case in more detail, please connect with us - we’d love to learn more about what you’re building, and are curious to know which embedding models you’d like us to support in the future.

Happy querying!

CONTENT
  1. What are Text Embeddings?
  2. Embedding Function Overview
  3. How to Use the Embedding Function
  4. Example Use Case
  5. Start Building

Start using MotherDuck now!

blog subscription icon

Subscribe to motherduck blog

PREVIOUS POSTS

This Month in the DuckDB Ecosystem: August 2024

2024/08/01 - Mehdi Ouazza

This Month in the DuckDB Ecosystem: August 2024

DuckDB Monthly: Memory Management in DuckDB, Monitoring Python Package Usage, Community Extensions Launch, and more!

Redshift Files: The Hunt for Big Data

2024/08/07 - Jordan Tigani

Redshift Files: The Hunt for Big Data

Jordan Tigani revisits his popular Big Data is Dead blog post with analysis of the data from the Redshift TPC is Not Enough paper.