Kaggle Movies
About the dataset
This dataset is a subset of the Kaggle Movies Dataset, containing over 40,000 movie titles and overviews. It also includes pre-computed 512-dimensional vector embeddings (generated with OpenAI's text-embedding-3-small model) for both the title and overview fields, making it useful for experimenting with semantic search in MotherDuck.
How to query the dataset
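This dataset is available as the movies table in the kaggle schema of the sample_data database, which is automatically attached to every MotherDuck account. Reference it by its fully qualified name, sample_data.kaggle.movies.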
Schema
| Column Name | Column Type | Description |
|---|---|---|
| title | VARCHAR | Movie title |
| overview | VARCHAR | Short description or synopsis of the movie |
| title_embeddings | FLOAT[512] | Pre-computed vector embedding of the title |
| overview_embeddings | FLOAT[512] | Pre-computed vector embedding of the overview |
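To check the schema above against your own account, you can run DuckDB's DESCRIBE statement on the table. This is a minimal sketch that assumes sample_data is still attached under its default name. The row-count query is only a sanity check against the "over 40,000" figure, and the column aliases are illustrative.

```sql
-- List the column names and types of the movies table.
DESCRIBE sample_data.kaggle.movies;

-- Count all rows, and how many of them have a non-NULL overview.
SELECT
    count(*) AS total_rows,
    count(overview) AS rows_with_overview
FROM sample_data.kaggle.movies;
```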
Example queries
Browse movies
SELECT title, overview
FROM sample_data.kaggle.movies
LIMIT 10;
Find similar movies using vector search
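Use the pre-computed embeddings together with the embedding() function to rank movies by similarity to a free-text search query. The function embeds the query text, and array_cosine_similarity compares that vector with each movie's overview_embeddings: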
SELECT
    title,
    overview,
    array_cosine_similarity(
        overview_embeddings,
        embedding('a space adventure with aliens')
    ) AS similarity
FROM sample_data.kaggle.movies
WHERE overview IS NOT NULL
ORDER BY similarity DESC
LIMIT 10;
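For intuition about the similarity column: cosine similarity depends only on the angle between two vectors. It is 1 when they point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions. The toy query below is a sketch using made-up three-dimensional vectors rather than real embeddings. The column aliases are illustrative, and it should run in any recent DuckDB session without touching the dataset.

```sql
-- Toy vectors cast to fixed-size arrays, the type array_cosine_similarity operates on.
SELECT
    -- Same direction, different length: magnitude does not affect the result.
    array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [2.0, 0.0, 0.0]::FLOAT[3]) AS same_direction,
    -- Perpendicular vectors.
    array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [0.0, 1.0, 0.0]::FLOAT[3]) AS orthogonal,
    -- Opposite directions.
    array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [-1.0, 0.0, 0.0]::FLOAT[3]) AS opposite;
```

Higher values mean closer matches, which is why the search query sorts by similarity in descending order.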
Find movies similar to another movie
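Look up the stored embedding of a reference movie in a CTE, then compare it with the embedding of every other movie:
-- Fetch the overview embedding of the reference movie.
-- LIMIT 1 keeps a single row if several movies share the title.
WITH target AS (
    SELECT overview_embeddings
    FROM sample_data.kaggle.movies
    WHERE title = 'The Matrix'
    LIMIT 1
)
SELECT
    m.title,
    m.overview,
    array_cosine_similarity(m.overview_embeddings, t.overview_embeddings) AS similarity
FROM sample_data.kaggle.movies AS m
CROSS JOIN target AS t  -- target has one row, so every movie is paired with it once
WHERE m.title != 'The Matrix'  -- exclude the reference movie itself
ORDER BY similarity DESC
LIMIT 10;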
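The query above takes a single row for 'The Matrix' with LIMIT 1 and no ORDER BY, so if several rows share that title, which one is used is not guaranteed. Whether titles repeat in this dataset is not stated here, so the following is only a sketch of how you might check before relying on the result. The alias matching_rows is illustrative.

```sql
-- Count how many rows carry the reference title.
SELECT
    title,
    count(*) AS matching_rows
FROM sample_data.kaggle.movies
WHERE title = 'The Matrix'
GROUP BY title;
```

If more than one row comes back, consider narrowing the CTE with an extra condition on overview so the reference embedding is chosen deterministically.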