Kaggle Movies

About the dataset

This dataset is a subset of the Kaggle Movies Dataset, containing over 40,000 movie titles and overviews. It also includes pre-computed 512-dimensional vector embeddings (generated with OpenAI's text-embedding-3-small model) for both the title and overview fields, making it useful for experimenting with semantic search in MotherDuck.

How to query the dataset

This dataset is available as part of the sample_data database, which is automatically attached to every MotherDuck account.

Schema

Column Name | Column Type | Description
title | VARCHAR | Movie title
overview | VARCHAR | Short description or synopsis of the movie
title_embeddings | FLOAT[512] | Pre-computed vector embedding of the title
overview_embeddings | FLOAT[512] | Pre-computed vector embedding of the overview
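
To confirm the schema from your own session, a minimal sketch along these lines should work. It assumes the sample_data database is attached under its default name and uses DuckDB's standard DESCRIBE statement; the row count is only a sanity check against the "over 40,000" figure quoted above.

```sql
-- List the column names and types of the movies table.
DESCRIBE sample_data.kaggle.movies;

-- Sanity check: count the rows in the table.
SELECT count(*) AS movie_count
FROM sample_data.kaggle.movies;
```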

Example queries

Browse movies

SELECT title, overview
FROM sample_data.kaggle.movies
LIMIT 10;

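Semantic search

Use the pre-computed embeddings together with the embedding function to find movies similar to a search query. The query text is embedded at query time, and each movie is ranked by the cosine similarity between that vector and its overview embedding: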

SELECT
    title,
    overview,
    array_cosine_similarity(
        overview_embeddings,
        embedding('a space adventure with aliens')
    ) AS similarity
FROM sample_data.kaggle.movies
WHERE overview IS NOT NULL
ORDER BY similarity DESC
LIMIT 10;
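
To get a feel for the similarity column, it can help to run array_cosine_similarity on tiny hand-written vectors. The sketch below uses made-up 2-dimensional arrays, not values from the dataset, and runs without touching the movies table. Vectors pointing the same way should score 1, orthogonal vectors 0, and opposite vectors -1, which is why the query above sorts in descending order.

```sql
-- Toy vectors for illustration only; real embeddings have 512 dimensions.
SELECT
    array_cosine_similarity([1.0, 0.0]::FLOAT[2], [1.0, 0.0]::FLOAT[2])  AS same_direction,
    array_cosine_similarity([1.0, 0.0]::FLOAT[2], [0.0, 1.0]::FLOAT[2])  AS orthogonal,
    array_cosine_similarity([1.0, 0.0]::FLOAT[2], [-1.0, 0.0]::FLOAT[2]) AS opposite;
```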

Find movies similar to another movie

WITH target AS (
    SELECT overview_embeddings
    FROM sample_data.kaggle.movies
    WHERE title = 'The Matrix'
    LIMIT 1
)
SELECT
    m.title,
    m.overview,
    array_cosine_similarity(m.overview_embeddings, t.overview_embeddings) AS similarity
FROM sample_data.kaggle.movies m
CROSS JOIN target t
WHERE m.title != 'The Matrix'
ORDER BY similarity DESC
LIMIT 10;
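
The same patterns should carry over to the title_embeddings column listed in the schema. As a hedged sketch, the query below ranks movies by how close their titles are to a search phrase; the phrase 'haunted house' is an arbitrary example, and the query assumes embedding returns a vector with the same 512 dimensions as the stored title embeddings.

```sql
-- Rank movies by title similarity to an illustrative search phrase.
SELECT
    title,
    array_cosine_similarity(
        title_embeddings,
        embedding('haunted house')
    ) AS similarity
FROM sample_data.kaggle.movies
WHERE title_embeddings IS NOT NULL
ORDER BY similarity DESC
LIMIT 10;
```

Title embeddings tend to suit short, name-like queries, while overview embeddings are the better fit for plot descriptions.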