---
sidebar_position: 3
title: Kaggle Movies
description: A dataset of over 40,000 movies with titles, overviews, and pre-computed embeddings for semantic search.
---

import EmbeddedDive from '@site/src/components/EmbeddedDive';
import SQLExampleEditor from '@site/src/components/SQLExampleEditor';

## Explore the data

The interactive dashboard below runs semantic search over the Kaggle Movies sample dataset. Use it as a starting point for your own [Dives](/key-tasks/ai-and-motherduck/dives/).

<EmbeddedDive
  diveId="3428c1b0-3805-488c-85fd-a707ed818cf1"
  title="Kaggle Movies"
  height="700px"
/>

## About the dataset

This dataset is a subset of the [Kaggle Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset), containing over 40,000 movie titles and overviews. It also includes pre-computed 512-dimensional vector embeddings (generated with OpenAI's `text-embedding-3-small` model) for both the title and overview fields, making it useful for experimenting with [semantic search](/key-tasks/ai-and-motherduck/text-search-in-motherduck/) in MotherDuck.

## How to query the dataset

The dataset is available as the `sample_data.kaggle.movies` table in the `sample_data` database, which is automatically attached to every MotherDuck account.
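
To confirm that the database is attached and to inspect the table before running the examples, a quick check along these lines may help. This is a minimal sketch that assumes only the `sample_data.kaggle.movies` path used throughout this page.

```sql
-- List the columns and their types for the movies table.
DESCRIBE sample_data.kaggle.movies;

-- Count the rows; the dataset description says to expect over 40,000.
SELECT count(*) AS movie_count
FROM sample_data.kaggle.movies;
```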

## Example queries

### Browse movies

<SQLExampleEditor>{`
SELECT title, overview
FROM sample_data.kaggle.movies
LIMIT 10;
`}</SQLExampleEditor>

### Find similar movies using vector search

Use the pre-computed embeddings together with the [`embedding`](/sql-reference/motherduck-sql-reference/ai-functions/embedding/) function to find movies similar to a search query:

<SQLExampleEditor>{`
SELECT
    title,
    overview,
    array_cosine_similarity(
        overview_embeddings,
        embedding('a space adventure with aliens')
    ) AS similarity
FROM sample_data.kaggle.movies
WHERE overview IS NOT NULL
ORDER BY similarity DESC
LIMIT 10;
`}</SQLExampleEditor>
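
The `similarity` column is a cosine similarity, which measures the angle between two vectors: 1 for vectors pointing the same way, 0 for perpendicular vectors, and -1 for opposite ones. The sketch below shows the three cases with hand-written 3-dimensional arrays. The values are illustrative only and are not real embeddings.

```sql
-- Each list literal is cast to a fixed-size FLOAT[3] array,
-- because array_cosine_similarity needs two arrays of the same length.
SELECT
    array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [2.0, 0.0, 0.0]::FLOAT[3])  AS same_direction,
    array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [0.0, 1.0, 0.0]::FLOAT[3])  AS perpendicular,
    array_cosine_similarity([1.0, 0.0, 0.0]::FLOAT[3], [-1.0, 0.0, 0.0]::FLOAT[3]) AS opposite;
```

Because the score depends on direction rather than vector length, `ORDER BY similarity DESC` puts the closest matches first.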

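### Find movies similar to another movie

Look up one movie's overview embedding in a CTE, then rank every other movie by cosine similarity to it. The `LIMIT 1` keeps the CTE to a single row in case more than one movie shares the title: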

<SQLExampleEditor>{`
WITH target AS (
    SELECT overview_embeddings
    FROM sample_data.kaggle.movies
    WHERE title = 'The Matrix'
    LIMIT 1
)
SELECT
    m.title,
    m.overview,
    array_cosine_similarity(m.overview_embeddings, t.overview_embeddings) AS similarity
FROM sample_data.kaggle.movies m, target t
WHERE m.title != 'The Matrix'
ORDER BY similarity DESC
LIMIT 10;
`}</SQLExampleEditor>

## Schema

| Column Name           | Column Type | Description                                                     |
|-----------------------|-------------|-----------------------------------------------------------------|
| title                 | VARCHAR     | Movie title                                                     |
| overview              | VARCHAR     | Short description or synopsis of the movie                      |
| title_embeddings      | FLOAT[512]  | Pre-computed vector embedding of the title                      |
| overview_embeddings   | FLOAT[512]  | Pre-computed vector embedding of the overview                   |
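
Both embedding columns can be searched the same way. As a sketch, the query below swaps `overview_embeddings` for `title_embeddings` to match a search phrase against titles rather than synopses. The search phrase is only an illustration.

```sql
-- Rank movies by how closely their title embedding matches the search phrase.
SELECT
    title,
    array_cosine_similarity(
        title_embeddings,
        embedding('haunted house')
    ) AS similarity
FROM sample_data.kaggle.movies
WHERE title IS NOT NULL
ORDER BY similarity DESC
LIMIT 10;
```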

