---
sidebar_position: 3
title: Kaggle Movies
description: A dataset of over 40,000 movies with titles, overviews, and pre-computed embeddings for semantic search.
---

import EmbeddedDive from '@site/src/components/EmbeddedDive';

## Explore the data

This interactive dashboard runs semantic search over the Kaggle Movies sample dataset. Use it as a starting point for your own [Dives](/key-tasks/ai-and-motherduck/dives/).

<EmbeddedDive
  diveId="3428c1b0-3805-488c-85fd-a707ed818cf1"
  title="Kaggle Movies"
  height="700px"
/>

## About the dataset

This dataset is a subset of the [Kaggle Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset), containing over 40,000 movie titles and overviews. It also includes pre-computed 512-dimensional vector embeddings (generated with OpenAI's `text-embedding-3-small` model) for both the title and overview fields, making it useful for experimenting with [semantic search](/key-tasks/ai-and-motherduck/text-search-in-motherduck/) in MotherDuck.

## How to query the dataset

This dataset is available as the `kaggle.movies` table in the `sample_data` database, which is automatically attached to every MotherDuck account. Refer to it by its fully qualified name, `sample_data.kaggle.movies`.
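
As a quick check that the table is reachable from your session, the following sketch inspects its columns and counts its rows. It assumes only the `sample_data.kaggle.movies` table name used throughout this page; the `movie_count` alias is illustrative.

```sql
-- List the columns and their types.
DESCRIBE sample_data.kaggle.movies;

-- Count the rows in the table.
SELECT count(*) AS movie_count
FROM sample_data.kaggle.movies;
```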

## Schema

| Column Name           | Column Type | Description                                                     |
|-----------------------|-------------|-----------------------------------------------------------------|
| title                 | VARCHAR     | Movie title                                                     |
| overview              | VARCHAR     | Short description or synopsis of the movie                      |
| title_embeddings      | FLOAT[512]  | Pre-computed vector embedding of the title                      |
| overview_embeddings   | FLOAT[512]  | Pre-computed vector embedding of the overview                   |

## Example queries

### Browse movies

```sql
SELECT title, overview
FROM sample_data.kaggle.movies
LIMIT 10;
```

### Find similar movies using vector search

Use the [`embedding`](/sql-reference/motherduck-sql-reference/ai-functions/embedding/) function to embed a search phrase at query time, then rank movies by the cosine similarity between that vector and the pre-computed overview embeddings:

```sql
SELECT
    title,
    overview,
    array_cosine_similarity(
        overview_embeddings,
        embedding('a space adventure with aliens')
    ) AS similarity
FROM sample_data.kaggle.movies
WHERE overview IS NOT NULL
ORDER BY similarity DESC
LIMIT 10;
```
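
To get a feel for the `similarity` score, here is a minimal sketch that applies `array_cosine_similarity` to toy 2-dimensional vectors instead of the 512-dimensional embeddings. The vectors and column aliases are made up for illustration. Scores close to 1 mean the vectors point in nearly the same direction, which is why the query above sorts by `similarity DESC`.

```sql
-- Toy 2-dimensional vectors stand in for the 512-dimensional embeddings.
SELECT
    array_cosine_similarity([1.0, 0.0]::FLOAT[2], [1.0, 0.0]::FLOAT[2])  AS same_direction,      -- 1.0
    array_cosine_similarity([1.0, 0.0]::FLOAT[2], [0.0, 1.0]::FLOAT[2])  AS orthogonal,          -- 0.0
    array_cosine_similarity([1.0, 0.0]::FLOAT[2], [-1.0, 0.0]::FLOAT[2]) AS opposite_direction;  -- -1.0
```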

### Find movies similar to another movie

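This query looks up the overview embedding of one movie and ranks every other movie by cosine similarity to it:

```sql
-- Look up the overview embedding of the reference movie.
WITH target AS (
    SELECT overview_embeddings
    FROM sample_data.kaggle.movies
    WHERE title = 'The Matrix'
    LIMIT 1  -- keep a single row if several movies share the title
)
SELECT
    m.title,
    m.overview,
    array_cosine_similarity(m.overview_embeddings, t.overview_embeddings) AS similarity
FROM sample_data.kaggle.movies AS m
CROSS JOIN target AS t
WHERE m.title != 'The Matrix'  -- exclude the reference movie itself
ORDER BY similarity DESC
LIMIT 10;
```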
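
The query above relies on an exact title match, and `LIMIT 1` picks an arbitrary row if several movies share that title. As a rough sketch, you could first list the candidate titles with a case-insensitive pattern match and then copy the exact title you want into the `target` CTE. The `'%matrix%'` pattern is only an example.

```sql
-- List candidate titles before picking one as the comparison target.
SELECT title, overview
FROM sample_data.kaggle.movies
WHERE title ILIKE '%matrix%'
ORDER BY title;
```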

