Where Data Science Meets Shrek: How BuzzFeed uses AI

2025/01/21

Leveraging AI for Creative Content at BuzzFeed

BuzzFeed's data team, led by Gilad, has integrated large language models and generative AI capabilities into their products and toolkits to enhance creative processes rather than replace human workers. The team focuses on using AI advances to build tools and create experiences that wouldn't be possible otherwise, enabling readers to participate more deeply in media experiences.

Text Understanding Through Embeddings

The Challenge of Content Classification

Understanding what content is truly about goes beyond simple named entity recognition. BuzzFeed publishes diverse content that often defies traditional categorization, such as an article about an Indiana woman who brought her pet raccoon to a fire station fearing it had overdosed on marijuana. Stories like this one require an understanding of content that standard topic taxonomies cannot provide.

From Universal Sentence Encoders to Modern Embeddings

BuzzFeed transitioned from using Google's Universal Sentence Encoders and DistilBERT (which retains 97% of BERT's performance while being 40% smaller and 60% faster) to modern embedding approaches. They now use Nomic embeddings with 124 million parameters to create dense vector representations of their content.

These embeddings serve as foundational infrastructure for:

  • Content clustering and similarity detection
  • Recommendation systems
  • Trend analysis and business intelligence
  • Understanding audience consumption patterns
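
As a rough illustration of how dense vectors support similarity detection and recommendations, the sketch below ranks articles by cosine similarity. The article IDs and 4-dimensional vectors are invented for the example, and `most_similar` is a hypothetical helper, not BuzzFeed's code. Real embeddings would come from an embedding model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id, embeddings, top_k=2):
    """Rank every other article by cosine similarity to the query article."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors, invented for illustration only.
article_embeddings = {
    "raccoon-fire-station": [0.9, 0.1, 0.0, 0.2],
    "dog-rescued-by-firefighters": [0.8, 0.2, 0.1, 0.1],
    "celebrity-red-carpet": [0.0, 0.9, 0.3, 0.0],
    "quiz-which-pasta-are-you": [0.1, 0.2, 0.9, 0.1],
}

neighbors = most_similar("raccoon-fire-station", article_embeddings)
for article_id, similarity in neighbors:
    print(f"{article_id}: {similarity:.3f}")
```

The same pairwise score can feed a clustering step or a "more like this" recommendation list. At production scale an approximate nearest-neighbor index would typically replace the exhaustive loop.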

AI-Powered Visual Content Creation

The Viral Barbie Dreamhouse Experiment

BuzzFeed's editorial team began experimenting with Midjourney to create images relevant to trending topics. Sarah, a content creator, researched and crafted prompts representing each U.S. state to generate "Barbie's Dreamhouse in Every State." The post went viral organically on Instagram and TikTok, with one TikTok video alone garnering over 13 million views.

Building Custom Image Generators

Following early successes, BuzzFeed developed in-house capabilities using:

  • Stable Diffusion XL (SDXL) models
  • Low-Rank Adaptation (LoRA) for efficient fine-tuning
  • APIs like Replicate for model hosting

This technology powers interactive generators including:

  • Shrek Generator: Allows users to "Shrek-ify" anyone
  • Moo Deng Generator: Creates images of the viral baby hippo in various scenarios
  • Mormon Wives Generator: Face-swaps users into themed content
  • Medieval Pet Generator: Transforms pet photos into medieval artwork
  • AI Emoji Contest: Enables users to create custom emojis
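
The Low-Rank Adaptation (LoRA) technique listed above keeps a model's original weight matrix W frozen and trains only two small matrices B and A, so the adapted weight is W + (alpha / r) * B A, where r is the adapter rank. The sketch below shows that arithmetic on a tiny 4x4 matrix with made-up values. It is a minimal illustration of the idea, not how BuzzFeed or any SDXL training library implements it.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

# Frozen 4x4 base weight (an identity matrix, purely for illustration).
d = 4
w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# Rank-1 adapter: B is 4x1 and A is 1x4. Values are made up.
b = [[0.5], [0.0], [0.0], [0.0]]
a = [[0.0, 2.0, 0.0, 0.0]]

w_adapted = lora_effective_weight(w, b, a, alpha=1.0)

full_params = d * d                                     # values in the full matrix
lora_params = len(b) * len(b[0]) + len(a) * len(a[0])   # trainable adapter values
print(full_params, lora_params)
```

The saving looks small here, but it grows quickly with layer size: for a square layer of width d, the adapter trains 2 * d * r values instead of d * d, which is what makes fine-tuning a separate adapter for each themed generator inexpensive.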

Data-Driven Content Optimization

Historical A/B Testing Data as Training Material

BuzzFeed has collected years of A/B testing data from their Bayesian-based testing system, which evaluates different headline and image combinations. This historical dataset includes:

  • Multiple headline variants
  • Performance metrics (click-through rates)
  • Winning combinations
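
The source does not describe the internals of the testing system, so the following is only a generic sketch of one common Bayesian approach: model each variant's click-through rate with a Beta posterior and estimate the probability that each variant is the best by sampling. The click counts, variant names, and the uniform Beta(1, 1) prior are all assumptions made for illustration.

```python
import random

def probability_each_is_best(variants, draws=5000, seed=0):
    """Estimate P(variant has the highest true CTR) via posterior sampling.

    variants maps a name to (clicks, impressions). Each CTR gets a
    Beta(1 + clicks, 1 + non-clicks) posterior, i.e. a uniform prior.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in variants}
    for _ in range(draws):
        sampled = {
            name: rng.betavariate(1 + clicks, 1 + impressions - clicks)
            for name, (clicks, impressions) in variants.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}

# Invented counts for three headline variants of one article.
observed = {
    "headline_a": (120, 2000),  # 6.0% observed CTR
    "headline_b": (165, 2000),  # 8.25% observed CTR
    "headline_c": (110, 2000),  # 5.5% observed CTR
}

p_best = probability_each_is_best(observed)
best = max(p_best, key=p_best.get)
print(best, p_best)
```

Each finished test of this kind yields a set of variants, their metrics, and a winner, which is the shape of record the historical dataset described above would contain.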

AI-Powered Headline Generation

The team trained a model using Hugging Face's Accelerate and Transformers libraries to predict winning headlines. The workflow involves:

  1. Generating 16 candidate headlines using Claude
  2. Running headlines through the trained model
  3. Using a "Battle Royale" style competition where groups of four compete
  4. Identifying the predicted best-performing headline
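
The steps above can be sketched as a small elimination tournament. This assumes that each group of four produces one winner and that winners advance until a single headline remains. The source does not spell out the bracket structure, so that is an assumption. `pick_winner` is a hypothetical stand-in for the trained model, and the candidate strings are placeholders rather than real generated headlines.

```python
def battle_royale(headlines, pick_winner, group_size=4):
    """Split candidates into groups, keep each group's winner, and repeat.

    pick_winner takes a list of headlines and returns the one predicted
    to perform best. Returns the final winner and the number of rounds.
    """
    contenders = list(headlines)
    rounds = 0
    while len(contenders) > 1:
        groups = [
            contenders[i:i + group_size]
            for i in range(0, len(contenders), group_size)
        ]
        contenders = [pick_winner(group) for group in groups]
        rounds += 1
    return contenders[0], rounds

def toy_pick_winner(group):
    """Placeholder judge: prefers the headline with the highest trailing number."""
    return max(group, key=lambda headline: int(headline.rsplit(" ", 1)[1]))

# Sixteen placeholder candidates standing in for LLM-generated headlines.
candidates = [f"Candidate headline {i}" for i in range(1, 17)]

winner, rounds = battle_royale(candidates, toy_pick_winner)
print(winner, rounds)
```

With 16 candidates and groups of four, this structure needs two rounds: four first-round groups, then one final group of the four winners.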

This approach combines BuzzFeed's unique historical data with large language model capabilities to create diverse, high-performing headlines while helping writers learn new approaches.

The Future of AI in Media

BuzzFeed views current AI applications as the beginning of a significant transformation in media creation and consumption. Rather than using these advances to create "subpar versions of the same static, unimaginative things," the company focuses on discovering unique possibilities within this new medium. Their philosophy parallels early television broadcasters who initially just recorded plays before discovering the medium's true potential.

The team's approach emphasizes:

  • Continuous learning and testing
  • Relentless iteration
  • Finding unique applications for AI technology
  • Creating interactive experiences that engage audiences

Early experiments have been well received: users actively engage with these AI-powered interactive formats, which suggests a strong audience appetite for content experiences that blend human creativity with AI capabilities.

