Where Data Science Meets Shrek: How BuzzFeed uses AI
2025/01/21
Leveraging AI for Creative Content at BuzzFeed
BuzzFeed's data team, led by Gilad, has integrated large language models and generative AI capabilities into their products and toolkits to enhance creative processes rather than replace human workers. The team focuses on using AI advances to build tools and create experiences that wouldn't be possible otherwise, enabling readers to participate more deeply in media experiences.
Text Understanding Through Embeddings
The Challenge of Content Classification
Understanding what content is truly about goes beyond simple named entity recognition. BuzzFeed publishes diverse content that often defies traditional categorization - like an article about an Indiana woman who brought her pet raccoon to a fire station fearing it had overdosed on marijuana. These unique stories require sophisticated understanding beyond standard topic taxonomies.
From Universal Sentence Encoder to Modern Embeddings
BuzzFeed moved from Google's Universal Sentence Encoder and DistilBERT (which retains about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster) to more recent embedding models. The team now uses Nomic embeddings, produced by a model with 124 million parameters, to create dense vector representations of its content.
These embeddings serve as foundational infrastructure for:
- Content clustering and similarity detection
- Recommendation systems
- Trend analysis and business intelligence
- Understanding audience consumption patterns
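As a rough illustration of how dense vectors support similarity detection and recommendations, the sketch below ranks articles by cosine similarity. It is a minimal example under stated assumptions: the article IDs, the three-dimensional vectors, and the helper names `cosine_similarity` and `most_similar` are all hypothetical, and real embeddings would come from the embedding model rather than being typed in by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Rank every other article by cosine similarity to the query article."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors (hypothetical values, not real model output).
toy_embeddings = {
    "raccoon-fire-station": [0.9, 0.1, 0.2],
    "dog-rescued-by-firefighters": [0.8, 0.2, 0.1],
    "celebrity-red-carpet": [0.1, 0.9, 0.3],
}

if __name__ == "__main__":
    for article_id, score in most_similar("raccoon-fire-station", toy_embeddings):
        print(f"{article_id}: {score:.3f}")
```

The same nearest-neighbor ranking can feed clustering or recommendations. At production scale a vector index would typically replace the linear scan shown here.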
AI-Powered Visual Content Creation
The Viral Barbie Dreamhouse Experiment
BuzzFeed's editorial team began experimenting with Midjourney to create images relevant to trending topics. Sarah, a content creator, researched and crafted prompts representing each U.S. state to generate "Barbie's Dreamhouse in Every State." The post went viral organically on Instagram and TikTok, with one TikTok video alone garnering over 13 million views.
Building Custom Image Generators
Following early successes, BuzzFeed developed in-house capabilities using:
- Stable Diffusion XL (SDXL) models
- Low-Rank Adaptation (LoRA) for efficient fine-tuning
- APIs like Replicate for model hosting
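To make the "efficient fine-tuning" claim for LoRA concrete, the sketch below shows the core idea on a single linear layer: the pretrained weight matrix W stays frozen, and only two small matrices A and B are trained, with their product added to W's output. This is an illustrative sketch, not BuzzFeed's training code; the layer sizes, rank, and function names are assumptions chosen for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    W is d_out x d_in and stays frozen; A is rank x d_in and
    B is d_out x rank, and only A and B are trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights when only A (rank x d_in) and B (d_out x rank) are updated."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_in, d_out, rank = 1024, 1024, 8
    full = full_finetune_params(d_in, d_out)
    low_rank = lora_params(d_in, d_out, rank)
    print(f"full: {full}, LoRA: {low_rank}, ratio: {low_rank / full:.4f}")
```

B is commonly initialized to zeros, so the adapted layer starts out identical to the pretrained one and drifts only as training updates A and B.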
This technology powers interactive generators including:
- Shrek Generator: Allows users to "Shrek-ify" anyone
- Moo Deng Generator: Creates images of the viral baby hippo in various scenarios
- Mormon Wives Generator: Face-swaps users into themed content
- Medieval Pet Generator: Transforms pet photos into medieval artwork
- AI Emoji Contest: Enables users to create custom emojis
Data-Driven Content Optimization
Historical A/B Testing Data as Training Material
BuzzFeed has collected years of A/B testing data from their Bayesian-based testing system, which evaluates different headline and image combinations. This historical dataset includes:
- Multiple headline variants
- Performance metrics (click-through rates)
- Winning combinations
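The source describes the testing system only as "Bayesian-based", so the details below are an assumption. One common way to compare two headline variants is a Beta-Binomial model: each variant's click-through rate gets a Beta posterior, and sampling from both posteriors estimates the probability that one variant beats the other. The function name, the uniform prior, and the click counts are hypothetical.

```python
import random


def prob_b_beats_a(clicks_a, views_a, clicks_b, views_b, draws=20000, seed=0):
    """Estimate P(CTR_B > CTR_A) by sampling from Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each click-through rate, so the
    posterior for a variant is Beta(1 + clicks, 1 + views - clicks).
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        rate_b = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: headline A got 50 clicks in 1000 views, B got 80.
    print(f"P(B beats A) = {prob_b_beats_a(50, 1000, 80, 1000):.3f}")
```

A record like this, with the variants, their click counts, and the eventual winner, is the kind of labeled example that can later serve as training data for a headline-ranking model.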
AI-Powered Headline Generation
The team trained a model using Hugging Face's Accelerate and Transformers libraries to predict winning headlines. The workflow involves:
- Generating 16 candidate headlines using Claude
- Running headlines through the trained model
- Using a "Battle Royale" style competition where groups of four compete
- Identifying the predicted best-performing headline
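The workflow above can be sketched as a small elimination tournament: 16 candidates are split into four groups of four, each group's winner advances, and the four survivors meet in a final group. This is a minimal sketch of that bracket structure only. The `judge` callable is a hypothetical stand-in for the trained model, and `toy_judge` (which simply picks the longest headline) is a placeholder with no predictive value.

```python
def battle_royale(candidates, judge, group_size=4):
    """Run rounds of group eliminations until one headline remains.

    `judge` takes a list of headlines and returns the predicted winner;
    in the real workflow the trained model would fill that role.
    """
    survivors = list(candidates)
    if not survivors:
        raise ValueError("need at least one candidate headline")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    while len(survivors) > 1:
        groups = [
            survivors[i:i + group_size]
            for i in range(0, len(survivors), group_size)
        ]
        survivors = [judge(group) for group in groups]
    return survivors[0]


def toy_judge(group):
    """Placeholder judge: picks the longest headline in the group."""
    return max(group, key=len)


if __name__ == "__main__":
    # Hypothetical candidates; in the workflow these come from an LLM.
    candidates = [f"Candidate headline number {i}" for i in range(1, 17)]
    print(battle_royale(candidates, toy_judge))
```

With 16 candidates and groups of four, the loop runs exactly two rounds: 16 to 4, then 4 to 1.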
This approach combines BuzzFeed's unique historical data with large language model capabilities to create diverse, high-performing headlines while helping writers learn new approaches.
The Future of AI in Media
BuzzFeed views current AI applications as the beginning of a significant transformation in media creation and consumption. Rather than using these advances to create "subpar versions of the same static, unimaginative things," the company focuses on discovering unique possibilities within this new medium. Their philosophy parallels early television broadcasters who initially just recorded plays before discovering the medium's true potential.
The team's approach emphasizes:
- Continuous learning and testing
- Relentless iteration
- Finding unique applications for AI technology
- Creating interactive experiences that engage audiences
Early experiments have proven successful, with users actively engaging with these AI-powered interactive formats, demonstrating strong audience appetite for innovative content experiences that blend human creativity with AI capabilities.