TL;DR: Jordan Tigani (ex-Google BigQuery founding engineer) argues that the era of "big data" is over—hardware got bigger, working sets stayed small, and 99.5% of queries can run on a laptop. The future is simple, fast, single-node systems.
What Is Big Data?
Big data = anything too big to run on a single machine. That's when you need distributed systems, different storage patterns, coordination overhead.
The origin story: In the early 2000s, data was growing faster than hardware. A 1GB CSV file could crash your laptop. A $100K server wasn't enough, but the $1M server was 10x more expensive, not 10x bigger.
How Google Changed Everything (For Better and Worse)
Three papers broke people's brains:
- Google File System: Scalable storage from commodity parts
- MapReduce: Split computation into map and reduce steps
- BigTable: "Database" without ACID guarantees
Yahoo turned these into Hadoop (HDFS, MapReduce, HBase).
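To make the map/reduce split concrete, here's a minimal sketch of the two-step shape in plain Python, using the canonical word-count example (no Hadoop involved; the `shuffle` function stands in for what the framework does between the two phases):

```python
from collections import defaultdict

# Map: turn each input record into (key, value) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group values by key (Hadoop does this between map and reduce).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: collapse each key's values into a final result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["small data is big", "big data is dead"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'small': 1, 'data': 2, 'is': 2, 'big': 2, 'dead': 1}
```

Because map runs independently per record and reduce independently per key, both phases can be spread across as many cheap machines as you like; that's the linear scaling the papers promised.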
The benefit: Linear scaling. Add more $1,000 boxes instead of buying $100K machines. Fault tolerance—any node can fail.
The cost: The Big Data Tax.
The Big Data Tax
Overhead of running distributed systems:
- Latency: Jobs take 1-10 minutes just to start
- Complexity: Tens of millions of lines of code for coordination
- Cost: Same work, just parallelized—still expensive
That famous petabyte query? Cost $5,800 to run. Fast, but not cheap.
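Back-of-the-envelope check, assuming BigQuery's long-standing on-demand list price of about $5 per TB scanned: the query scanned 1.09 PB ≈ 1,100 TB, and 1,100 TB × $5/TB ≈ $5,500, which lines up with the $5,800 figure from the talk.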
The Cloud Tax
BigQuery, Snowflake, and modern warehouses added another layer:
- 100ms round-trip to the cloud on every operation
- Microservices that fail independently
- BigQuery's minimum query time was 1.5 seconds (now faster)
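To make the round-trip cost concrete: at 100 ms per trip, a workload issuing 1,000 sequential operations spends 1,000 × 0.1 s = 100 s on network latency alone, before any actual work. That's why chatty, row-at-a-time transactional patterns stop working against a cloud warehouse.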
The Key Insight: Separation of Storage and Compute
"The key driver of cost and performance is the size of the HOT data."
- You might have a petabyte on S3
- But you're only querying the last 7 days
- Compute scales with the working set, not total data size (see the sketch after this list)
- Working set size is rarely big data
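A toy cost model makes the point. This is a hypothetical sketch assuming scan-based pricing of $5 per TB; the data sizes and partition layout are made up for illustration:

```python
# Hypothetical scan-based pricing: you pay for bytes scanned, not bytes stored.
PRICE_PER_TB_SCANNED = 5.00  # assumed on-demand rate, in dollars

def query_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB_SCANNED

total_data_tb = 1000.0    # a petabyte sitting in object storage
daily_partition_tb = 0.5  # each day's partition is ~500 GB
hot_days = 7              # queries only touch the last week

full_scan = query_cost(total_data_tb)                 # the "petabyte query"
hot_scan = query_cost(daily_partition_tb * hot_days)  # the typical query

print(f"full scan: ${full_scan:,.2f}")  # full scan: $5,000.00
print(f"hot scan:  ${hot_scan:,.2f}")   # hot scan:  $17.50
```

Storage is cheap enough that the petabyte can sit there; what you pay for day to day is the few terabytes the queries actually touch.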
The Revenge of the Single Node
| Year | Machine | "Big Data" Threshold |
|---|---|---|
| 2008 | Laptop | 1 GB was too much |
| 2025 | Laptop | 1 TB runs fine |
| 2025 | Server | 10+ TB before distribution may be worth it |
DuckDB Labs founder: "I've been running this 1TB dataset on my laptop."
Analysis of the Redshift (Redset) and Snowflake query datasets: 99.5% of queries could run on a laptop. Only about 1 in 500 users run actual "big data" queries.
Small Data Principles
- Latency matters? → Do work close to the user
- Cost matters? → Do work where it's cheapest
- Simplicity > Scalability → Premature scaling is premature optimization
- Simpler is faster → DuckDB's secret: no distributed coordination
- Think outside the cloud → Combine local, edge, and cloud (sketched below)
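As a sketch of what "do the work on your laptop" looks like in practice, here's DuckDB's Python API running an analytical query directly over a local Parquet file; the file name and columns are hypothetical:

```python
import duckdb  # pip install duckdb

# No cluster, no warehouse, no upload: query the file where it sits.
con = duckdb.connect()  # in-memory database
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'                  -- hypothetical local file
    WHERE event_date >= DATE '2025-01-01'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(top_users)
```

The same query can point at Parquet on S3 via DuckDB's httpfs extension, which is one way to combine local compute with cloud storage.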
The Spark vs. Snowflake 100TB Benchmark Drama
Databricks bragged about the fastest 100TB TPC-DS benchmark result. Snowflake called them cheaters. Tigani's reaction:
"At BigQuery, literally nobody runs queries that big. Our largest customers queried 1-10TB at most."
Reactions to "Big Data Is Dead"
- Hacker News: "You're wrong" (they say that about everything)
- Databricks execs: Strongly disagreed
- Most people: "I agree, but I have big data" (narrator: they didn't)
- Some people: "Thank you—this validates my experience"
"These are our people. This is why we're doing this conference."
Transcript
0:00 [Music]
0:16 Morning, everybody. Thanks for coming out on this sunny San Francisco morning. The idea behind the Small Data conference is actually credited to Bob, the CEO of Weaviate. We were at a VC conference somewhere and he said, "What do you think of doing a small data conference? Everybody talks
0:38 about big data conferences, but they're always so serious. We could do something fun, something a little bit different." We started thinking about it: I bet we could get a really good class of speakers, people who would just love to talk about what's
0:55 possible when the scale of your data isn't actually its most important feature. And if we can get really good speakers, I bet we could get really good attendees, really interesting folks who believe in this small data mission and would come
1:14 join us. That was about a year ago, and that's when the idea behind this was born. This is the first one; we don't know if there's going to be another one, so if you're enjoying it, please let us know. We
1:31 already have some speakers we didn't have space for this time, or who weren't able to make it, who we'd love to have next year. So maybe we will do one next year, but only if we get enough excitement and feedback
1:48 from people. Today I'm going to talk a little bit about the background around small data and its foil, which obviously is big data. It's an area I've been working in for the last 15 years or so, and something I've been interested
2:09 in for a long time. Of course, I was on the other side of the equation for several years, but it really did start out there: before big data was a thing, there was a real problem. Data sizes were growing, and
2:29 people had a hard time handling them. Before I get into that, though, I want to talk about what big data actually is, because it helps to have a shared definition. People bat around
2:45 various ideas about what big data means, but to me it's anything that's too big to run on a single machine, because that's when you have to do things differently: change how you perform your computations, how you store data, how you move data. A couple of examples
3:03 from the early days: I had a job interview once, and part of it was a take-home problem. They wanted me to process this giant CSV file and get some result from it. It was one gigabyte, which sounds
3:23 laughably small these days, but I wasn't able to get it to run on my laptop; they had to give me a sample that was only 10% of the data so I could run it locally. Now, that company was a small startup with 20 people,
3:41 but they were processing huge amounts of data per day, and at that time 100 gigabytes was a massive amount of data. They had their advertising pixel on basically 10% of internet traffic, so they were getting logs from 10% of the whole internet,
3:58 and that was 100 GB a day, a fantastically large amount of data. One of the lessons I learned from working at that company was: hey, even small companies have big data. Big data is going to be a thing for everyone. And so the problem we were
4:16 seeing was really that the cost of hardware was nonlinear. If you ran out of space on your single server machine and needed to put more disks in it, you had to buy a bigger machine, but that machine was a lot more complicated and a lot more expensive. It wasn't twice as expensive; it was
4:36 ten times as expensive. And when you couldn't stuff any more disks or memory or CPUs into that machine, you had to buy the next one up, and that was ten times as expensive,
4:51 and the next one was ten times as expensive again, until you reached a point where you couldn't grow any bigger. A lot of companies were making a ton of money selling these incredibly beefy hardware systems. So Google turned this whole thing on its head, and I think they ended
5:07 up breaking people's brains in how they thought about data and scaling, with a series of three papers. The first was Google File System. GFS let you have a giant, scalable
5:26 storage system built out of commodity parts, out of inexpensive pieces, so you didn't have to buy those really expensive machines. Then MapReduce came out, and it argued that you could separate every computation into a series of two steps, a map step and a reduce step, and if you did that, you
5:50 could scale those out really efficiently, which they were able to do at Google. And lastly there was Bigtable. Bigtable said: hey, you can have a scalable database, as long as you squint and don't really need some of the things a database usually
6:07 needs, like ACID transactions, consistency, etc. People had been banging their heads against the wall, so these all sounded like good outcomes. All this stuff got wrapped up by some folks at Yahoo, including Doug Cutting, and turned into
6:30 the open-source version: Hadoop. The Hadoop version of GFS was HDFS, the open-source version of MapReduce was the main Hadoop execution engine, and on the database side, the Bigtable equivalent was effectively
6:50 HBase. These really changed how people built and scaled systems. If you think of it in a slightly stylized architectural way: instead of buying single boxes, single nodes, that were bigger, you just added more machines. A medium cluster
7:13 essentially had two of these boxes, so two $1,000 boxes was $2,000, and a larger one with six was just $6,000. It helped incredibly with cost, and you could effectively scale this up infinitely. So the advantages are that you get linear scaling, and you get fault
7:33 tolerance, because one of the problems with a $100,000 machine is what happens when it fails: then you need a second $100,000 machine as a backup. The nice thing about this is that any of these machines can fail and you can just replace them. And
7:53 they're really good at throughput. Of course, there was a trade-off between throughput and latency: you could push a lot of data through these systems, but latency was bad, so if you wanted to get an answer, it
8:09 might take you minutes. The other thing was complexity: there were a lot of additional mechanisms that had to go along with this, tens of millions of lines of code to make it actually work. This introduced the Big Data tax, which I'd define as
8:26 the overhead of running big data systems. The areas where it shows up are latency (things slow down) and system complexity (a lot more moving parts), and while in this example it did reduce some of the costs, as we'll see in a bit it doesn't
8:46 actually change the cost structure, because you're still doing the same amount of work, or even more work. As an example of what the Big Data tax is, I won't go through all of these, but there's just a bunch of stuff that has to happen that's not actually doing the work. None of these steps are doing
9:03 the work that you're trying to get done; they're basically all machinery. What that meant in practice was that a Hadoop job would take one to ten minutes. Now, you might say nobody's using Hadoop these days, that Hadoop was the bad old days and it was
9:17 sort of a technological dead end. Well, Google came out with two new papers, Dremel and Spanner. Dremel became BigQuery, and Spanner became Google Cloud Spanner. These really said: hey, we're going to take some of these underlying systems, the same ideas, and
9:39 actually build things that look closer to what you expect. Spanner is an actual, real ACID transactional database, and Dremel became an actual cloud data warehouse. But we also added a new tax. When you move everything to the cloud, there's a cloud tax. Maybe it's not as big a
9:59 deal, but everything you do has a transition to the cloud and back, so you're adding 100 milliseconds. That may not seem like a lot, but in the transactional database world, where you're used to doing thousands of transactions per second, it adds up, and it makes
10:17 certain architectures not work anymore. And inside BigQuery there are tons and tons of pieces, all these microservices, and all of them can fail and they roll out independently, so there's a complexity cost. At one point the minimum query time on BigQuery was one and a half
10:35 seconds. It's much faster now, but it's an example of the kinds of things that happen when you have to put all these pieces together. And of course the Big Data tax didn't go away: the fact that you had to deal with all of these complex distributed systems was still around. It
10:52 cost you, often seconds, and there was also the complexity cost in lines of code. One of the really important things in these scaling systems, these big data systems, was the separation of storage and compute. In the early days, Snowflake's
11:13 sales pitch was basically: we've got separation of storage and compute, and that's all you need. They claimed to have invented it; we were like, well, you did it two years after we did in BigQuery; and then some other people, I think IBM, said no, actually, we've
11:29 been doing this for a decade. Regardless of who actually invented it, these modern systems rely on it very heavily, and there are a couple of interesting points. First of all, once you separate storage and compute, you basically throw all your storage into this
11:47 giant, almost limitless object storage. The example there is S3: you don't really have to worry about how much data you have on S3, it all just works. Now, on the compute cluster side, some interesting things happen. From the
12:06 scaling perspective, storage tended to grow faster than compute did: the requirements for storing the data were a lot larger than the requirements for computation over the data. And this is actually where the seeds of "hey, maybe we don't need big data systems anymore" grew out of.
12:28 So anyway, I bought the big data thing hook, line, and sinker. I was at the startup company Mindset Media, and I was like: yes, we've got big data, so everybody's going to have big data, everybody's going to be Google scale. That may or may not have happened, but I was running around the
12:45 world giving talks, and one of the things I would do is run a petabyte query against BigQuery. I would scan this whole petabyte, and that was pretty revolutionary at the time. I think the first time I did it, I got a
13:04 round of applause mid-talk, which doesn't happen a lot in data talks. Thank you, thank you.
13:17 But that was pretty cool, and I'll get back to it in a second, because pretty much once these systems started coming out, you started to see some cracks. For example, if you were
13:33 going to pick a transactional database, what would you use? Anybody? SQLite? SQLite is great. Anybody else? Postgres, yes. I bet it rhymes with [unclear], but SQLite is also a fantastic example. Generally, in OLTP these days, big data is irrelevant. No-
13:58 SQL, which everybody was so excited about (hey, we don't need consistency, we don't need real transactions): if you look at all of those NoSQL databases on the DB-Engines ranking, it's pretty hard to see, but Postgres
14:18 keeps going up, and everybody else hasn't really gotten too close; some have actually done pretty well, but there's just a lot of energy behind Postgres, and people generally are using it. There's also a class of NewSQL databases, things
14:37 like Spanner, things like CockroachDB, things like Aurora, and it turns out that most of the time you don't have the scale that actually needs them. Postgres is fine. All right, if you're going to pick a data warehouse for a new greenfield workload, on the OLAP side, which one would
14:54 you use? Well, let's see.
14:59 We're hoping you'd say DuckDB. But on the other hand, the big data systems are still going pretty strong in OLAP. Snowflake had its IPO in 2020 and is still going strong, but some systems are catching
15:17 up. This is the DB-Engines ranking again; you can see that DuckDB has the same shape of curve as Snowflake, just translated by a couple of years. So, getting back to that one-petabyte query: it was processing 1.09 petabytes, and the
15:38 thing that I didn't tell you when I was running this query on stage was that it cost $5,800 to run. We could make it fast, but we couldn't make it inexpensive. And I think that's one of the key things: if the way you're scaling is just by
15:57 doing the same amount of work in parallel, then over time, as
16:03 you let the size grow, your costs are also going to grow, and there's some point at which people say: well, I just don't want to pay that much, or it isn't worth it, maybe I should pre-aggregate, maybe I should get rid of some of this data. It was a thousand times more work than it
16:20 would have been to read a terabyte. So very often, except in rare occasions, it's too expensive to do most of your computations over big data. There's another interesting piece that comes out of the separation of storage and compute, which is that the key driver of your cost and performance
16:45 in these big data systems that have separation of storage and compute is the size of the hot data. You might have a petabyte of data, but if you're actually only querying over 100 megabytes,
16:59 then that's all the compute you need. The working set size tends to be much, much smaller than the overall data size, and the compute size tends to scale with the working set; the total data size doesn't matter. Sometimes people say, well, I've got this huge data warehouse, and it's like:
17:19 well, how much of that do you actually use, and how big are your actual queries? The fact that you've got a petabyte of logs sitting on disk doesn't matter if all you're looking at is the last seven days. The working set size is rarely big data, and that's why
17:34 it's actually cost-efficient and cost-effective to do your computations. So what can we do about this? Well, we see the revenge of the single node: instead of having these multiple independent boxes, you can just have one big box. Over time, the size of hardware has increased a hundredfold since I was trying to run that gigabyte-scale
17:55 data set on my laptop; machine sizes are two orders of magnitude larger. So in general your working set can fit in a single node, and if you're on a single node, you don't have to pay the Big Data tax. There isn't all this coordination
18:12 overhead, there isn't all this complexity, and you can move faster; you can focus on the important things for your application, for what you're doing, rather than having to build these complex distributed systems. So data has turned out to be not as big as everybody thought. And on cost, you
18:34 know, running things in parallel doesn't make them less expensive; the petabyte query was $5,800. And as hardware got bigger, we don't necessarily have to pay the Big Data tax. So today, how big actually is big data? I was talking to the founder of DuckDB Labs yesterday, and he was saying, well, I've
18:52 been running this one-terabyte data set on my laptop. So clearly a terabyte you can still run on your laptop, and that's pretty huge: three orders of magnitude bigger than it used to be. On the server side, roughly, if it's larger than 10
19:12 terabytes, there's probably a good chance you'd want to use some big data systems. But that's also one of the interesting things: you just wait a little while, and the threshold for what is big data is going to keep moving further and further out, because these scaling increases aren't
19:31 stopping anytime soon. A little later we're going to talk with George Fraser, the CEO of Fivetran, who did some really interesting analysis work on
19:44 the Redset data set, which showed how much data people were actually using, and there's a Snowflake data set as well. One of the outcomes of what he was saying is: hey, 99.5% of these queries could run on a laptop. And then some analysis I did showed that only
20:01 one in 500 users runs queries that are actually big data. So if you're designing a system for the post-big-data world, what are some different things you can do? Well, instead of everything running in the cloud, you can actually do work on your laptop. In BigQuery we said:
20:19 hey, if you have big data, you want to move the compute to the data, because it's so expensive to move the data. But if you're not doing massive amounts of data, why not run it on your laptop? So here are some small data principles, things we've learned. If latency is
20:35 important, do the work as close to the user as possible. We actually have Ollama here, and the idea is: if latency is important for your machine learning models and for inference, why do you have to go back and forth to GPUs in the cloud?
20:51 Why not use the GPUs that are local? When cost is important, do the work where it's least expensive. Maybe that's in the cloud, maybe that's on the edge, maybe it's a different region; you can actually move the data to where it's effective to
21:08 run. Simplicity is better than scalability: they say in software engineering that premature optimization is the root of all bugs, and premature scaling is the root of all overcomplication. We've got the Turso folks here, who are productionizing SQLite;
21:26 that's a great example of that. Simpler is faster: I think this is one of the key things about DuckDB, and why it's so fast. It's like: hey, we're going to start with something simple, we're not going to build an elaborate distributed system, and we're just going to focus on making it great. And
21:43 then finally, think outside the cloud. It's not just "I'm going to run everything in the cloud": once you start building these architectures, they can combine local, they can combine edge, they can combine other pieces, and you can come up with really interesting architectures and really different ways of doing things. The thing that got
22:02 me onto this was when Databricks did some benchmarking and they were bragging about how they had the fastest TPC-DS result on a 100-terabyte data set. Snowflake said, well, you're cheating, and then Databricks said, no, no, you're lying, and it was
22:24 just this crazy fight. But the thing that was hilarious to me was that the data size they were using was 100 terabytes, and in my time at BigQuery, literally nobody ran queries that big. We had some of the largest customers in the world, and they were one to two
22:42 orders of magnitude smaller on their largest queries. So that said: hey, there's something weird here. So I published this "Big Data Is Dead" blog post, and I got a lot of feedback from it. It started out with "no, you're wrong," though it
23:03 turns out actually not that many people said that; it was mostly Hacker News, which I think will say that about anything, and a handful of Databricks execs who were convinced that I was
23:17 completely wrong. But there were also a lot of people who said, "I totally agree with what you said, but I've got big data," and I think this is the whole working-set-versus-total-data-size issue: most of these people probably don't have big data, but I don't want to fight them over it. There were
23:34 also some people who said, "Duh, big data was never really a real thing." Fine, I'm happy you agree; we can all just agree. And then a handful of people said, "Thank you. This validates the experience I've been having, and people haven't been saying this, and I'm so glad you're
23:53 saying it." And those are our people. Those people are really what's behind wanting to do this conference. So why did we come up with this conference? We really wanted this to be a kind of collaboration. We
24:11 didn't want to have a giant event; it wasn't about showing we could fill a giant auditorium. We wanted to have really awesome, amazing people who would have a great dialogue and have interesting things to say, and to provide an opportunity for them to meet each other and exchange ideas. How many people here are
24:33 startup founders? There's a ton of startup founders in this room, so there's a ton of ideas behind people putting together new tools and technologies, and that's super exciting. So with that, I just want to say thank you. Long live small data. We really
24:56 want to hear from you: what do you like about this? And please do talk to each other after this, and hopefully we can form some new bonds that build a movement around small data. Thank you.
25:16 [Applause]