YouTube Meetup Talk

Monte Carlo Simulations With DuckDB ft. James McNeill

2024/01/26 · Featuring: James McNeill

James McNeill talks about Monte Carlo simulations at the DuckDB meetup in Dublin on 23 January 2024! Take random samples of your data to get efficient statistical answers to your questions.

0:04 Okay, perfect. So hello everyone, I'm James; I'll introduce myself in a second. This is a talk about Monte Carlo sims in DuckDB. I'll explain why it's a bit crazy, maybe, in a little bit, but hopefully it'll be interesting. So first of all, a really massive thank you to Marcelo and Mehdi for putting this

0:23 together. I run a data meetup in Dublin, and it's really cool to see more technical data meetups here, especially since we've got a lot of people in a lot of different companies who use data all the time; if you look at the footprint of meetups in Dublin, I don't think it represents that size. So really well

0:41 done, and thank you so much for putting it together, it's amazing. Okay, so what am I going to try to communicate in this? Basically, I'm going to say what a Monte Carlo simulation is, why they're useful, why efficiency really matters in them, and then, finally, the actual thing that I want people to take

1:01 away from this is how DuckDB compares right now to other Python tools that you might use. So just a quick question: is anyone here familiar with what a Monte Carlo simulation is? Could you raise your hand? So there's a few people. Okay, cool, so

1:20 hopefully I'll explain that well. So, who am I? That's me. So what do I do?

1:27 I'm a quantitative data engineer at a company, Susquehanna International Group, which is a trading firm actually just down the road. I guess I should also put in a disclaimer that this is mostly from a hobby-side perspective rather than me representing the company or anything. That's my day job; I used to be an engineering manager at a

1:47 company called Canjura. Canjura actually had DuckDB running in production in some of our very important microservices, which I was lucky enough to be able to implement. That was in August of 2022, for a little bit over a year, and it was a real success story; that's kind of where most of my hands-on experience with DuckDB

2:05 is. I'm from a theoretical physics background, which is why the Monte Carlo stuff is maybe appealing to me, and I'm a DuckDB fan, and that's kind of it. So, okay, let's talk about what these Monte Carlo simulations are. Who here has seen the Star Wars movies? Okay, lots of hands. So there's this famous

2:27 scene in Star Wars Episode V where Han Solo is piloting the Millennium Falcon, trying to get away from the Empire, and C-3PO and Princess Leia and Chewbacca are all on the ship. Han Solo, trying to get away, flies into an asteroid field, and C-3PO says the possibility of successfully navigating the asteroid field is 3,720 to 1, and, you know, looks very concerned.

2:52 Then Han says "never tell me the odds", or something like that, and they all fly away. So my question is: how does C-3PO get to that number? Maybe what C-3PO is doing is going, well, let me think about what might happen: I know how big asteroids are, I know how fast they

3:11 move, I know how quickly a human can react, and so I'm going to make a simulation in my head with all of these probabilities of what might happen. So maybe C-3PO will go, oh well, we'll just run into an asteroid; or maybe C-3PO is like, oh, a TIE fighter will shoot us; and maybe we'll get right to the end and

3:30 then we'll go into another asteroid, or possibly we'll get away, right? And so what C-3PO could do to estimate that probability at the start is basically run through all of these possible simulations in his head and then look at how many times each outcome happened: in this case, the Millennium Falcon escaping versus how many times it blew up. And so

3:53 that is basically what a Monte Carlo simulation is. So here, for C-3PO to get that number, he's going to simulate what's going to happen and then look for this sort of observable thing at the end. And the reason it's very helpful is, you know, there are a lot of probabilities involved; it's kind of complicated to put it all together and

4:12 get the mathematical answer. So yeah, basically, what is a Monte Carlo simulation? It's kind of an approach for solving very difficult or impossible maths problems, fundamentally. It relies on random sampling. So if you can see here, this isn't my plot at all, I put the link down below, but this is something to estimate pi by throwing dots at a

4:34 one-by-one square, as if it was a dartboard, and you can see that as you increase the sample, you get a denser and denser distribution on this thing, and you can use that to estimate the area and therefore pi, this mathematical constant.
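
As a rough illustration (my sketch, not code from the talk), the dartboard estimator he's describing looks something like this in NumPy:

```python
import numpy as np

def estimate_pi(n: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi: throw n darts at a 1x1 board and
    count the fraction that land inside the inscribed circle."""
    rng = np.random.default_rng(seed)
    xy = rng.uniform(-0.5, 0.5, size=(n, 2))   # n darts on the board
    inside = (xy ** 2).sum(axis=1) <= 0.25     # within the radius-0.5 circle
    return 4.0 * inside.mean()                 # circle/square area ratio

print(estimate_pi(1_000_000))                  # ~3.14, noisier for small n
```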

4:56 Where is it useful? Well, I did a bit of Googling and made a collage of everything. So in the top left you've got theoretical physics, then cosmology; top middle and top right are computer graphics; legal studies over here (I should use the pointer), legal studies here; like fluid

5:15 dynamics, finance. It's kind of everywhere, right? So yeah, it's pretty much

5:22 everywhere, because a lot of the time we've got these very complicated mathematical things; we don't know how to solve them, but we want to solve them. Okay, why is efficiency important in these Monte Carlo simulations? Well, let's think back to that C-3PO example. We're getting all of these samples; maybe C-3PO does it 10,000 times, and it

5:44 comes out that the Millennium Falcon survives three in those 10,000 times. There's going to be an error associated with how likely that estimate is, and that error turns out to be this thing called the standard error of the mean. You can see that it depends (I'm not going to derive this) inversely on the square root of the

6:01 sample size, or n, that you're using: basically, as your sample gets bigger, intuitively your guess gets better, and so we really care about efficiency from that perspective. In a practical use case, what does efficiency mean? It means we care about a certain amount of accuracy, and so if we're running in the cloud, to get that accuracy we have to scale our sample points and spend more.
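
For reference, the quantity he's pointing at is the usual standard error of the mean, for n independent samples with per-sample standard deviation sigma:

```latex
% Standard error of the mean: the error shrinks with the square root of n,
% so halving the error costs roughly four times as many samples.
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
```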

6:22 So if we're more efficient, we can get more sample points, maybe without spending quite as much. If we're on premise, we have whatever resource we have, but we're often on a fixed time budget. So one of the examples Monte Carlo sims are useful for is predicting weather forecasts: you

6:41 know, with the weather forecast you've got to go on the news at one o'clock and say what the weather tomorrow is going to be, and so if you just have more efficiency, you can start to generate more and more of these samples. And so what we really want to do, as people writing these technologies, is maximize

6:57 the accuracy of the results that we're generating while minimizing the time that it takes to do it. Another aside that I was just thinking about as I was going through this stuff, where I tried to do some estimations: supercomputing. Monte Carlo sims are this very unknown thing, in my

7:13 opinion, but they're a really huge proportion of actual computation

7:21 in the world. And so I found this paper over here, which is a Nature Astronomy paper, looking at the

7:31 kind of emissions projections of Australian, you know, astrophysics workers, like research scientists, and yeah, supercomputing is a really dominating factor in their carbon emissions. I actually did a quick estimate: this is from UNESCO in 2018, and this is the number

7:51 of research scientists per million people. So if we multiply that out, say about half of them are using supercomputers for Monte Carlo, and use this number up here for the emissions, we get, roughly speaking, when we add some other people who aren't research scientists, about 200 million

8:11 tonnes annually of CO2 from supercomputing. To try to keep it in perspective, that's about five times what Ireland does. And so there are things other than technology where building efficiency into this sort of field is really helpful, in my opinion at least. So yeah, now let's come back to the question. Now, in a

8:31 parallel universe where Han Solo really cares about the probability, let's just assume that C-3PO is written in Python, because I'm a Python developer and everything's written in Python, right? So what should C-3PO use to generate this Monte Carlo sim? Well, if you ask ChatGPT, ChatGPT will say use

8:53 NumPy. You will also, if you start googling around, find some stuff about Numba, and, you know, Polars: if you just Google "fast stuff for Python", you'll get Polars a lot of the time. And kind of my question is, well, where does DuckDB fit in here? I actually at this point need to acknowledge that there

9:11 is someone in the States who has done this before, as in Monte Carlo in DuckDB: a guy called Jacob Matson. He's got a project called Modern Data Stack in a Box, and he's coming at it from a completely different angle, in my outside opinion. He's using Modern Data Stack in a Box as a proof of concept

9:28 for how you can put these open source tools together. One of the things that Monte Carlo is really good at is generating sample data volume, because you don't have to feed the database system data; you can just generate these random samples. So he's kind of viewing it from a scale perspective. There's a blog post on the DuckDB blog that mentions

9:45 Monte Carlo, but it never talks about how actually useful it is. So here I'm coming at it from a different angle. Let's say these are our four choices of tools; there's lots of other stuff that we could use, and really, possibly, we're not actually choosing a library here, we're choosing an implementation in a different language. NumPy is,

10:04 fundamentally, a lot of it, written in C; Numba is actually a just-in-time compiler, and it compiles Python down to machine code; Polars' execution engine is in Rust; and DuckDB is written in C++.
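
To make the Numba point concrete, here's a minimal illustration of the just-in-time idea (my sketch, not from the talk): the decorated loop below is compiled to machine code the first time it is called.

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def count_inside(n: int, seed: int) -> int:
    np.random.seed(seed)
    hits = 0
    for _ in range(n):
        x = np.random.random() - 0.5
        y = np.random.random() - 0.5
        if x * x + y * y <= 0.25:   # inside the radius-0.5 circle
            hits += 1
    return hits

print(4.0 * count_inside(1_000_000, 42) / 1_000_000)  # pi, via a plain loop
```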

10:25 Okay, so that's the context and that's the question; hopefully everyone understands a little bit about why all of these questions are important and useful and impactful. Okay, so how are we going to do this benchmark? Well, we're going to do some simulations, and when you Google it, the example is always to calculate pi, so we're going to do that, and then we're going to also simulate

10:45 the expected values from playing roulette in a casino. We're going to compare libraries based on their execution time and memory footprint, and our question is not to measure these things directly against each other; it's just to see: is DuckDB viable? Here are some disclaimers. I did this benchmarking on my laptop; it's quite a nice one,

11:04 an M3 Max with 48 gigs of RAM. I was kind of strict, so I said that any number should have five samples associated with it, so that we have some nice error bars. And yeah, there's some other stuff here about CPU computation. I think the important thing to note is that this really should not

11:22 be interpreted as "this is better than that"; it's more "does this belong in this discussion", effectively. That's what I want to talk about. Okay, so the pi calculation. This is the thing where, if I come all the way back to here, this is what we're doing: we're drawing n points and we're checking if

11:41 they're inside the circle. So if I come back here: we draw some set of points, between minus one and plus one in our case, and we check if the sum of their squares... or actually, no, we're going to draw points from minus 0.5 to plus 0.5 in the x axis, I guess, and minus 0.5 to plus 0.5 in the

12:04 y axis, and we're checking if they're basically inside this radius of one (maybe I've messed that up, actually, now that I think about it), but there's a formula we can do, right? You can write this in a SQL query using stuff like generate_series. I haven't actually written any of the SQL here because there's not an easy way of putting

12:19 it in the code. But what I want to talk about now is: let's just look at some results.
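
Since the SQL isn't on the slides, here's a hypothetical sketch of the kind of query he's describing (my reconstruction, not his benchmark code), using DuckDB's generate_series and random(), which returns uniform values in [0, 1):

```python
import duckdb

n = 10_000_000  # sample size; the talk's benchmark goes up to 1 billion
pi_estimate = duckdb.sql(f"""
    SELECT 4.0 * avg(
        CASE WHEN (x - 0.5) * (x - 0.5) + (y - 0.5) * (y - 0.5) <= 0.25
             THEN 1 ELSE 0 END) AS pi_approx
    FROM (SELECT random() AS x, random() AS y
          FROM generate_series(1, {n})) AS points
""").fetchone()[0]
print(pi_estimate)
```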

12:26 It's a really simple operation, and the answer is really clear: DuckDB is rubbish, we should all go home. DuckDB is taking 20 seconds here on the 1 billion data volume, versus Numba, which is very quick and takes about four and a half seconds. So these are all the numbers in here: the columns are

12:50 sample sizes, and the error bars are from the sampling. Okay, so this is a simple thing, and you would actually expect Numba to do well here, typically. And when you think about what DuckDB is good at doing, we always talk about it being a database, and this is why,

13:07 you know, when ChatGPT answers the question "should I use DuckDB?", it'll probably say no: DuckDB has all of these joins and group-bys and all this sort of stuff, and little of that is being used here. So it's not entirely unexpected that it's not doing so well. There are actually some nice things

13:22 we can see as to why it's not going so well. So this is memory consumption, as measured by a Python developer, so please do not immediately trust these numbers or any interpretations, but this is from a memory profiling library in Python; I think it's called memory-profiler or something.
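
For what it's worth, a measurement like that can be taken with the memory-profiler package; a minimal sketch (my example with a stand-in workload, not his harness):

```python
import numpy as np
from memory_profiler import memory_usage  # pip install memory-profiler

def workload():
    # stand-in job: a NumPy pi estimate large enough to show up in RSS
    xy = np.random.uniform(-0.5, 0.5, (50_000_000, 2))
    return ((xy ** 2).sum(axis=1) <= 0.25).mean()

# sample the process's memory every 100 ms while workload() runs
peak_mib = max(memory_usage((workload, (), {}), interval=0.1))
print(f"peak memory: {peak_mib:.0f} MiB")
```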

13:45 And so over these sample sizes, we can see that DuckDB is basically using about the same amount of memory each time, whereas when, let's say, NumPy scales, it scales very aggressively and uses up a lot of the available RAM. Similarly, if we look at the median, basically the shape of these charts, we can infer; I've

14:02 actually got a chart of that here, so we can see over time on this axis how everything comes up apart from DuckDB, which I've focused on here in this blue line. This is actually something we'll see again; I think it's to do with how DuckDB aggregates, and we'll come back to that at the end. So yeah, for the pi

14:22 case, DuckDB is maybe not so good. And yeah, that's a billion floating point numbers, actually two billion for this case, because we're doing it on an x-y plane, so we're generating a tuple each time. But yeah, they're floating point, and it's just a uniform distribution from zero to one, and DuckDB has an in-built

14:45 function to do that that's in the standard library. NumPy does something similar, I guess, and Python has its own thing. For Polars, which I haven't really spent a huge amount of time tuning up, there isn't that sort of thing, so there you need to use the NumPy implementation to

15:03 generate the data and then pass it into Polars, if that makes sense. Okay, cool. So our second case is this roulette simulation, and these are the rules, if someone hasn't played roulette (I've never played roulette in my life): basically, it's slightly unlikely for you to win, right? So this is a useless simulation; we

15:24 know that the actual reality is you should not play roulette, because the house wins. But the point of these things is that you could test different strategies if you wanted to. So maybe instead of saying it's a 10 euro payout either way, you could say: okay, what if I have to pay 1,000 euro in, but now

15:41 I get a 1,050 payout, and I lose 10 euro otherwise? How many times do I need to play to make that worthwhile? You can start to test out all of these sorts of strategies, and that's one of the reasons why these things are very helpful in finance, for example. So we have lots more

15:56 parameters in this one. We can sample many times, so we can have, say, 100 sets of players who start with a pot of money, and we can see how they do. We can also increase our data volume by going out to a different time horizon: maybe a thousand turns, 3,000 turns, 5,000 turns. And then there are two

16:18 different versions that I've benchmarked: one where we return every player's amount of money after every turn, and one where we basically say what the average is, and the error associated with that average, as we come back across the data set. And actually, here it's a much different story.
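
A minimal sketch of that setup (my own, with assumed rules: a 10 euro even-money bet on red, which wins with probability 18/37 on a European wheel), showing both the unaggregated and the aggregated outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_turns, stake = 10_000, 1_000, 10.0

# each player bets on red every turn; 18 of 37 pockets pay out
wins = rng.random((n_players, n_turns)) < 18 / 37
bankroll = np.cumsum(np.where(wins, stake, -stake), axis=1)

# "unaggregated": the full bankroll path, per player per turn (bankroll)
# "aggregated": mean and standard error across players at each turn
mean_path = bankroll.mean(axis=0)
stderr_path = bankroll.std(axis=0) / np.sqrt(n_players)
print(mean_path[-1], stderr_path[-1])  # mean drifts negative: the house wins
```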

16:41 So this is the execution time for a thousand-turn time limit, unaggregated, and we can already see that suddenly DuckDB is very competitive. And really, what we'll see here is that the main rivalry is between DuckDB and NumPy: NumPy is what ChatGPT, or your Google search, will answer as to what you should use. And so DuckDB in this case is

17:04 already better. In the aggregated

17:08 case it's slightly worse, but still comparable, and then it

17:16 kind of continues that trend when we expand it to a different time horizon, three thousand turns, and then finally, in the case of five thousand turns, we actually see it starts to do really well. So here it's significantly faster, and in my opinion, we have to always think back to sample size, right? These things are generally

17:37 linear, so being twice as fast means you can take twice as many samples, in our sort of cost-optimizing world, which means that your accuracy of measurement will go up significantly. So we can see that DuckDB is now suddenly faster on the aggregated side, and it's a lot faster on

17:59 the unaggregated side. There's actually a helpful visualization we'll see, but I guess one important thing is that these longer-term strategies, like 5,000 turns here in the case of this example that we're actually running (again, it's irrelevant, we know the answer; you can see the downwards trend if you look at the numbers these

18:17 are generating), are very helpful when you have strategies that maybe don't converge as quickly, and when it's statistically quite noisy, it's meaningful to have accelerations at this sort of scale. Let's look at the memory profile. So this is the 200,000 samples up to a

18:35 3,000-turn limit, so you can multiply those numbers together to see how many floating point values we're using here, roughly speaking. So that's, I guess, yeah, what, like

18:48 800 million? Um, no, maybe not; 600 million,

18:54 right? Maybe I'm missing a zero. So DuckDB, here in the blue, kind of does what we would expect: its memory footprint increases over time and then it dies away. NumPy comes up and then comes flying up. Polars is just kind of chilling out up here. Again, I haven't really tried to optimize these

19:13 guys, so I think these would be better if there was more effort put into them. But the comparison we need to focus on is that DuckDB does quite well, actually; in the aggregated case DuckDB performs the same way, so I think this memory thing is actually a red herring. But yeah, so this is a case

19:33 where I think DuckDB is almost as fast as NumPy, but its memory profile looks unpromising, and we'll actually come back to that at the end in terms of conclusions. But yeah, so in terms of scalability: okay, so maybe DuckDB is appropriate for this sort of stuff; how might we be able to

19:51 scale it? Well, everything is single-node the way I've done it, but Monte Carlo sims are fundamentally scalable, because you're just doing these random samples, so you can scale laterally extremely easily. You can spin up 10 boxes to do the same sampling, and it's exactly equivalent to doing it on one,

20:10 and they're completely independent, so you don't really have to care about the network overhead; there's no coordination needed.
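
That independence is the whole trick; a toy sketch of the lateral scaling he's describing (my own, not from the talk):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def batch_estimate(args):
    seed, n = args
    rng = np.random.default_rng(seed)        # independent stream per batch
    xy = rng.uniform(-0.5, 0.5, (n, 2))
    return ((xy ** 2).sum(axis=1) <= 0.25).mean()

if __name__ == "__main__":
    # ten "boxes" (here: processes), no coordination; combining the
    # results is just averaging the independent batches
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(batch_estimate,
                              [(s, 1_000_000) for s in range(10)]))
    print(4.0 * np.mean(parts))
```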

20:29 So yeah, there's not much CPU pressure in most of these; it's mostly memory pressure (this is me looking at the Activity Monitor on my Mac), and so I think you could very aggressively multithread if you wanted to on the DuckDB side, because there's not super high memory usage, even for these longer samples. And then, in terms of scalability, when we come to this, let's imagine we spin up 10 laptops to make 10 times as much data, and now we generate 10 times as many raw files. Now

20:48 there are some very nice features that DuckDB has, for example filter pushdown and distributed query behavior, where we can now aggregate these things up quite effectively.
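
For example (a hypothetical sketch; the file layout and column names are assumed, not from the talk), if each of those 10 laptops wrote its raw samples to Parquet, one DuckDB query can aggregate across all of them, with the filter pushed down into the scan:

```python
import duckdb

duckdb.sql("""
    SELECT turn, avg(bankroll) AS mean_bankroll
    FROM read_parquet('results/box_*.parquet')  -- glob over every box's files
    WHERE turn <= 5000                          -- filter pushed into the scan
    GROUP BY turn
    ORDER BY turn
""").show()
```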

21:06 So there are a few different DuckDB learnings; I'm trying to keep an eye on the time and I think I might need to step on it a little bit, but we've got this aggregation thing here, right, from the very flat memory profile. What I think is happening (and again, this is a guess from a Python developer) is probably that, because this data is completely ephemeral, DuckDB is computing it right to the end in

21:24 small batches, then keeping some small store at the end, and constantly doing that, so it's never holding a huge amount of memory like maybe NumPy is, as we can see. That just means it's more parallelizable in principle, because generally it's memory, or at least in these examples it's memory, that's constraining it, not CPU.

21:42 Also, just some very specific things: Arrow is the fastest for large data sets in terms of writing out results (I wanted to keep it fair, so I brought everything back to the Python interpreter at the end), while for smaller data sets, data frames are faster than Arrow for some reason. A few other things, just random bits that I

22:00 ran into: there's this DuckDB in-built standard deviation function (this is actually the DuckDB Web shell that Mehdi was talking about earlier), and if you use the standard deviation as just a function on exactly the same data set, versus writing it out by hand, there's quite a significant time difference, so

22:22 like 9,930 milliseconds versus 660. It's just one interesting thing; I guess there's some parallelization happening in one that's maybe not happening in the other in the query engine.
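
A hypothetical reconstruction of that comparison (not his exact queries): DuckDB's built-in stddev_samp() versus the same sample standard deviation spelled out by hand.

```python
import duckdb

# built-in aggregate
duckdb.sql("""
    SELECT stddev_samp(x) AS s
    FROM (SELECT random() AS x FROM generate_series(1, 10000000))
""").show()

# the same statistic written out by hand
duckdb.sql("""
    SELECT sqrt((sum(x * x) - sum(x) * sum(x) / count(*)) / (count(*) - 1)) AS s
    FROM (SELECT random() AS x FROM generate_series(1, 10000000))
""").show()
```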

22:42 Oh, oops, one other thing: this is the roulette one, and this is actually some SQL. DuckDB is great because it has a query planner and it'll optimize things, but sometimes there's a skill in actually doing stuff by hand as well. So this casino thing smells like window functions, but if you implement it as a window function (these are two exactly equivalent queries), this one comes out at 45 milliseconds; if

23:00 you implement it as a recursive CTE, it's 27 milliseconds, and again there are reasons for this that I won't go into.
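
A sketch of the recursive-CTE shape he means (hypothetical, reusing the assumed roulette rules from earlier): the bankroll is walked forward one turn at a time instead of being computed with a window function.

```python
import duckdb

duckdb.sql("""
    WITH RECURSIVE walk(turn, bankroll) AS (
        SELECT 0, 0.0
        UNION ALL
        SELECT turn + 1,
               bankroll + CASE WHEN random() < 18.0 / 37.0
                               THEN 10.0 ELSE -10.0 END
        FROM walk
        WHERE turn < 1000
    )
    SELECT turn, bankroll FROM walk ORDER BY turn
""").show()
```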

23:10 So, in conclusion: DuckDB is, woohoo, a viable option for Monte Carlo simulations, at least in my opinion. It is better in higher path-complexity situations, like these long time horizons we've seen. Something I haven't really talked about, but it's got this Python function API, so if you were running into issues where you're like, oh, I want to use this Python function but I can't because it's SQL: you kind of can. It might just be a little bit slower, and that's going to come back to the "your mileage may vary" point.
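
A minimal sketch of that Python function API (assumed usage, not from the talk): register a plain Python function and call it from SQL. It works, but rows cross the Python boundary, hence the "slower" caveat.

```python
import duckdb

def payout(won: bool) -> float:
    # any plain Python logic can live here
    return 10.0 if won else -10.0

con = duckdb.connect()
con.create_function("payout", payout)  # types inferred from annotations
con.sql("""
    SELECT sum(payout(random() < 18.0 / 37.0)) AS total
    FROM generate_series(1, 1000)
""").show()
```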

23:31 And yeah, this is a really important point, because I don't want to be harsh: I think NumPy in general will be one of the more compelling

23:46 of these libraries. I put lower effort into optimizing both Numba and Polars; I really focused in on the NumPy versus DuckDB comparison. And I guess the most important point is that if you want to run this benchmark suite on your own laptop, I've got it up here: mc-benchmark. I plan to make this more

24:07 mature over time, and ideally visualize it in a better way than just some random tables in my thing. And I guess, yeah, at some point I will probably publish a blog post somewhere (I have no idea where "somewhere" is going to be) when I know that there are fair comparisons, because when you Google

24:25 this stuff it's really hard to figure out what's what, and people will say "implement it in C++ yourself", which is crazy in my opinion. But yeah, that's pretty much it. So hopefully, whenever you're thinking about running a Monte Carlo simulation to figure out how you can get through an asteroid field, you'll use DuckDB in future.

24:43 But yeah, that's it for me. [Applause] [Music]
