Mehdi demonstrates an experimental feature of DuckDB: running PySpark code with the DuckDB engine ⚡ Note: This is not yet supported on MotherDuck.
Transcript
0:00 I began working with Apache Spark back at version 1.0, and yes, that places me in the dinosaur category of the data world. Switching to Apache Spark was a breath of fresh air in terms of developer workflow compared to the Hadoop MapReduce jobs back then: it's like you had to make your own bricks and mix the cement by hand, while with Spark it's like
0:22 you get pre-made bricks and cement mix and you just have to put them together. Wait a minute, is that the reason why it's called Databricks? Anyway, Spark facilitates the development of data pipelines, but that was a decade ago, and now we are in a different era with different compute power and frameworks, so maybe it's time for a change. The DuckDB team has started
0:42 the work of offering Spark API compatibility. Bottom line: the same PySpark code base, but DuckDB under the hood. So let's quack about it. So what are the challenges when developing with Apache Spark? Well, Spark has been designed to work on a cluster, and when dealing with small to medium data, having the network overhead of a cluster makes no sense given the power
1:08 of current machines. When you think of it, Spark was built around that time, and if you look at the machines available nine years back, the specs were ridiculously low compared to what you can have today. So the reality is that for a lot of Spark pipelines, especially daily or incremental workloads, we don't need that many
1:28 resources, and certainly not that many nodes. So Spark ends up running at the minimum setup, creating a lot of overhead. There are other frameworks available for data processing, but teams have often tried to enforce Spark everywhere to simplify their code base, and that makes sense because it also reduces complexity by limiting the number of data processing frameworks. However, there are
1:51 two reasons why you sometimes want a lightweight setup, meaning a single-node Apache Spark with small resource requirements: first, small pipelines, as we said, typically daily or hourly workloads; and second, the local development setup, for unit, integration, and end-to-end tests, and also CI pipelines. So given this context, the minimum specification provided by cloud providers for serverless
2:17 Spark often implies a two-node cluster. Let's take some concrete examples. A serverless Spark product like AWS Glue requires a minimum configuration of two DPUs. One standard DPU provides 4 vCPUs and 16 GB of RAM, billed per second with a one-minute minimum billing duration. That means that at minimum you have 32 GB of RAM
2:43 with 8 vCPUs that you pay for, plus you will always pay for at least one minute, and one minute is really long in our era; you could subscribe to this channel in 10 seconds, I'm just saying. Google Cloud serverless Dataproc has roughly the same numbers. I'll provide the links to both pricing pages in the description so that you can check them yourself.
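To make that minimum-billing floor concrete, here is a back-of-the-envelope sketch; the per-DPU-hour price is an illustrative assumption, so check the pricing links in the description for your region's actual rate.

```python
# Back-of-the-envelope cost of a minimal serverless Spark (AWS Glue-style) run.
# PRICE_PER_DPU_HOUR is an assumed example figure, not a quoted price.
PRICE_PER_DPU_HOUR = 0.44   # USD, illustrative
MIN_DPUS = 2                # minimum configuration: 2 DPUs = 8 vCPUs, 32 GB RAM
MIN_BILLED_SECONDS = 60     # one-minute minimum billing duration


def min_charge(runtime_seconds: float) -> float:
    """Cost of a run that finishes in `runtime_seconds`, given the minimums."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return MIN_DPUS * (billed_seconds / 3600) * PRICE_PER_DPU_HOUR


# A 10-second job is billed exactly like a 60-second one.
print(min_charge(10) == min_charge(60))   # True
```

Whatever the exact rate, the point is the floor: however small the job, you always pay for two DPUs and at least a full minute.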
3:04 Note that Databricks has offered a single-node option since late 2020, but it's not really a full serverless Spark offering and it has some limitations. Long story short, because of its initial design, the cloud offers little to no option for running a small, lightweight Spark job. All right, let's talk about Java. We touched on the
3:25 resource efficiency of small Spark jobs, but let's not overlook the hurdle of managing the Spark artifacts themselves, especially for PySpark, as you basically need both Python and Java. Java is like that ex who keeps popping up in your life just when you thought modern languages had moved on. A common way to package Spark artifacts is through a
3:46 container, and it's actually challenging to keep the size under 600 megabytes uncompressed; if you look at the official PySpark image, it's a bit less than 1 GB uncompressed. On the other side, because DuckDB can simply be installed as a Python package and is built in C++, it's really light: a base Docker image that includes Python, like this one,
4:09 will only take 216 megabytes uncompressed. Of course we can make both sides more efficient, but that gives you an idea of how much you could save via your base container image. So why does that matter? Cutting down on container image size may seem minor, but it is linked to many effects: larger images lead to longer CI times for building,
4:33 pulling, and pushing, which leads to higher costs and longer development time, because these actions also need to be done locally when developing or when pushing a PR, which leads to less productivity. It's also important to note the startup time difference between a Python script and an Apache Spark job: because Apache Spark relies on the JVM, there is always a cold start of between
4:57 5 and 10 seconds. Again, minor, but it makes Python script execution faster and therefore impacts the overall development time in iterative processes. A trending architecture today is that people put their data on object storage like AWS S3 (the data lake or lakehouse), leveraging open file formats like Parquet or table
5:20 formats like Delta Lake, Hudi, or Iceberg, and then they use a SQL engine, whether it's a cloud data warehouse or something else. For SQL users, switching to a different compute engine, assuming the SQL dialect is compatible of course, is starting to be a reality, especially with the usage of frameworks like dbt. dbt offers many different adapters, so it's really easy to run the
5:42 same SQL code base against different compute engines, and you have a clear separation from the storage layer in the object store. So why wouldn't it be possible for Apache Spark to use a different execution engine under the hood with the same code base? That's the goal of DuckDB's PySpark API compatibility. So let's dive in. Here
6:02 we're going to cover a really simple example. For this demo you only need Docker Desktop or any local container service compatible with standard container commands; something like Rancher Desktop would work too. As I said in the intro, the DuckDB team has released, as part of the DuckDB v0.9 release, an experimental PySpark API compatibility. You can find the complete code base below
6:26 in the description, and I'll take a short break right now so that you have time to clone it. Oh, I see, you're lazy. Okay, I'll do all the work. First we need some data, and we'll be using the open Hacker News dataset that MotherDuck is hosting. The data is hosted on S3; it's about one Parquet file of 1 GB, and we
6:47 download it locally. So just use the command make data and grab a hot beverage while the data downloads; you should then have the data located in the data folder. So let's look at our PySpark job. The PySpark script contains a conditional import that looks for an environment variable so that we can easily switch engines, either to DuckDB or pure PySpark, while the rest of the script remains the same PySpark code.
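As a rough sketch of that switch (the environment variable name, the file path, and the exact module layout are illustrative; the repo linked in the description is the reference):

```python
import os

# Hypothetical switch: USE_DUCKDB and the data path are illustrative names,
# not necessarily the ones used in the demo repo.
if os.environ.get("USE_DUCKDB", "false").lower() == "true":
    # DuckDB's experimental Spark-compatible API (DuckDB >= 0.9)
    from duckdb.experimental.spark.sql import SparkSession
else:
    # Regular PySpark
    from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# From here on, the pipeline is the same PySpark-style code for both engines.
df = spark.read.parquet("data/hacker_news.parquet")
df.show(10)
```

The demo pipeline then aggregates average scores to answer the posting-frequency question; the engine choice stays entirely contained in that import block.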
7:10 In this pipeline, we are looking at whether posting more on Hacker News gets you a higher score on average. So first we're going to run make duck-spark, which just calls the Python script in a Docker container. In
7:32 the container definition, we simply install the Python duckdb package and set the environment variable that enables DuckDB to true, and here is the resulting timing. Now let's run the same code base against pure PySpark with make pyspark. This also uses a container image, the official one, and here are the timing results for each engine, plus the data
7:55 results, of course. As you can see, there is no need to worry about under-posting on Hacker News, as the algorithm doesn't necessarily favor those who post more. And when it comes to performance, it's obvious that DuckDB is faster for this pipeline. Of course, this video isn't a comprehensive benchmark for local processing, and you shouldn't pick your
8:16 tool based only on benchmarks; they are always biased. That being said, for a more realistic comparison, you can check out Niels Claeys' blog post on using DuckDB instead of Spark in dbt pipelines; I'll put the link in the description. He did a better job of benchmarking, using the TPC-DS benchmark, a standard in the industry for comparing database
8:39 performance. The takeaway from the demo is that it's faster, lighter, stronger, with just an import condition in our PySpark job. At the time of this video, the API only supports reading from CSV, Parquet, and JSON formats, so it's not quite ready for real pipeline usage, as writing functions are still missing, among other things, and the number of available functions is limited.
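For reference, reading those formats through the DuckDB-backed session looks the same as in PySpark; the paths below are placeholders:

```python
from duckdb.experimental.spark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The three reader formats reported as supported at the time of the video.
events = spark.read.csv("data/events.csv")
posts = spark.read.parquet("data/hacker_news.parquet")
logs = spark.read.json("data/logs.json")
```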
9:08 However, you could start using it for unit testing, because unit testing functions in Spark often involves reading a sample dataset and checking a transformation function in memory, with no writing needed. You could use similar logic to switch between DuckDB and Spark when you run local tests, to speed them up.
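A minimal sketch of that testing idea, under the same assumptions as before (the USE_DUCKDB variable, the fixture path, and the transformation function are hypothetical, and DataFrame method coverage on the DuckDB side depends on the version):

```python
import os

# Pick the engine with the same hypothetical switch used in the pipeline.
if os.environ.get("USE_DUCKDB", "false").lower() == "true":
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col
else:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col


def keep_high_scores(df, threshold=100):
    """Transformation under test (hypothetical): keep well-scored posts."""
    return df.filter(col("score") > threshold)


def test_keep_high_scores():
    spark = SparkSession.builder.getOrCreate()
    # Read a small sample fixture; no writing is needed for the assertion.
    df = spark.read.parquet("tests/fixtures/sample_posts.parquet")
    result = keep_high_scores(df, threshold=100)
    assert result.count() <= df.count()
```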
9:32 So integrating Spark with DuckDB can accelerate the development process and, in the future, help simplify pipelines, reducing the overhead and cost associated with minimal Spark clusters. We've seen how bypassing the JVM can make pipelines with small data faster and more cost-efficient, especially around development, CI, and execution. Finally, this PySpark API in DuckDB marks a significant milestone, as it is the first Python code in DuckDB, and
9:58 because it's Python, contributing is much easier. So dive into the existing code base, I'll put the links in the description, and explore the open issues; your input and contributions can make a difference. So finally, it looks like Spark can quack. I'll see you soon in the next one.
Related Videos

2025-11-20
Data-based: Going Beyond the Dataframe
Learn how to turbocharge your Python data work using DuckDB and MotherDuck with Pandas. We walk through performance comparisons, exploratory data analysis on bigger datasets, and an end-to-end ML feature engineering pipeline.
Webinar
Python
AI, ML and LLMs

2025-11-19
LLMs Meet Data Warehouses: Reliable AI Agents for Business Analytics
LLMs excel at natural language understanding but struggle with factual accuracy when aggregating business data. Ryan Boyd explores the architectural patterns needed to make LLMs work effectively alongside analytics databases.
AI, ML and LLMs
MotherDuck Features
SQL
Talk
Python
BI & Visualization

2025-09-24
DuckDB At Scale
DuckDB is loved by SQL-ophiles for small data workloads. How do you make it scale? What happens when you feed it Big Data? What is DuckLake? This talk answers these questions from real-world experience running DuckDB in the cloud.
MotherDuck Features
SQL
Talk
BI & Visualization
Python

