Mehdi demonstrates an experimental feature of DuckDB: running PySpark code with the DuckDB engine ⚡ Note: This is not yet supported on MotherDuck.
Transcript
0:00 I began working with Apache Spark back at version 1.0, and yes, that places me in the dinosaur category of the data world. Switching to Apache Spark was a breath of fresh air in terms of developer workflow compared to the Hadoop MapReduce jobs back then: it's like you had to make your own bricks and mix the cement by hand, while with Spark it's like
0:22 you get pre-made bricks and cement mix and you just have to put them together. Wait a minute, is that the reason why it's called Databricks? Anyway, Spark facilitates the development of data pipelines, but that was a decade ago, and now we are in a different era with different compute power and frameworks, so maybe it's time for a change. The DuckDB team has started
0:42 the work of offering Spark API compatibility. Bottom line: the same PySpark code base, but DuckDB under the hood. So let's quack about it. So what are the challenges when developing with Apache Spark? Well, Spark has been designed to work on a cluster, and when dealing with small to medium data, having the network overhead of a cluster makes no sense given the power
1:08 of current machines. When you think of it, Spark was built around that time, and if you look at the machines available nine years back, the specs were ridiculously low compared to what you can have today. So the reality is that for a lot of Spark pipelines, especially daily or incremental workloads, we don't need that many
1:28 resources, and certainly not that many nodes. So Spark ends up running at the minimum setup, creating a lot of overhead. There are other frameworks available for data processing, but teams have often tried to enforce Spark everywhere to simplify their code base, and that makes sense because it also reduces complexity by limiting the number of data processing frameworks. However, there are
1:51 two reasons why you sometimes want a lightweight setup, meaning a single-node Apache Spark with small resource requirements: first, small pipelines, as we said, typically daily or hourly workloads; and second, the local development setup, for unit, integration, and end-to-end tests, and also CI pipelines. So given this context, the minimum specification provided by cloud providers for serverless
2:17 Spark often implies a two-node cluster. Let's take some concrete examples. A serverless Spark product like AWS Glue requires a minimum configuration of two DPUs. One standard DPU provides 4 vCPUs and 16 GB of RAM, billed per second with a one-minute minimum billing duration. That means that at minimum you have 32 GB of RAM
2:43 with 8 vCPUs that you pay for, plus you will always pay for at least one minute, and one minute is really long in our era; you could subscribe to this channel in 10 seconds, I'm just saying. Google Cloud serverless Dataproc has roughly the same numbers. I'll provide the links to both pricing pages in the description so that you can check them yourself.
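To make that minimum-billing floor concrete, here is a back-of-the-envelope sketch; the per-DPU-hour price is an illustrative assumption, so check the pricing links in the description for your region's actual rate.

```python
# Back-of-the-envelope cost of a minimal serverless Spark (AWS Glue-style) run.
# PRICE_PER_DPU_HOUR is an assumed example figure, not a quoted price.
PRICE_PER_DPU_HOUR = 0.44   # USD, illustrative
MIN_DPUS = 2                # minimum configuration: 2 DPUs = 8 vCPUs, 32 GB RAM
MIN_BILLED_SECONDS = 60     # one-minute minimum billing duration


def min_charge(runtime_seconds: float) -> float:
    """Cost of a run that finishes in `runtime_seconds`, given the minimums."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return MIN_DPUS * (billed_seconds / 3600) * PRICE_PER_DPU_HOUR


# A 10-second job is billed exactly like a 60-second one.
print(min_charge(10) == min_charge(60))   # True
```

Whatever the exact rate, the point is the floor: however small the job, you always pay for two DPUs and at least a full minute.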
3:04 Note that Databricks has offered a single-node option since late 2020, but it's not really a full serverless Spark offering and it has some limitations. Long story short, because of its initial design, the cloud offers little to no option for running a small, lightweight Spark job. All right, let's talk about Java. We touched on the
3:25 resource efficiency of small Spark jobs, but let's not overlook the hurdle of managing the Spark artifacts themselves, especially for PySpark, as you basically need both Python and Java. Java is like that ex who keeps popping up in your life just when you thought modern languages had moved on. A common way to package Spark artifacts is through a
3:46 container, and it's actually challenging to keep the size under 600 megabytes uncompressed; if you look at the official PySpark image, it's a bit less than 1 GB uncompressed. On the other side, because DuckDB can simply be installed as a Python package and is built in C++, it's really light: a base Docker image that includes Python, like this one,
4:09 will only take 216 megabytes uncompressed. Of course we can make both sides more efficient, but that gives you an idea of how much you could save via your base container image. So why does that matter? Cutting down on container image size may seem minor, but it is linked to many effects: larger images lead to longer CI times for building,
4:33 pulling, and pushing, which leads to higher costs and longer development time, because these actions also need to be done locally when developing or when pushing a PR, which leads to less productivity. It's also important to note the startup time difference between a Python script and an Apache Spark job: because Apache Spark relies on the JVM, there is always a cold start of between
4:57 5 and 10 seconds. Again, minor, but it makes Python script execution faster and therefore impacts the overall development time in iterative processes. A trending architecture today is that people put their data on object storage like AWS S3 (the data lake or lakehouse), leveraging open file formats like Parquet or table
5:20 formats like Delta Lake, Hudi, or Iceberg, and then they use a SQL engine, whether it's a cloud data warehouse or something else. For SQL users, switching to a different compute engine, assuming the SQL dialect is compatible of course, is starting to be a reality, especially with the usage of frameworks like dbt. dbt offers many different adapters, so it's really easy to run the
5:42 same SQL code base against different compute engines, and you have a clear separation from the storage layer in the object store. So why wouldn't it be possible for Apache Spark to use a different execution engine under the hood with the same code base? That's the goal of DuckDB's PySpark API compatibility. So let's dive in. Here
6:02 we're going to cover a really simple example. For this demo you only need Docker Desktop or any local container service compatible with standard container commands; something like Rancher Desktop would work too. As I said in the intro, the DuckDB team has released, as part of the DuckDB v0.9 release, an experimental PySpark API compatibility. You can find the complete code base below
6:26 in the description, and I'll take a short break right now so that you have time to clone it. Oh, I see, you're lazy. Okay, I'll do all the work. First we need some data, and we'll be using the open Hacker News dataset that MotherDuck is hosting. The data is hosted on S3; it's about one Parquet file of 1 GB, and we
6:47 download it locally. So just use the command make data and grab a hot beverage while the data downloads; you should then have the data located in the data folder. So let's look at our PySpark job. The PySpark script contains a conditional import that looks for an environment variable so that we can easily switch engines, either to DuckDB or pure PySpark, while the rest of the script remains the same PySpark code.
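As a rough sketch of that switch (the environment variable name, the file path, and the exact module layout are illustrative; the repo linked in the description is the reference):

```python
import os

# Hypothetical switch: USE_DUCKDB and the data path are illustrative names,
# not necessarily the ones used in the demo repo.
if os.environ.get("USE_DUCKDB", "false").lower() == "true":
    # DuckDB's experimental Spark-compatible API (DuckDB >= 0.9)
    from duckdb.experimental.spark.sql import SparkSession
else:
    # Regular PySpark
    from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# From here on, the pipeline is the same PySpark-style code for both engines.
df = spark.read.parquet("data/hacker_news.parquet")
df.show(10)
```

The demo pipeline then aggregates average scores to answer the posting-frequency question; the engine choice stays entirely contained in that import block.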
7:10 In this pipeline, we are looking at whether posting more on Hacker News gets you a higher score on average. So first we're going to run make duck-spark, which just calls the Python script in a Docker container. In
7:32 the container definition, we simply install the Python duckdb package and set the environment variable that enables DuckDB to true, and here is the resulting timing. Now let's run the same code base against pure PySpark with make pyspark. This also uses a container image, the official one, and here are the timing results for each engine, plus the data
7:55 results, of course. As you can see, there is no need to worry about under-posting on Hacker News, as the algorithm doesn't necessarily favor those who post more. And when it comes to performance, it's obvious that DuckDB is faster for this pipeline. Of course, this video isn't a comprehensive benchmark for local processing, and you shouldn't pick your
8:16 tool based only on benchmarks; they are always biased. That being said, for a more realistic comparison, you can check out Niels Claeys' blog post on using DuckDB instead of Spark in dbt pipelines; I'll put the link in the description. He did a better job of benchmarking, using the TPC-DS benchmark, a standard in the industry for comparing database
8:39 performance. The takeaway from the demo is that it's faster, lighter, stronger, with just an import condition in our PySpark job. At the time of this video, the API only supports reading from CSV, Parquet, and JSON formats, so it's not quite ready for real pipeline usage, as writing functions are still missing, among other things, and the number of available functions is limited.
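For reference, reading those formats through the DuckDB-backed session looks the same as in PySpark; the paths below are placeholders:

```python
from duckdb.experimental.spark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The three reader formats reported as supported at the time of the video.
events = spark.read.csv("data/events.csv")
posts = spark.read.parquet("data/hacker_news.parquet")
logs = spark.read.json("data/logs.json")
```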
9:08 However, you could start using it for unit testing, because unit testing functions in Spark often involves reading a sample dataset and checking a transformation function in memory, with no writing needed. You could use similar logic to switch between DuckDB and Spark when you run local tests, to speed them up.
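A minimal sketch of that testing idea, under the same assumptions as before (the USE_DUCKDB variable, the fixture path, and the transformation function are hypothetical, and DataFrame method coverage on the DuckDB side depends on the version):

```python
import os

# Pick the engine with the same hypothetical switch used in the pipeline.
if os.environ.get("USE_DUCKDB", "false").lower() == "true":
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col
else:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col


def keep_high_scores(df, threshold=100):
    """Transformation under test (hypothetical): keep well-scored posts."""
    return df.filter(col("score") > threshold)


def test_keep_high_scores():
    spark = SparkSession.builder.getOrCreate()
    # Read a small sample fixture; no writing is needed for the assertion.
    df = spark.read.parquet("tests/fixtures/sample_posts.parquet")
    result = keep_high_scores(df, threshold=100)
    assert result.count() <= df.count()
```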
9:32 So integrating Spark with DuckDB can accelerate the development process and, in the future, help simplify pipelines, reducing the overhead and cost associated with minimal Spark clusters. We've seen how bypassing the JVM can make pipelines with small data faster and more cost-efficient, especially around development, CI, and execution. Finally, this PySpark API in DuckDB marks a significant milestone, as it is the first Python code in DuckDB, and
9:58 because it's Python, contributing is much easier. So dive into the existing code base, I'll put the links in the description, and explore the open issues; your input and contributions can make a difference. So finally, it looks like Spark can quack. I'll see you soon in the next one.
Related Videos

2025-11-20
Data-based: Going Beyond the Dataframe
Learn how to turbocharge your Python data work using DuckDB and MotherDuck with Pandas. We walk through performance comparisons, exploratory data analysis on bigger datasets, and an end-to-end ML feature engineering pipeline.
Webinar
Python
AI, ML and LLMs

2025-11-19
LLMs Meet Data Warehouses: Reliable AI Agents for Business Analytics
LLMs excel at natural language understanding but struggle with factual accuracy when aggregating business data. Ryan Boyd explores the architectural patterns needed to make LLMs work effectively alongside analytics databases.
AI, ML and LLMs
MotherDuck Features
SQL
Talk
Python
BI & Visualization

2025-09-24
DuckDB At Scale
DuckDB is loved by SQL-ophiles for small data workloads. How do you make it scale? What happens when you feed it Big Data? What is DuckLake? This talk answers these questions from real-world experience running DuckDB in the cloud.
MotherDuck Features
SQL
Talk
BI & Visualization
Python

