DuckDB breaks the lakehouse? ft. Daniel Beach

June 5, 2026 · 47:06

Hosted by Mehdi Ouazza, Dumky de Wilde · With Daniel Beach (Data Engineering Central)

Daniel Beach (Data Engineering Central) joins Mehdi and Dumky for a tour through the open lakehouse renaissance: catalog commits bring multi-writer Delta + Unity Catalog + DuckDB without Spark, Cloudflare ships a unified data platform with an AI agent called Skipper, ADBC quietly replaces JDBC/ODBC, Aaron Francis launches SoloTerm, Addy Osmani names the orchestration tax, and Mitchell Hashimoto gets a 40x speedup with an agent in a loop.

Chapters

  • 00:00 — Cold open · who's on today
  • 01:43 — Catalog commits · Delta + Unity Catalog + DuckDB
  • 10:46 — Cloudflare unified data platform · Skipper agent
  • 17:53 — Education in the age of AI · language as interface
  • 24:39 — ADBC · the next-gen database connection
  • 30:17 — SoloTerm · the terminal in the browser
  • 38:56 — The orchestration tax · Addy Osmani
  • 40:38 — Mitchell Hashimoto · an agent in a loop
  • 45:50 — Databricks Zerobus · streaming into the lakehouse
  • 47:06 — Outro

Show notes

Mehdi Ouazza (MotherDuck DevRel) and Dumky de Wilde (MotherDuck DevEx) are joined by Daniel Beach of Data Engineering Central for a 47-minute rundown of the most interesting data + AI news of the past two weeks.

We open with the Delta Lake + Unity Catalog + DuckDB story, specifically the catalog commits feature that lets multiple DuckDB processes write to the same lakehouse table without stepping on each other. Daniel argues this is a quietly massive deal that nobody is talking about: it removes one of DuckDB's last real weaknesses for lakehouse workloads and breaks the Spark + Databricks dependency for concurrent writes.

From there: Cloudflare's new unified data platform and their AI agent Skipper, which builds extra context via MCP so non-data folks can self-serve. A Guillermo (Vercel CEO) tweet kicks off a thread on education in the age of AI — language as the new interface, and what kids actually need to learn. Daniel walks through ADBC, the Arrow-native replacement for JDBC/ODBC that's quietly powering a lot of modern tools.

Dumky shares SoloTerm from Aaron Francis, a terminal that runs in your browser, scriptable for tutorials and demos. Mehdi covers Addy Osmani's orchestration tax essay (humans as the single-threaded bottleneck), then Mitchell Hashimoto's now-famous post about an agent in a loop optimizing a renderer from 82ms to 2ms. We close with a teaser of Databricks Zerobus, streaming straight into the lakehouse without the Kafka glue layer.

Key takeaways

  • Catalog commits are "a big deal that nobody's talking about": multiple DuckDB processes can now write concurrently to the same Delta + Unity Catalog lakehouse table — one of DuckDB's last real lakehouse weaknesses, gone. Drop a DuckDB-in-a-Lambda where a tuned Spark pipeline used to be and that's real money saved.
  • Cloudflare is doing far more data engineering than the "CDN + tunnels" framing suggests: a unified data platform (R2 SQL + Iceberg) with Skipper, an AI data agent that uses MCP as the integration layer so non-data folks can self-serve.
  • Language is the new universal interface. And half the time, "skill-writing" an LLM for data work is just teaching it the vocabulary — row groups, min-max indexes, batch inserts — not writing rules.
  • ADBC quietly took over: JDBC/ODBC are being replaced under the hood by the Arrow-native standard. Snowflake, DuckDB, Databricks, and dbt Fusion all read Arrow natively now — no more row-by-row serialization, and most people don't even know they're using it.
  • SoloTerm (Aaron Francis) puts a scriptable terminal in the browser — perfect for tutorials, demos, and courses.
  • The orchestration tax (Addy Osmani): humans are the single-threaded bottleneck in multi-agent workflows. We can spin up 10 agents but can't review 10 PRs at once — the next bottleneck isn't tokens, it's attention.
  • "Let the loop cook": Mitchell Hashimoto ran an agent overnight optimizing his terminal renderer and got a 40x speedup (82ms → 2ms) — the most concrete proof point of agent-in-a-loop this month.
  • Databricks Zerobus teaser: a few lines of Python to stream straight into the lakehouse, no Kafka glue layer. Possibly the spiritual successor to Delta Live Tables — full coverage in E04.
  • Backstory drop: DuckLake almost got built by Daniel's old consulting team at Xebia, and DuckDB Labs asked Xebia for permission to use the name. Small world.

0:00Mehdi: Hello everybody and welcome to another episode of the Explain Analyze podcast, where we get all the latest news links around data and AI and just chat about it, rant about it, be happy about it, be depressed about it, whatever you're feeling today. I am with Dumky, who is also in DevRel at MotherDuck, and a special guest today, Daniel Beach. Daniel, how is it going?

0:27Daniel: Hey, good. Super happy to talk to you guys. I'm looking forward to hopefully not being depressed about anything. I don't know. I'm not sure what articles you brought.

0:28Mehdi: Yeah. Yeah, I'll see. I'll see what

0:36Dumky: Yeah, Mehdi has a wide range of emotions when he reads stuff. So let's see.

0:39Daniel: Okay, I'm worried.

0:41Mehdi: Yes. And also, I mean, it depends on when this episode is recorded, right? Sometimes I record it on Friday and it feels like the weekend is close. But anyway, can you quickly introduce yourself, Daniel, for people who have been living in a cave and don't know what kind of content you do?

0:53Daniel: Sure. Yeah, I've been doing content, data content, man, since maybe 2017, 2018. WordPress, Substack, yeah, LinkedIn. I do a little bit of YouTube, I haven't cracked that nut yet though. But yeah, I just write a lot about data things, pretty much everything under the sun, and try to be a little spicy, you know, keep things interesting. Yeah, that's me.

1:04Mehdi: One of the OGs. Yeah, cool. So you have your Substack, which is, what's the name again? Data Engineering Central. Yeah. We'll put the link, and that's a good point: all the links that we discuss today will be available in the description if you're watching on YouTube, and you can find them on motherduck.com slash explain analyze.

1:22Daniel: Mm-hmm. Data Engineering Central. Yeah, go check it out. Like and subscribe, share it.

1:43Mehdi: Pick your episode, get the links, and get notified about the next episode. So let's get started. First link, honor to the guest. You picked one around Delta Lake. It's actually on the website of Delta, the Linux Foundation project. I mean Delta Lake.

1:58Daniel: Yeah, I don't know. I don't know why Databricks does it. We all know this is a Databricks article, right? Just 'cause it says Delta Lake...

2:02Dumky: Yeah.

2:03Mehdi: But what is funny is that actually Ben is working, let me check. Ben is working at DuckDB Labs, which doesn't exist anymore. It's Duck Labs. They have been renaming, rebranding. So I guess it's a tight collaboration with Duck Labs, previously known as DuckDB Labs.

2:10Daniel: Mm. Mm-hmm. Yeah, this article has actually been open in my browser for quite a while. I have a half-written Substack article on this because I thought it was super interesting. Not so much the time travel, you know, but I feel like it's sort of hidden, because I would not have called this article this. I mean, if you scroll down a little bit in there, there's something called catalog commits, and it uses, we could talk about it, but it uses DuckDB to actually do some stuff. And I just feel like, I don't know, they should have named the article something else. I feel like it's a big deal that nobody's talking about, right? 'Cause if we think about the lakehouse architecture, what's one of the biggest complaints? I mean, part of the reason I think DuckDB got popular is that people are tired of running Spark clusters and paying for that to work on their lakehouse architecture that's supposed to be open source. And I think one of the biggest challenges I've seen, even personally at my job, is

3:04Mehdi: Mm-hmm.

3:13Daniel: you know, these lakehouse architectures, whether it's Delta Lake, Iceberg, etc., these are just file systems, right, stored in S3 or somewhere else. And you have this whole problem, I mean, if you came from a database background, you have these conflicting commits, right? How do we have multiple things writing to a lakehouse without corrupting it or reading the wrong thing? This classic problem. Which is a whole other topic, but yeah, this article kind of goes into open source Unity Catalog, and then if you scroll down a little bit it talks about something called... Yeah.

3:44Mehdi: Yeah. So just for context for the people listening to us, the title is "Delta Grows Up: Write, Unity Catalog and Time Travel." It talks about DuckDB's Delta and Unity Catalog extensions moving out of their experimental tag, and now we have writes, Unity Catalog, and time travel. You want me to scroll?

4:03Daniel: Yeah, it's not just writes. Yeah, scroll down. It's not just writes. I've been writing with DuckDB. At work I've actually kind of ripped out some Polars stuff and put DuckDB in its place. I've been able to use DuckDB to work with Unity Catalog Delta Lake tables for a while, but I feel like the big news here to me was these, what do they call them, catalog-managed tables, commits, basically catalog commits.

4:23Mehdi: Mm-hmm.

4:25Daniel: Yeah, I think it's down there: CC, catalog commits. They just kind of sneak it in there. And they don't really put it in the title, but I think the cool thing is, and they kind of show it here, you know, having multiple DuckDB instances write into this table, and it does the classic thing you would think, right? Like, this process inserts these records, then this other DuckDB process goes, okay, now I'm free, I can insert. It's this idea of being able to have multiple concurrent processes interact in a real way with your lakehouse, which typically, if people wanted to do that, you know, they've had to go on Databricks and use Spark and all that stuff, which is fine, but I think this just feels like it opens up the floodgates to me, I don't know.

4:52Mehdi: Yeah, multiple writers. Yeah, so the way I understand it is that there is multi-writer support for DuckDB there, indirectly, because that's always been

5:10Daniel: Mm.

5:14Mehdi: one of the biggest weaknesses of DuckDB, even if there are workarounds, right? There is always a workaround somehow. And that's one of them, with the lakehouse and Delta, etc. Dumky, you're... Yeah.

5:18Daniel: Mm-hmm. Yeah. Yeah, you can kind of see it right there at the bottom. It talks about, you know, launching twenty concurrent DuckDB instances inserting there. Yeah, I don't know. I feel like that's the coolest thing here. I'm surprised more people aren't talking about it. Think about just putting DuckDB inside a Lambda in AWS, right? You have files landing somewhere, right, and these Lambdas writing to tables. I don't know, it just seems like a big deal architecturally.
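
Note: the "okay, now I'm free, I can insert" behavior Daniel describes can be pictured as optimistic concurrency with a single arbiter. The sketch below is a toy model only. `ToyCatalog`, `try_commit`, and `run_writers` are hypothetical names, threads stand in for separate DuckDB processes, and none of this is the Unity Catalog or DuckDB API. It just shows why routing every commit through one catalog gives each writer its own table version.

```python
import threading


class ToyCatalog:
    """Hypothetical stand-in for a catalog that arbitrates table commits."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0   # latest committed table version
        self.log = []      # (version, writer, rows) in commit order

    def try_commit(self, expected_version, writer, rows):
        """Accept the commit only if the writer based it on the latest version."""
        with self._lock:
            if expected_version != self.version:
                return False  # another writer got there first
            self.version += 1
            self.log.append((self.version, writer, rows))
            return True


def writer(catalog, index, attempts):
    """Keep retrying until this writer's insert is committed."""
    while True:
        attempts[index] += 1
        seen = catalog.version  # read the current table version
        if catalog.try_commit(seen, f"w{index}", [index]):
            return              # committed; otherwise re-read and retry


def run_writers(n_writers=20):
    catalog = ToyCatalog()
    attempts = [0] * n_writers
    threads = [
        threading.Thread(target=writer, args=(catalog, i, attempts))
        for i in range(n_writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return catalog, attempts


if __name__ == "__main__":
    catalog, attempts = run_writers(20)
    print("final version:", catalog.version)
    print("total attempts:", sum(attempts))
```

Every writer ends up with exactly one committed version, and a writer that loses a race simply re-reads and tries again instead of corrupting the log.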

5:28Mehdi: Yeah.

5:34Dumky: Yeah, yeah.

5:46Mehdi: And Dumky, what's your take on Delta adoption? Because I think that's maybe why it didn't get as much noise and love as it should have.

5:57Dumky: It's hard. I do feel it's very much tied to the Databricks ecosystem, right? That being said, there are a lot of companies and organizations on Databricks that I would assume don't mind decoupling a little bit more from Databricks. So this is definitely a nice way to do that, I would say. It opens up a lot of options for

6:04Daniel: Mm-hmm.

6:21Dumky: either pulling data from different places or kind of exfiltrating data to different places. I think it takes, yeah, it takes some time to adopt this, I would say.

6:33Daniel: Yeah, I don't know. It's a classic thing where, you know, most data teams, yeah, they have stuff they need to use Spark for, Databricks for. They're using Databricks because they get an integrated platform, right? That's why people pick something like a Snowflake, MotherDuck, Databricks, whatever. You're picking this one integrated platform, but you know, sometimes it puts you in a box, right? You want to be able to be flexible with your data pipelines.

6:55Mehdi: Yeah, there is vendor lock-in in some ways anyway, I think, with any platform somehow.

7:00Daniel: Mm-hmm.

7:00Dumky: I do wonder about the people that are actively watching this, because I remember I was a little bit hesitant back when I was consulting. With the rise of something like DuckLake, and especially now with all the changing workflows due to AI adoption and that kind of stuff, there is a lot going on where you're like, maybe I'll sit this one out for a bit and just see in a year where we're going before I make any major decisions on our architecture. So that's maybe more of a general thing that's not necessarily tied to this piece, though I do think this could change some architectural decisions. But I think in general maybe people are waiting out those

7:45Daniel: Mm-hmm.

7:48Dumky: architectural decisions, especially in larger organizations, to kind of see where the dust settles a bit with the changes in data lakes. You think? Okay.

7:56Mehdi: I think it's changing. I think it's changing. I have a data point for that. Yeah. So I was at the ADEO group and the Decathlon group, which is one of the major retailers in France, like the number one, and ADEO is also big in Europe around tools, utility and so on. Anyway, they mentioned that their architecture strategy horizon is three to six months now because it's changing so fast. So they cannot

8:04Dumky: Mm-hmm.

8:21Mehdi: set a strategy for a year when, you know, things have been accelerating with AI and stuff is changing. If you look back, nobody was talking about skills a couple of months back. So if you build your strategy and say, "Yeah, we're gonna build skills for the whole corporation for the next year," who knows? It's a good bet right now, I would say, in that particular example, but is it for the long term?

8:32Daniel: Mm-hmm.

8:47Mehdi: I do agree, I was exactly like you. How do you manage that at your job, Daniel? Because you're doing a lot of content on the bleeding edge. What's the reality at your work?

8:55Daniel: Yeah, I mean, we like to stay flexible. We're heavy Databricks users, but at the same time it's expensive, right? That's part of this conversation too, the expense and people's, you know, Databricks bill fatigue, compute fatigue.

9:04Mehdi: I see.

9:08Daniel: I mean, I know they came out with serverless things to try to deal with that, but at the end of the day it's an extremely expensive way to process data. Sure, you can argue total cost of ownership, whatever, all that kind of stuff. At the end of the day, yeah, it costs a lot to process on somebody else's compute, and the truth is most data pipelines can be architected in such a way that you just work through things a bit at a time. And yeah, we use DuckDB all the time inside Lambdas to, you know, work on flat files and things like this, and this article, this multi-writer support, is, yeah, a hundred percent.

9:38Mehdi: Yeah. So that's a good deal, basically, for existing Databricks users that have a Databricks footprint and want to reduce their bill.

9:46Daniel: Yeah, I mean, that's why I'm still surprised it's not a big deal, 'cause people always talk about saving money and then, you know, doing fifty backflips with Spark tuning to save money. It's like, or you could put DuckDB inside a Lambda here and write to your lakehouse and just totally drop that other pipeline. That's real money, so...

9:50Mehdi: Ha ha Yes, that is true. All right. next yeah.

10:06Dumky: Maybe one funny story that I just realized, actually. When I was consulting at Xebia, we did a lot of Databricks work, and DuckDB was just coming out, and there was this Unity Catalog extension in DuckDB. And so some of my colleagues built a product out of this for some of the Databricks customers that they then called DuckLake,

10:24Daniel: Mm-hmm.

10:33Mehdi: Yeah.

10:34Dumky: 'cause it fits so well. Only to find out that, like two months later, DuckDB came out with DuckLake and we're like, okay, maybe this is not the best name ever.

10:35Daniel: Yeah. Alright, you should have sued them. Like, give me my money, man.

10:45Dumky: Yeah, exactly, exactly.

10:46Mehdi: No, so the backstory is that Duck Labs asked Xebia for permission to use the name. So yeah. All right, second link. Cloudflare. It's yours, Dumky?

10:51Dumky: Yeah, yeah. Yes, yes. So this is an interesting one to me for many reasons. One is that the Cloudflare platform is traditionally, obviously, more web and app related, but they're doing a lot of interesting data stuff. So I think R2 SQL... So R2 is their S3-compatible storage.

11:16Mehdi: Mm-hmm.

11:18Dumky: You can basically query R2, I would assume with DuckDB in the background. They are not really explicit about this, I think, but if I remember correctly it is DuckDB. But I think what's interesting here is that it starts out as a normal, like,

11:24Mehdi: Yeah. Yeah.

11:35Dumky: "Here's how we re-architected our data platform. There are so many different sources, there's legacy stuff, we need to bring all of that in." But then the interesting part actually comes at the bottom, where they built this tool called Skipper, and that's basically their agent for interacting with their data platform. If people have been following a little bit of our content at MotherDuck, then you've probably seen that we focus a lot on context. We really think that a lot of the text-to-SQL or text-to-insight questions that you ask, or problems that you present to an LLM, should be grounded in different layers of context, right? And so it was very interesting to see that that's exactly the issue that Cloudflare ran into as well. The way they solve that is very similar to what some of us are doing. So there are basically different kinds of layers of context. There's the actual schema, there's the models, there's annotations, there's... yeah, yeah. So this is an internal tool that they built. Yeah. And obviously they're using their own tools as well, right? So you see the R2 Data Catalog, their integrations. It was also funny for me to see that they mention that they still run stuff on BigQuery.

12:43Daniel: Is this their tool, Skipper, or is it open source? Internal, okay. Mm.

13:02Dumky: I would kind of assume that a company of this scale had their own stuff running, but I guess that's just legacy stuff there as well.

13:02Daniel: Ha ha ha Sure.

13:07Mehdi: A giant. Yeah, because here they mention, like, Trino and

13:15Dumky: Yeah, so they decided that they needed one engine that brings everything together. So they have one interface to query everything, basically.

13:26Daniel: Yeah, Cloudflare's interesting. I've written a few articles on them. I feel like they seem quiet in the data space, yet when you look into it, you know, they have R2 and they have Iceberg, the R2 Data Catalog. It's basically a managed Iceberg table, and I played around with it and I was like, man, this thing is sweet.

13:35Dumky: Yeah. Yeah.

13:36Mehdi: Yeah, they're pretty powerful, right?

13:41Dumky: Exactly. And in terms of pricing, they have a pretty good deal on that as well. For years and years I've been playing around with them, mostly for hobby projects, some professional projects. My background is in web data, right? And that quickly runs up your bill. If you're tracking every request, you get a lot of logs and a lot of data, and you need a way to handle that. And they've been

14:03Daniel: Mm-hmm.

14:07Dumky: publishing these sorts of pipelines, and they now have R2 SQL and Iceberg integrations as well, so they're slowly building out their stuff. That's especially convenient if you're in that web space and you have a lot of request data coming in. It's a very convenient way of getting stuff into storage and then being able to query it with Iceberg

14:14Daniel: Mm-hmm.

14:30Dumky: in your actual data warehouse.

14:31Daniel: I think that topic about context is super interesting, especially in data. I've kind of seen it from a Databricks perspective. Everybody's talked about the semantic layer for so long as the solvable problem, but in a data world, you're right, we have so much context. It's like, what table joins to what? And what's the join key? And, you know, what's the grain of this data compared to that? And you're not even getting to, like, what does this column mean? And yeah, where does all that stuff go? It's super interesting.

14:58Mehdi: So here they mention in the diagram that they rely on data for schema annotation and Trino to execute. But I wonder, maybe they mention it somewhere, how they manage the history of those queries and so on. I see they have tool charts that persist here too. But yeah.

15:18Dumky: Yeah, they don't specifically mention history. There's one thing that I want to call out around the MCP part that they built for this Skipper thing, and that is the querying of that MCP, right? What they actually said is, hey, we started out with like 30 tools and it got too complex for the model to understand when it needed to call which tool. And so they switched to this code mode version where you just give it a search function, so it can search the different tools that it has, and a kind of query coding function. So basically, what I thought was very interesting is that it then becomes easier to say, I want to call these five tools in succession: take the output from tool A, feed it into tool B, and then feed that into tool C. Instead of the model going back and forth, right, saying, hey, now I have the output of tool A, I can call tool B, you can call that directly, and you don't have to do the round trips through the model that take more time. Right. So I think that's an interesting way to solve this problem of complex queries and finding the right schemas and table names and all these kinds of things.
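
Note: a minimal sketch of the "search plus execute" shape Dumky describes, under stated assumptions. The tool names, the `search` and `execute` functions, and the `$label` reference syntax are all hypothetical and invented for illustration. Cloudflare's actual Skipper implementation is not shown in the episode. The point is only that a whole chain of tool calls can run in one pass, with tool A's output fed straight into tool B.

```python
# Three hypothetical tools over a made-up schema.
def list_tables(schema):
    return {"web": ["requests", "errors"]}.get(schema, [])


def describe_table(table):
    columns = {"requests": ["ts", "status", "bytes"], "errors": ["ts", "code"]}
    return columns.get(table, [])


def build_query(table, columns):
    return f"SELECT {', '.join(columns)} FROM {table}"


# Registry: tool name -> (function, one-line description).
TOOLS = {
    "list_tables": (list_tables, "list the tables in a schema"),
    "describe_table": (describe_table, "return the columns of a table"),
    "build_query": (build_query, "build a SELECT query for a table and columns"),
}


def search(query):
    """Return the names of tools whose description mentions any query word."""
    words = query.lower().split()
    return [name for name, (_, doc) in TOOLS.items() if any(w in doc for w in words)]


def execute(steps):
    """Run (label, tool, args) steps in order; "$label" args reuse earlier outputs."""
    results = {}
    for label, tool, args in steps:
        fn, _ = TOOLS[tool]
        resolved = {
            key: results[value[1:]]
            if isinstance(value, str) and value.startswith("$")
            else value
            for key, value in args.items()
        }
        results[label] = fn(**resolved)
    return results


if __name__ == "__main__":
    print(search("columns"))
    plan = [
        ("cols", "describe_table", {"table": "requests"}),
        ("sql", "build_query", {"table": "requests", "columns": "$cols"}),
    ]
    print(execute(plan)["sql"])
```

With thirty separate tools the model has to pick one, wait for the result, and pick again. Here it only needs two entry points and can hand over the whole plan at once.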

16:34Daniel: Mm-hmm.

16:38Mehdi: Yeah. Yeah, so they

16:39Daniel: So so what you're saying is we're not all gonna lose our jobs. Somebody still needs to know and build that. I'll have a job next year.

16:42Dumky: Someone still needs to build this and think about it, yeah.

16:46Mehdi: Yeah. But yeah, they say instead of defining thirty individual tools, we expose two: search and execute. Yeah, it's also really interesting to see that there are no standards in architecture for how you expose things to an MCP. You have trade-offs everywhere, and how does that work with the platform? It's funny because they share what they do internally, right? But I'm pretty sure

16:51Dumky: yeah, yeah, yeah, exactly.

16:59Daniel: Mm-hmm.

17:10Mehdi: they have some recommendations for the data platform, but it's still really custom and it's really, you know, on the edge. So again, it's hard to find standards. You just have to experiment. That's the TL;DR.

17:22Daniel: Mm-hmm.

17:22Dumky: But also, this is a giant company, right? They run like a quarter of the internet, and still their data team is doing the same stuff that we all do. They're trying to figure out how to bring all these sources together. Yeah, yeah, yeah, yeah. Yeah.

17:26Daniel: Sure.

17:28Mehdi: Yes, you

17:30Daniel: Mm-hmm.

17:33Mehdi: So don't feel bad. Yeah. Don't be depressed, you know, emotion time. If you're feeling behind,

17:39Daniel: Ha ha

17:41Mehdi: Cloudflare is, you know, barely keeping up with everything. Yes.

17:44Dumky: Exactly. So this is that range of emotions that we were talking about. You start out depressed

17:44Daniel: Ha ha.

17:48Dumky: and you're like, Cloudflare is actually doing the same thing that I'm doing. It makes me feel good.

17:53Mehdi: Yeah. All right. Next. I want to talk about this one. I'm curious. So this was a tweet. I actually have only tweets for this week. I'm sorry, not meaty content. It's Guillermo, the CEO of Vercel. He says, hey, this is what education in the age of AI looks like: start with the language, the linguistic surface is your roadmap. And the original tweet is that, hey, to get good animation from AI, you need to get good at telling it what you want. "Stagger this list of items, make the animation directional." And so basically they built a motion vocabulary for this. So if you go there, you have a motion vocabulary with what is fade in, fade out, what is slide in. And that's for, you know, animation. I was curious regarding data and data engineering. I mean, Daniel and I, and Dumky, have been educating people around data engineering and AI. And I haven't thought about putting together some specific vocabulary to help people prompt for what they should do, but I feel the challenge is also how deep you go into that vocabulary. For example, "make the pipeline idempotent." What is idempotent and how do you explain it? But yeah, I was

19:06Daniel: Mm.

19:08Mehdi: I'm finding the concept really interesting. I don't know if you've thought about it, about which kind of vocabulary you prompt for...
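
Note: Mehdi's "what is idempotent and how do you explain it?" is easiest to answer with a contrast. The sketch below is a toy with hypothetical function names and made-up rows, not any particular tool's API. An append-style load duplicates rows when the same batch is retried, while a load that replaces a whole partition keyed by run date gives the same result no matter how many times it runs.

```python
def append_load(target, batch):
    """Non-idempotent: every run adds the batch again."""
    target.extend(batch)


def idempotent_load(target, run_date, batch):
    """Idempotent: each run overwrites the partition for run_date."""
    target[run_date] = list(batch)


batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]

# Simulate a pipeline run followed by a retry of the same run.
appended = []
append_load(appended, batch)
append_load(appended, batch)

partitioned = {}
idempotent_load(partitioned, "2026-06-01", batch)
idempotent_load(partitioned, "2026-06-01", batch)

if __name__ == "__main__":
    print("append rows after retry:", len(appended))
    print("partition rows after retry:", len(partitioned["2026-06-01"]))
```

A one-line definition like "re-running the job must not change the result" plus a contrast like this is probably the level of vocabulary a prompt or cheat sheet needs.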

19:15Daniel: I feel like part of the problem, though, is a classic data problem. If people are on classic software engineering teams, you know, everything's very, I don't know, it just has a reputation of, we do things a certain way. And data has always had this reputation where it's kind of fly by the seat of your pants, ad hoc. You know, some people are doing ETL, some people are doing ELT. At least in my experience, there are not that many standards around how people decide what a pipeline looks like in the first place, you know. I feel like that's part of the problem too, right? Everyone does it so differently. And we have so many tools that it is an interesting problem. How do you explain something that, you know, yeah, you don't see people say, like, we're

19:40Mehdi: That's true.

20:02Daniel: like, data sink. You know, sometimes I'm writing an article and I talk about some data sink or some data source, but even that, not that many people seem to use that sort of terminology. Does that make sense?

20:14Mehdi: Yeah. This could be a good study for an LLM. What is your standard? What do you know? And what do you define as standard? I think

20:19Daniel: Mm. My guess based on what it's been I don't know. I mean I don't know how often some of these models are updated, but I don't know. I still feel like there's probably a lot of like old school stuff in there, right? There's probably a lot of old school data warehousing terminology in there, third normal form, like that type of stuff, right? Third normal form, I'm sure that's very whereas like maybe the

20:41Dumky: Yeah. Yeah.

20:44Daniel: medallion architecture, right? That sort of stuff. Yeah, it's in there, but comparatively, if we think about what this stuff has been trained on, it's probably a huge amount of what I would call older school stuff, versus how much was it actually trained on medallion architecture, right? Yeah, who knows?

20:57Mehdi: That's true. We don't we don't know how much weight there is for news. Yeah, we don't know how much weight there is in the new part of the training of people chatting with it versus all the old

21:09Daniel: Yeah, I'm trying to think what I do on a daily basis. If I'm writing a pipeline, if it's a Spark something, I don't know.

21:13Dumky: So the thing that I've found that maybe describes the way these models interact, and that makes them a little bit schizophrenic, is that I feel they are software developers that have come into a data world, right? So they can reason about OLAP and how data problems work, and then they go and write code and it's all transactional, it's row-by-row inserts into an analytical data warehouse. So I found myself

21:27Daniel: There you go.

21:40Mehdi: That's true.

21:40Dumky: writing these skills, especially for the DuckDB stuff, where I basically educate the model on, like, what is a row group, what is the size of a row group in DuckDB. For Parquet files it does a little bit better, I would say, but especially around DuckDB it needs to understand stuff like min-max indexes

21:56Daniel: Mm-hmm.

22:04Dumky: and row groups, and the fact that it needs to do batch inserting, and that it's better to write to a CSV locally first and then insert the whole thing instead of going row by row, for example. So those kinds of problems are very interesting. I only realize now that that's actually the vocabulary thing: it needs to understand this.
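
Note: to make two of the terms Dumky lists concrete, here is a toy model of row groups with min-max statistics. It is plain Python with hypothetical function names and an arbitrary group size, not DuckDB's storage engine or its real row group size. It shows the idea behind a min-max index: a filter can skip any row group whose stored maximum already rules it out.

```python
def build_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and how many groups actually had to be read."""
    hits = []
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # min-max stats prove no row here can match
        groups_read += 1
        hits.extend(v for v in group["rows"] if v > threshold)
    return hits, groups_read


if __name__ == "__main__":
    groups = build_row_groups(list(range(1000)), group_size=100)
    hits, groups_read = scan_greater_than(groups, threshold=949)
    print(len(groups), "groups,", groups_read, "read,", len(hits), "matching rows")
```

The skipping only pays off when rows arrive in large, roughly ordered batches, which is one reason bulk loading tends to suit an analytical engine better than row-by-row inserts.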

22:16Daniel: Mm-hmm. Yeah. I've had the same problem working with AI and DuckDB. I have to feed it a lot of documentation, like, you're doing this auth to AWS the way it was done five years ago, not like six months ago, you know what I mean?

22:37Mehdi: Does this still happen? I feel like it happens much less. I remember, like a year ago, for a demo I had to index docs and so on and point it to them explicitly. Now there is web fetch, and sometimes it's just gonna go fetch the latest docs first. It still makes some wrong assumptions, but it's not like five years ago. What's your take?

22:46Daniel: Mm. Mm-hmm. yeah, it does.

22:58Dumky: But that's so funny to me. Most of the time it goes well, and then even last week I had something where I started it with yesterday's date and then it said 2024, like June 2024 or something. I was like, where did that come from? Yeah.

23:11Mehdi: Ha ha.

23:14Daniel: Yeah, I think you just have to be specific, right? If I assume that it's gonna do what I would want it to do, yeah, no. You've gotta be specific about what your expectations are, I feel like. Otherwise, the more ambiguous you get, the more who knows what's gonna happen, right?

23:28Mehdi: But I do think it would be interesting to create a cheat sheet of vocabulary on, like, idempotent, incremental versus snapshots. Because the model, even if it knows the information, let's assume it knows, is gonna make a decision. It's not going to ask questions often. I

23:33Daniel: Mm-mm. True. Mm.

23:49Mehdi: I start to see that sometimes now it gets a bit more curious. It asks me if I want a full snapshot or incremental, that kind of level. But then, like, what's the partition you would like, how big is your data set, you know, to always have as much back pressure or some stuff like this. So I think, yeah, anyway, it gives me some ideas. I think it's interesting. We've been educating people on how to prompt and, you know, write code, but I think just a list, like, here is the pipeline and here's the vocabulary, that a human can read. That's the difference between the skills and what you could read. Because if you look at this one for animation: entrances and exits, sequence and timing, keyframes, defined points in an animation. So those are common terms, but they're not hard to understand. And then you can, you know

24:19Daniel: Mm-hmm. Yeah.

24:39Mehdi: Find your way through to the model. All right, next block. We have, from our friend Hoyt Emerson, ADBC.

24:48Daniel: Yeah. For sure. I put two. This one and the next one are both, like, Arrow. I don't know. I just feel like Arrow is fun. It's something that everybody talks about, sort of, but it also doesn't get talked about that much. Arrow's, yeah.

25:04Mehdi: Backstage, yeah.

25:05Daniel: Arrow's behind a lot of tools and people have no idea, but I feel like, especially in the last year, it's really starting to bubble to the top of conversations, where it's almost an everyday conversation in the data community, which I don't think it was a few years ago. It was kinda more niche, if I wanna call it that. And it's been super interesting to see, as someone who's been around doing data for two decades, and I think it's great how people can relate to it. I think this article's a great example, 'cause most people are used to ODBC drivers, right? That's the classic, been around forever. And this is kind of the rise of ADBC and the Arrow drivers. It's a great introduction into: why would I even use Arrow? What is Arrow? It's a columnar format, it's fast. And then how it's starting to affect every part of the data stack in very fundamental ways, because a driver, an ODBC driver, that's like a super fundamental

25:35Mehdi: Yeah, or JDBC, yeah.

25:58Daniel: thing, and to see some of these tools that have been around for like 20 or 30 years now getting replaced by Arrow, in this case ADBC. Hoyt has a ton of great content around Arrow that people

26:10Mehdi: So just as a recap for the listener, ADBC is basically a protocol, a new-gen protocol to replace JDBC and ODBC, as you were saying, Daniel. And the big key here is that you don't serialize to rows over the network. You basically have your source data, which is already in a columnar data format, and your target destination, which is also columnar. And so the protocol takes advantage of this rather than

26:25Daniel: Mm-hmm.

26:36Mehdi: serializing to rows back and forth in the standard JDBC fashion. Like
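
To make the row-versus-columnar point concrete, here is a small pure-Python sketch. It is not the ADBC API and uses no Arrow library. The function names and sample data are hypothetical, and plain lists stand in for columnar buffers. It only shows that a row-based driver forces two pivots (columns to rows, then rows back to columns) that a columnar hand-off skips.

```python
def columns_to_rows(columns):
    """Pivot column-oriented data into row tuples (what a row-based wire format needs)."""
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    return [tuple(columns[name][i] for name in names) for i in range(n_rows)]


def rows_to_columns(rows, names):
    """Pivot row tuples back into column-oriented data on the receiving side."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def row_based_transfer(source_columns):
    """Columnar source -> rows on the wire -> columnar target: two pivots."""
    wire_rows = columns_to_rows(source_columns)
    return rows_to_columns(wire_rows, list(source_columns))


def columnar_transfer(source_columns):
    """Columnar source -> columnar target: each column is passed through whole.

    The columns are copied here only to keep the sketch simple; there is no
    per-row pivot in either direction.
    """
    return {name: list(values) for name, values in source_columns.items()}
```

Both paths deliver the same data. The difference is the per-row work in the middle, which grows with the number of rows and is the cost a columnar protocol is designed to avoid.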

26:40Daniel: Yeah, and what I love about Arrow is that pretty much every tool that anybody works with, whether they know it or not, works with Arrow. You can seamlessly go between Spark, DuckDB, Polars, whatever. For the most part, most good tools nowadays, it's almost all of them. If you go check the docs, they typically have some sort of to-and-from-Arrow call, right? So it just opens up the world, right? You start thinking about using Arrow, and you can, you know, be creative in how you design your

26:46Mehdi: That's true.

27:10Daniel: architecture and your pipelines, 'cause in the past it was like, well, I just work on Spark data frames, so everything has to work with that. It's like, well, not anymore with Arrow, right?

27:19Mehdi: Yeah, yeah, yeah. And so

27:20Dumky: And then correct me if I'm wrong, but that means that those tools also leverage, like, the zero-copy stuff, right? So they can basically use the same copy of that data without having to put everything in memory again.

27:34Daniel: Probably depends on the tool, but yeah.

27:36Dumky: Yeah, yeah, yeah.

27:37Mehdi: Yeah. But the big thing for me is that, yeah, as you said, it's used and adopted by a lot of tools. Now for ADBC, for the protocol, I'm not sure what the adoption is. I was just looking, actually. Snowflake supports it, DuckDB supports it, dbt Labs apparently supports it in dbt Fusion. Databricks. Have you tried with Databricks?

27:57Daniel: Mm-hmm. Yeah, we used Arrow quite a bit back and forth in Databricks. It works great.

28:01Mehdi: Okay. But I feel like the use cases here are mostly for consumer tools, maybe? Like BI.

28:09Daniel: Yeah, I would say it seems like Arrow, and that's probably why it's still sort of a background topic, is that for the most part a lot of the Arrow ecosystem, including this driver, is almost like the tools that are used to build the other tools, right? They're sort of happening behind the scenes.

28:24Mehdi: Yeah, so it's happening behind the scenes. Because, for example, Power BI, yeah, I think they also have an ADBC client. Then, yeah, it's mostly a configuration and you don't even know if it's ADBC. You're just setting up the stuff and that's it.

28:35Daniel: Mm-hmm. I mean, you're around DuckDB and MotherDuck so much. Have you heard much rumbling? Do you ever hear anything about Arrow and DuckDB together? Are people starting to use that stuff, or?

28:54Mehdi: So we do use it internally a lot as a standard data frame, just because it's pretty efficient if you're in the Python world and you need to go back and forth with the DuckDB world, right? So, to avoid serialization. Typically, to give context, if you ingest data from an API, staying in the

29:04Daniel: Mm-hmm.

29:19Mehdi: world of Python data structures like dictionaries and so on is pretty slow. So if you go as fast as possible from the raw ingestion into an Arrow data frame or table, then the cost to load that into a DuckDB table is really negligible versus the cost of serialization from Python data structures to DuckDB, I mean at scale. So that's where we've been using Arrow. For the protocol specifically, I think you summarized that right. I think it's gonna happen behind the scenes and people are not going to know. It just takes time to adopt that kind of thing at the industry level. But yeah, it's true that Arrow probably doesn't get as much love as it should.

29:47Daniel: Mm-hmm. Mm-hmm. Mm-hmm. Yeah. For sure.

30:11Mehdi: Yeah. All right. Next, do you want to talk about this one, Dumky?

30:17Dumky: Let's do the other one first, SoloTerm. I think that's, yeah, a different kind of topic than what we talked about so far. So, you remember Aaron Francis, who was at PlanetScale? If you don't know the guy, he's a great educator and YouTuber as well. Yeah, yeah, yeah. He did a ton of great stuff. He has really good YouTube videos on more of the

30:20Mehdi: SoloTerm? Okay. What is this? This is Conductor. Yeah. The database educator. Yeah, yeah.

30:42Dumky: transactional side of data, I would say. And he moved away from PlanetScale, does his own stuff, and now he's built this thing, which I started to play around with, actually, yesterday or today. And it's a different way. The reason I wanted to point it out is that I am still trying to figure out workflows around managing my own code, code that agents write, code that sub-agents write, and then feedback into agents, and doing that across different projects, across different Git repositories. So, like many, I've been using Conductor a lot, and it's been for now one of the easiest ways, I think, to manage different projects with different agents running at the same time. conductor.build, I think it is. Everyone calls their thing Conductor. Yeah, it's one of the metaphors, so apparently AI is an orchestra that you can run. I think that's probably a better analogy than having your agents as slaves and you're the, okay, so

31:33Mehdi: Jesus. There's really a brand. Yeah. But so what's what's the take on like the versus conductor, which is basically agent analytics agents?

31:49Dumky: So the nice thing about SoloTerm, I think, is that it's a little bit more opinionated on the workflow. What it does is give you something more like a notebook that you can interact with, so that the agent comes up with a plan, and that makes it easier for you to adjust it. And then it spins up sub-agents, or like a full-blown team, basically. And then the main agent can set timers to regularly check in with them, and they can report back to that. Yeah, yeah, yeah. But the trick there, I think, still is to really nail down the UX, the user experience, of how this all works, right? And I think that's what he does quite well, in that you get these little trees

32:19Mehdi: Okay.

32:20Daniel: It's like workflows within workflows, it's like Interstellar in, all right.

32:40Dumky: that you can dig into to see what my sub-agent is doing, what tool call my sub-agent is making, and what the result of that tool call was, without having all of that in one big place where you can't find it anymore. Right. So the trick is to be able to find that one tool call that made a mistake or had an error, while not having this overflow of, yeah,

33:03Daniel: But what if I want like fifty terminals open?

33:06Mehdi: But yeah, yeah. I mean you you

33:08Dumky: Yeah, so I so this is what I'm playing around with, but I'm mostly curious, like how do you guys how does anyone manage all of this these days?

33:16Mehdi: I already gave mine in the previous podcast. I can repeat myself, but Daniel, how do you manage multiple agents today?

33:22Daniel: Just multiple terminals, yeah, on my Mac. I'm a terminal guy, like, I don't know, it's just

33:25Mehdi: Yeah, what do you use? Like, are you customizing stuff like tmux or whatever, or just plain? Yeah?

33:32Daniel: Yeah, like I'm just a Vim guy. Just old school. I mean, that's just the era I grew up in, you know, and the command line and, yeah, I don't know. I'm definitely not doing no Gas Town stuff over here. I just, you know, I try to keep it clean, focused. That's part of it too, right? I try not to get too much stuff going on at once. It's like a fine balance, you know what I mean? It's like I want good results, I want to be efficient, but at the same time I

33:43Dumky: Yeah.

33:56Daniel: You know, I'm shipping stuff to prod and I don't want to break it, so I don't know. It's tough to find the right balance. This looks sweet though. I'm gonna check it out.

34:02Mehdi: I have a blog related to this, but my workflow is, yeah, also tmux and various things. I just have multiple functions to help me search any conversation and open it directly in Claude, having a view on, I think I saw that he showed the open ports here, or maybe it's Conductor. Like if I have a web server running, I can see, yeah, here, for example.

34:15Daniel: Mm-hmm.

34:27Dumky: Yeah, yeah, yeah. That's the one thing I like about Conductor, yeah.

34:29Mehdi: I see that also in my tmux terminal. So it's those kinds of little things. And if I close the terminal, like the session, today it closes the web server. It's more stuff like that, to avoid having like 10 go stuff. But at the end of the day, I'm still discovering what works. I do feel the terminal is nice for me because it's easily customized. But those things, I think that's what I said in previous podcasts, is that

34:40Daniel: Mm-hmm. Ha ha

34:55Mehdi: I do like Conductor and those things just to get inspired on what the workflow looks like. Where I think those things are interesting is in how they manage different kinds of agents, if you're relying on Claude Code and Codex and you're not, you know, going to OpenCode, for example, because all those labs are also kind of closing the door on being able to just have an API key, or, you know, raising the price.

35:00Daniel: Mm-hmm.

35:17Dumky: Yep. Yep.

35:18Mehdi: So I think you need to have kind of higher-level agent harnesses, like, okay, this is how I use the Codex CLI and Claude Code. Now the other question I have is that those same labs are investing in those workflows, right? I don't know if you've tried Claude Code workflows, but you can now set up, you know, multiple agents and

35:40Dumky: So yeah, I tried it. You mean like the multi-agent thing. The workflows, sorry, I thought you meant the routines. I haven't tried the workflows yet with the multi-agent stuff because I'm worried that gets too expensive. But just figuring out, so this is my thing about the UX, right? I realized that if I wanna have a skill that's specifically for MotherDuck within our organization, it cannot go into the default Claude

35:42Mehdi: Not convinced?

36:07Dumky: AI interface, because that's not version controlled and not managed by our organization. It has to go in, like, Cowork, because that's the only place where you can have plugins for your organization that are version controlled. But Cowork only allows specific subsets of things that you can do, because you can't fetch certain web pages, because those are shielded off. So then you go to Claude Code.

36:19Mehdi: Yeah.

36:34Dumky: And then Claude Code can run locally and in the cloud. And if you accidentally schedule the cloud version, it doesn't have access to your local MCPs. And this drives me crazy.

36:41Mehdi: Yeah, there are a lot of limitations here and there.

36:46Daniel: We should take a sidebar. What do you guys think is gonna happen? I feel like all of a sudden it was kinda like a bait and switch. We all knew it was coming, like everybody's talking about token cost, and I feel like up until this point it was just a happy world. It's sparkles and unicorns. We're all using these tools, they work together, and all of a sudden it's like the teeth are coming out, right? All of a sudden the money matters and the teeth are coming out. I think it's gonna be interesting to see what happens, because we're all hooked on this stuff now, right? And

36:55Dumky: Mm-hmm.

37:00Mehdi: Yeah.

37:13Daniel: It's like, what's gonna happen tomorrow? You know what I mean?

37:13Mehdi: Like drug addicts. Yeah. I think, as I said, for me those will be included in the labs. The problem is people don't want to have model-locked workflows, right? I think there is huge value in having, you know, Codex review, Claude Code and so forth. And also some models are really more, I think, specialized towards specific tasks, or, you know, behave in a certain way. So, yeah, I don't know. I see labs taking a lot of this, closing the door. But

37:47Daniel: Have you guys tried any local models yet? Have you guys done any open co have you found anything that works locally, reasonably or

37:52Mehdi: Yeah, that's a good point. There was a great talk at PyCon from tower.dev about small models. So small models are trending now. I think that could be, you know, a good compromise. But the problem is if the labs don't open up, like, if they close their gates as they are doing right now, right, to keep people only in their ecosystem,

38:01Daniel: Mm-hmm. Mm-hmm. Yeah, right. Mm-hmm.

38:16Mehdi: it's gonna be hard. Like, I would love to switch between a local model and sometimes using Opus, right? But to have that in a seamless workflow, does that feel realistic in the future?

38:22Daniel: Mm-hmm. Yeah.

38:28Dumky: No, I mean, obviously they're gonna build their moat within their ecosystem and all the connections that you have there, right? But I do think there's a lot of pressure from these open source or open weights models that will prevent them from raising their prices too much, I think. And there's a lot of interesting stuff that Google was coming out with, I think this week or last week as well,

38:28Daniel: No. Mm-hmm. Sure. Competition, right?

38:44Mehdi: I think corporate may have a

38:51Dumky: of having like models and agents just everywhere in your browser and doing a lot of local stuff with that as well.

38:56Mehdi: Yes. All right, I have a blog related to this, since you were speaking about Google. It's from Addy Osmani. I don't know if you know the guy, he's pretty popular. He's been at Google for years, was on Google Chrome for a long time, and now he's on Google AI. He has multiple books, and I found this article pretty nice. He talked about the orchestration tax. That's kind of what we were talking about. Basically you are the single-threaded resource. And how do we, yeah, sorry, we are getting depressed again. Hopefully we're gonna have a blog which is more optimistic. But the deal here was saying that, yeah, your brain can, you know, only pay attention to one thing, and there is an asymmetric tax where it's easy to start an agent and hard to close the loop.

39:24Daniel: Is that all I am?

39:30Dumky: Ha ha ha.

39:30Daniel: Mm-hmm. Feel that.

39:48Mehdi: And so he was mentioning that he has a specific way of working where basically you have a pile of really complex tasks, where basically I need to give my judgment, you know, on an architecture or a weird bug. And the others are more deterministic, like, okay, I need to, you know, do those tasks and they're small things. And he tries not to mix

40:04Daniel: Mm-hmm. Mm-hmm.

40:15Mehdi: those two paths. And I think I'm getting there also. Often when I have a complex thing, you know, I've tried to be greedy and open three things, but I realize, meh, I cannot reply to two complex projects. I would rather focus on just one complex thing and other small ones. But yeah, there is a specific tax we pay today, even if you have a good way of orchestrating.

40:25Daniel: Mm-hmm.

40:38Mehdi: And to plug a third blog, because it's also related: Mitchell Hashimoto, so, creator of HashiCorp, I mean one of the founders of HashiCorp, Terraform, and now also working on the popular Ghostty, said that he had an agent loop optimizing a renderer, and the result went from 82 milliseconds to two milliseconds. But actually that was not the whole story. Allocations were down and it sounds good. He explained the technical reason why, but the result was actually worse than before. And the point is you always need to review, and that depends on your test results, right? And if we come back to an example of a data pipeline, you could say, yeah, sure, the pipeline is faster, but maybe you're storing the data in the wrong type. So the read is gonna be, you know,

41:24Daniel: Mm-hmm.

41:30Mehdi: counter-performant, and the model doesn't have this context, right? It's just optimizing the data pipeline speed, not the read speed from the consumer. So I think this is also related somehow to the orchestration tax, in that it's just hard to still review and put tests on those big loops. I don't know, how do you guys do it easily?

41:51Daniel: Yeah, I get it. I've

41:52Dumky: To me that is kind of the thing: if you think of what we as humans have as context, right, compared to the context window of an agent. There's just so much experience that we take with us, so much knowledge, like small things that we've encountered once that still kind of seep through in all these little decisions and evaluations that we make. And I think especially your example of the type, like the type that you write is maybe not optimized for what you want to read, that's just one of those small things that you have to take into account and that a model will only do if it's very obvious.

42:32Daniel: Yeah, I feel like with the tech culture now we're between two worlds, where you're expected, or you feel like you're expected, to be that 10x, 100x engineer and get stuff done, and like, why shouldn't you have this big project done, including architecture, in three hours or something, right? And in our heart we know that that's not the best way to approach it. Your experience, like you talked about, tells you that you should take some time here to think about this, and maybe even take a day after that agent's worked up the architecture. Like, maybe take a day, just don't look at it, come back tomorrow and say, this isn't what I want, this is what I want. Yeah.

43:02Mehdi: Yeah, yeah. Definitely. This applies to a lot of things, right? There is a joke about sound engineers: they do a mix and then the next day they say, what the hell is this, it's horrible. Video editing, it's the same, you know. It just helps you to get perspective and zoom out. It's true, and that takes time. You cannot buy time for this.

43:18Daniel: Mm-hmm.

43:21Dumky: Writing also.

43:28Daniel: Mm-hmm. Yeah, and we're still moving faster, right? Like, just 'cause you take a day. Something that took you three weeks before, just 'cause it doesn't take you a day, it's okay if it takes you five, 'cause you're still, you know, doing a good job, taking your time, but you're still moving a lot quicker. You don't have to do it in an hour, you know.

43:39Mehdi: Yes. Yeah. We're probably trying to squeeze the time too much for quality. I think it's exactly right. Instead of three weeks it could be five days. But people are gonna squeeze and say, hey, but you know, there is a new model, four dot eight. Why can't you do it in four days now? And at this point, I don't personally feel it's the quality of the model which is the blocker. What do you think?

43:53Daniel: Mm-hmm. Right. Mm. Yeah. Yeah, no, not at all. I think it's definitely the people and the time you take and your questions, your experience. I think that human experience you've had building task pipelines or architectures at all has a huge impact on the end product. I think that AI at the end of the day is just a multiplier of who you already are. I know it's a little bit simplistic, but if you were kind of winging it before, you know, it's just gonna make you wing it 10x. And if you really cared about your craft as an art form, and you take your time, do it right, and you care about, you know, I don't know. Yeah, if you just have a care for what you're producing and the long-term impact, the business, downstream, you know, then it's just gonna make you better at that, if you just slow down a little, be yourself, take your time, use it.

44:56Dumky: So do you feel that the way you spend your time is different now?

45:01Daniel: Yeah, I would say, well, yes and no. I would say I spend less of my time on the details. I feel like I can trust AI more for the details these days. And by details I mean like what's inside that function, right? Like, unit test it, write it. Expectations, you know, you've got skills: this is how we expect this to be done. And I did spend a lot of time doing architecture, systems design, like high-level thinking, but I feel like I have even more time to spend doing that, and AI is a great help to bounce ideas off of, like, what am I missing, you know, kind of straw man things, steel man things, what's going on here. And I think, you know, you can even have better end results, right? 'Cause you're actually spending time on the very important stuff at the beginning, where maybe in the past I would have been rushed on architecture 'cause I knew I needed three weeks to write the functions or something, right? Whereas now that's like

45:50Mehdi: Yeah, indeed. So we're getting close to the end. Do you want to go quickly on the last one? Yeah. I'll save it. I will save it for next time. It's Databricks Zerobus, event streaming into the lakehouse. Can you just tease us on that? What is it?

45:58Daniel: Save it for next time, huh? Yeah, if you haven't checked out Zerobus, it's specific to Databricks, but basically, with lakehouse architecture, integrating streaming into a lakehouse has always been difficult. Custom, blah, blah, or there's Kafka connectors into your Delta Lake or Iceberg. It's always been this problem. And now Databricks has this. You can check out the article or go Google it, it's just a couple lines of Python and you can be streaming your data right into your lakehouse, with a couple lines of code. So it's just simplification of streaming architecture.

46:28Mehdi: Okay, building.

46:37Daniel: Which seems to be happening a lot, right?

46:38Mehdi: Okay. Yeah, in a lot of places.

46:40Dumky: So the

46:41Mehdi: Cool.

46:41Dumky: the successor of Delta Live Tables.

46:43Mehdi: Yeah. Yeah.

46:43Daniel: Could be. Who knows? Delta Live too. Sparks man. Don't even talk about that. Yeah.

46:48Mehdi: It was a pleasure to

46:47Dumky: Okay, next time.

46:50Mehdi: have you here, and yes, thank you for listening to us, and again, motherduck.com slash explainanalyze if you want all the notes and to be notified of the next episodes, and we'll see you in the next one. Cheers.

47:03Daniel: See you, guys.

47:04Dumky: See you next time. Thanks, Daniel.

47:05Daniel: Bye. Yep.

47:06Mehdi: Let me find the stop button there.
