Join Cody from Voltron Data and Mehdi to chat about the Ibis project! Ibis enables a single library to query many backends, including DuckDB, Snowflake, Spark, Polars and more.
Transcript
0:00 When building a pipeline, it's pretty common that you would like to switch to a different execution engine, whether for migration purposes, pointing to another database with the same pipeline, or for efficiency, using a lighter framework for single-node processing while still being able to use distributed compute like Spark when scaling. And what if you could keep the same code base but
0:22 change one line to point to a different backend for your execution engine? Well, that's the promise of Ibis, and that's what we are going to talk about. Cody, welcome. You're in product at Voltron
0:36 Data. What is this company all about? Because I was doing an introduction about Ibis, can you maybe introduce yourself and how you are linked or not to the Ibis project? Voltron Data is a company that formed really out of a few open-source companies a few years ago. You can go and kind of read about
0:56 that history, but they make up some of the main contributors to a lot of the popular open-source frameworks in data, namely Apache Arrow, the Ibis project, and Substrait, which are kind of the big three. So we contribute to those open-source projects, we offer enterprise support for all of those open-source projects, and then the main product
1:18 that Voltron Data has is called Theseus, which is an accelerator-native, GPU, big distributed data engine. They also of course support the Ibis project, which is the main thing that I focus on, which is a separately governed open-source project. Yeah, and so
1:39 that is interesting. Before diving into the story, can you give us some timeline on Ibis? Because it's starting to pick up traction in the community now. I see some people have heard of it in the comments, but is it actually a fresh new project, and how did Voltron Data
2:02 get into that project? Yeah, that's a very good question. So Ibis is actually almost 10 years old. I think it's nine years old at this point. Wow. Yeah, it was created in 2015 by Wes McKinney, who is of course the creator of pandas, co-creator of Apache Arrow, and one of the co-founders of
2:23 Voltron Data. So that is partially how Voltron Data got involved. They of course knew about Ibis from Wes. And the other reason why Voltron Data is involved is that, rather than this kind of trend we see of each database vendor, or everyone, creating their own dataframe API for their backend, Voltron Data
2:46 likes to support these open standards. And if Ibis can grow into that open standard that works with every database backend, then it's just a single API that everyone has to learn. So yeah, the project is quite old. It went dormant, I would say, a few times, and Voltron Data really picked it back up in early
3:07 2022, hiring many of the current engineers who are on the team. So it's been getting a lot more traction over the past couple of years. Yeah, no, that's really interesting. You mentioned the other main project from Voltron. I
3:24 want to touch on this, Theseus,
3:29 right? Is that correct? Yep. Let me share it with the audience.
3:38 So it is actually basically kind of the
3:43 first data processing framework on GPU, correct me if I'm wrong, I actually haven't heard of any other. Yeah, it's not necessarily the first. Many of the founders and people who work at Voltron Data had worked at Nvidia before on the cuDF project. So cuDF is kind of pandas on GPUs, and you can distribute it through
4:09 Dask, but Theseus is basically a ground-up rewrite, or first-time write, of a big distributed system that runs on many GPUs. Theseus is not for 99% of people. For 99% of people, something like DuckDB or Polars or whatever is probably what you need, but when you
4:32 get into that 30-plus terabyte range, that's kind of where it comes in. Yeah, I like that you already give a size bucket, because I think
4:46 that's the whole shift that the data ecosystem is going through: we used to have mainly one data processing framework, right? We started with Hadoop, distributed, to scale, and now we're kind of going down slowly, because single-node machines and laptops have become more powerful. But then we are in this
5:11 situation where we have to kind of embrace different execution engines, right?
5:19 And maybe on that premise: when did Ibis actually dive into this single-API approach, and where was that design triggered from? Yeah, so that is the thing with Ibis: whether I'm working on my laptop or on a beefy VM
5:44 or with one of these distributed frameworks, whether it's PySpark or Trino or Theseus, I can use the same API and the same code, whether I'm doing development as a data scientist or productionizing something at a larger scale. Ibis was initially developed for Impala, actually. Impala is
6:09 a little less popular these days, but that was the thing a couple of years ago. Yep. So it started with Impala, then I think the Postgres backend got added, and SQLite, and more and more backends over time. So now we're up to over 20 backends. Whichever backend you choose, whichever
6:29 one's the right tool for your size of data, whatever job you have, you can use them all with a single API. Yeah, so I actually have here, on a slide, a kind of history of the dataframe ecosystem, because I feel it's super interesting that pandas was first,
6:54 and actually pandas was inspired by, you know, the R programming language in this dataframe approach. They were first, and then we had Spark,
7:07 Dask, dplyr, Ibis, then cuDF, to be on GPU,
7:15 but actually I've never seen that many use cases on this one. Why do you think that is? Is it because you mostly need a specific GPU to be able to run those? Yeah, you need Nvidia GPUs; they only work on Nvidia. Some people have those on their laptop, but it can also
7:37 be hard to set up, and you can run into memory issues. Until kind of recently, GPUs have not had many gigabytes of RAM. So it's very fast if it works for what you're doing, but it can be a bit more cumbersome to set up and easier to hit out of
7:55 memory issues. Yeah. And then we had Modin, on Ray, which has been
8:05 acquired, right? I think so. So Modin is
8:09 an open-source project that still lives on. The company, Ponder, which offered Modin on top of certain
8:19 database providers, was acquired by Snowflake. Yeah, that's correct. Then Koalas, which is pandas on Spark. I'm wondering where it is right now, because I've seen Databricks invest a lot into PySpark. And I feel, with the other competitors, that
8:45 we're going to see it's just getting more. Then we have Polars, which is multithreaded dataframes in Python via Rust. So this is where the Ibis
8:58 journey started, or earlier? Is that correct, roughly? Yeah, so it's been around since 2015, but really 2022 is when much of the current team that works on it at Voltron Data was brought on and started working on it full-time. And you can see in the commit history and everything else that that's really where the project took
9:20 off. And then you mentioned Snowpark Python, which is again another different thing, and BigQuery DataFrames from Google, so they are using Ibis? Yeah, this one's interesting. So one of the points of this slide is that, with Ibis, we kind of run into the classic standards problem of, okay, there are 14 different dataframe
9:47 libraries and Ibis is trying to be the one to unify them all. We kind of see a few different approaches that many of these take. Many of these are pandas on X, but we kind of believe that
10:01 pandas is not really the right API for distributed computing, and there's a lot of baggage with pandas, though to their credit they're doing a lot of good work to make pandas faster and integrate Apache Arrow and a bunch of good stuff there. But the last one is a very cool example, because BigQuery wanted to
10:19 build a pandas on Google BigQuery, and rather than start from scratch and do everything on their own, they instead actually wrap Ibis and basically provide the pandas API, which goes to the Ibis API, which then goes to BigQuery SQL. Yeah, so there is a bit of a round trip, I guess. Yeah, no, but that's
10:42 interesting to see. And next to that, we have basically kind
10:50 of three approaches, if we summarize, in the dataframe world, right: the pandas clones, as you mentioned, the PySpark clones, and something else. But there is
11:02 also a rise of SQL lately, right? Yep.
11:07 And why is that so? Do you have some thoughts on that? So SQL in theory is a standard;
11:22 there is an ANSI standard for SQL, but in practice every database has its own dialect of SQL. And, you know, it's not really the fault of these people. DuckDB calls it friendlier SQL, and they get a lot of requests from customers and users, and people want this and that, and they
11:40 add them to their own dialects. But that brings us to an issue: it's then hard to go between different SQL dialects. There are also good open-source tools popping up for that, like SQLGlot, which Ibis uses fairly heavily. And yeah, the other thing: I
12:00 very much come from this dataframe world. I've worked primarily with data scientists and machine learning engineers before. Then I hopped over for a little while to dbt Labs, which is of course very SQL-heavy, and I kind of got to see this other side, of people who, perhaps kind of like
12:20 you said at the beginning, really don't like dataframes and just want to sit in SQL and in that terminal. So there are very much two different worlds, of Python dataframe people and people who just prefer SQL, and in both of them you run into dialect issues or API issues, where
12:37 there's one dataframe API for every database, or every query engine, and maybe one SQL dialect for every one as well. So unifying those is a very cool promise that Ibis tries to deliver, at least on the Python side. Yeah, and so you point out that there is an ANSI-standard SQL, but
12:58 this one is evolving more slowly than the industry needs, and I think that's why every single database builds its own functions. I think recently there were some JSON features added last year, but for something like seven years there weren't any updates, and seven years in the data world is
13:19 like, you know, more than a decade, with everything that happens. So there is definitely that, but I feel like your journey is actually kind of synonymous with the rise of SQL, where it,
13:32 you know, lowered the technical barrier to entry, so you don't need to
13:39 know Python and you just run simple SQL. But initially, I mean, distributed pipelines were written in Spark, Scala Spark, and that was a dataframe approach. So it's interesting to see that we basically started with dataframes, we turned to SQL,
14:00 and now we're kind of seeing the rise of dataframe libraries again, with Polars and others, and Ibis, right? So it's kind of, yeah, full
14:11 circle. When do you think...
14:16 So yeah, just quickly, there are
14:22 two slides I want to show. There is this one, and there is
14:29[Music]
14:34 this one, which basically shows directly the added value of Ibis: you write in a single syntax and then you can use a different backend, whether it be BigQuery or ClickHouse, so there is a bunch of them here on the SQL side, and then there are still dataframe backends, pandas, Polars, and I thought
14:59 PySpark... PySpark is over there. Yeah, we actually used to use the PySpark dataframe API, but recently switched over to using SQL directly, which makes it a little easier to manage. Yeah, I guess so. And Databricks has also invested a lot in its SQL interface, I think, to
15:20 the extent that, as we just said, a lot of people are using SQL too. And so then there is this slide that I really like, because it shows basically in one picture what Ibis does. Congrats on that; I don't know if you're the author of this slide. If you want to run DuckDB
15:44 as a backend, this is basically how you do it: get your dataframe and you can do a group by. If you want to use Polars, this is how you do it, and actually, as you see, there is just the first line which is changing, right? Can you elaborate a bit
16:05 on how this works? Yep, so that is the ideal: to change your backend, going from DuckDB to Polars to DataFusion, or DuckDB locally to MotherDuck remotely, or PySpark locally to remotely, all you change is basically your connection there and your connection
16:26 string. This is just showing local backends, because it's a pain to set them up for Snowflake and all these other things in these slides here. But yeah, you just change your connection out, and then the API is the same and what you get out of the backend is the same. There are, you
16:45 know, of course, edge cases, like how different backends handle regular expressions, and a bunch of different considerations, but for the most part Ibis does a pretty good job of making the experience, and the code that you get, or the SQL code in particular that you get out of them, uniform across backends.
17:04 Yeah, and related to the development, we have an interesting question from Christian here: how does Ibis feel about focusing development on creating and maintaining connectors for many backends, compared to trying to accelerate adoption of Arrow, DataFusion, and Substrait, so that Ibis just maintains one connection and pushes the backend connector to the
17:29 DataFusion or Substrait level? Does that make sense? So basically, I feel the question is, you know, it's
17:39 another layer for you to maintain, so what pros and cons do you see today in that? Yep, maintaining many backends is definitely a big technical challenge. There are a few things that make this easier. One is these kinds of standards. So Apache Arrow, as a memory
18:00 format standard: we can typically pass data between the backend and the client, or even between backends, through Apache Arrow. Less mature, but
18:12 promising for the future, is the Arrow Database Connectivity, ADBC.
18:19 Yeah, ADBC. So if every backend supports that in the future, then Ibis can just connect through ADBC and have these standard connections. Right now it uses the DB-APIs for the most part, for most of these things. So we want to kind of,
18:36 as Voltron Data and also as the Ibis project, increase the adoption of these standards that make Ibis's life easier, and ideally everyone's life easier. Substrait is another good one. Substrait would, not necessarily replace SQL, but act as an intermediate representation, so that Ibis or SQL or
18:58 any other of these frameworks could compile down to Substrait and send that to the backend, and that would help alleviate some of the SQL dialect management that Ibis currently has to do. I mean, I'm familiar with Arrow, but I think it's important that we give some context for people who haven't heard of Arrow, because I feel
19:22 I've seen it starting to be everywhere now. So can you give a bit of context on what Arrow really is? Sure. So yeah, Apache Arrow is another open-source project, under the governance of Apache. It was co-created by Wes McKinney as well as other people. Many people at Voltron Data work
19:43 on it, and basically it's a standard, a specification, for in-memory data. So
19:51 it's what DuckDB uses, what Polars uses, what Velox uses, what Theseus uses, what a lot of engines have started to use as their way of representing data that's held in memory. So it's, you know, how do you represent an integer, or an unsigned 64-bit integer,
20:11 floats, decimals, strings, all those things. And back in the day, when this was getting created, every system kind of had its own way; they were all very similar, for performance reasons, but not quite the same. And yeah, really in the past few years Apache Arrow has blown up. It's everywhere. PyArrow is, I think, one of the
20:33 50 most downloaded Python packages these days. Wow. Which is quite impressive. Yeah, it's used pretty much everywhere. So it's not really something that most end users of data tools need to worry about, but if you're developing a database or, you know, looking at data formats, it is something you
20:55 definitely need to be aware of. Yeah, no, that's a pretty good summary. I think, if people are comfortable with DuckDB, that's also what enables you to convert back and forth between, you know, a DuckDB table and a pandas dataframe or Polars dataframe without too much cost, with almost no
21:19 overhead on the compute. And just to give you an example, I'm not sure what the status is, because I'm a bit rusty on Spark, but for a couple of years you could also convert a PySpark dataframe to pandas, but there was this work of mapping internally to a pandas dataframe, which I
21:40 believe is not happening right now; everything goes through Arrow, so it's much faster and more efficient. So as you said, it's really great for the community to adopt those standards. I'm really glad that a company like yours, Voltron Data, is supporting that, because it makes every single end user's life easier if they want to switch to a
22:01 different engine or a different dataframe, but it also lets vendors or new libraries integrate more easily. I guess it would have been much more of a challenge for you to integrate so many backends without Arrow, right? Yep, Arrow definitely plays a big part. SQLGlot, which is an open
22:23 source library for cross-SQL-dialect translation, it's a lot of words, but it makes our job a lot easier, managing all these different SQL backends. And yeah, also just a lot of hard work from the engineers over the years, stringing all this together and making it all work. Yeah, this is
22:48 the project you were mentioning, right? SQLGlot, which is a SQL parser and transpiler, because, as we just mentioned, there is a SQL standard but a lot of databases don't respect it, mostly because they build, you know, nice functions. I'm using them a lot in DuckDB, because it just makes
23:13 your life as a developer easier, but the con of that is that they don't share the same functions with other databases. Let's dive in a bit, because time is running, but I want to discuss a bit around,
23:34 of course, DuckDB, and why actually
23:39 did you choose DuckDB as the default backend for Ibis? Yep. I will
23:51 also Slack you a blog post on this. So
23:56 Phillip Cloud, the lead developer, wrote up a nice blog post on this a few months back, but essentially, at some point Ibis supported many different backends, including DuckDB, but did not have a default, and that added an extra layer of difficulty to
24:17 getting started with Ibis. Out of the box you had to create a connection, you had to think about what backend you're using, and until you did that you couldn't just read CSV or Parquet files and do all the things you want to do. So DuckDB kind of came onto the
24:32 scene as this embeddable OLAP query engine that's very fast, very simple, very easy to use and pip install. A backend got added for it in Ibis, and the team, and Phillip I think, at the time decided to make it the default, which just makes it that much easier to get started. So once that was the
24:52 default, you could basically use Ibis like pandas or any other Python dataframe library without even really thinking about what backend you're using, and it'll just be fast and efficient by default, using DuckDB. Yeah, this is
25:10 the blog. Yeah, sorry. I think, yeah, you
25:17 have those distributed compute frameworks, but it's a bit hard to set them up, right, for someone to get up to speed when discovering the tool. So that's what you mentioned, and here basically pandas has been there for a while, and SQLite, but SQLite
25:38 is not suitable for analytics. And the fun fact is that, so you are embracing DuckDB as a default backend in your product, but there are also actually a lot of products now starting to use DuckDB for their getting started, because otherwise their
25:59 getting started relies on a database server or a cloud data warehouse. Namely, we were discussing dbt: their getting started uses DuckDB, because you don't need to set up anything in the cloud to get started with dbt. And there are other tools, whether it be data quality or whatever, if you want to have
26:19 a quick getting-started tutorial. It's true that you just do a pip install and you're ready to go. And so, regarding DuckDB and
26:34 Ibis, I've also seen something else regarding benchmarks. You have another blog; let me just quickly open this
26:49 one, where I see you've done some
26:54 performance comparison here, with Ibis plus
26:59 something else. Can you give your thoughts about this? I think these are also really interesting,
27:10 those two lines, for me. Yeah, so this
27:14 was a workload that Phillip Cloud was analyzing, I think PyPI data, and he was doing a lot of string processing in here. The code should be somewhere in this blog. And yeah, he found that for this workload in particular it was like 10 times faster running on DuckDB. I
27:35 wouldn't necessarily take away, of course, that DuckDB is 10 times faster than Polars. It was 10 times faster for this specific workload. And part of what this demonstrates is that, number one, it's very hard to do benchmarks. They're very workload-dependent, and, you know, you can run TPC-H queries, and there are standard things that
27:58 people do. They're fairly easy to, I
28:02 don't want to say game, but make your tool look better. So you really need to analyze it yourself and see what works for you. And that's also one of the things Ibis makes easy: if you want to run a benchmark on DuckDB and Polars and DataFusion and whatever else, you can just write your code once, switch
28:21 your backend connection, and see what works best for you. But yeah, this blog also led to, I think, quite a few performance reports. I know an issue was opened in DataFusion, and I think one was opened in Polars for this, so I'm guessing if we reran it, it would probably be faster for both of those as
28:39 well. But yeah, it just demonstrated that, at the time, for what Phillip was trying to do, this workload ran nice and smooth on DuckDB, with much worse performance on the two other backends that he tried out. So you're taking a lot of precautions regarding benchmarks. It's funny, because we talked
29:01during the uh The Landing Cod uh you
29:05 know, you mentioned TPC-H, which is a
29:13 benchmark that, like you said, you can game, so do your own research, and it depends on the workload. But all of that said, I think what is even more interesting for me, to an extent, is that you actually don't have any overhead, apparently, compared to native Polars. Where would you see a
29:41 specific overhead from Ibis? Yeah, it actually shows that it's faster, which I'm guessing is just due to variation, but basically, for any real, larger data size you're running on, you're going to see effectively no overhead from Ibis. It's not zero, of course, right? You have to load things in Python,
30:05 and Ibis does a little bit of computation, compiling your expressions, but for most data workloads, most of the time is really spent on transforming the data, running that SQL code, or Polars dataframe code in their case. So yeah, Ibis, with very minimal overhead, basically allows you to interface with the backend without taking any
30:28 sort of performance hit. There are of course edge cases and bugs that come up, but we're pretty good about fixing those if they get reported to us. Yeah, oh, that's great.
30:42 Where do you see the usage patterns popping up when using Ibis today? What's the common case when people say, "I already have my data stack"? What's the common thing that you see repeating among your current users? Yeah, we see a few things. One is just the basic kind of
31:09 analytics or data science case where you have to work with more than one backend. So, for an organization whose data scientists are using pandas locally, and then you're trying to rewrite that into PySpark or SQL to deploy it, that's a pretty good use case. We're seeing some more adoption from library
31:29 developers, like the Google BigFrames that we saw. If you wanted to build a pandas clone for your database today, I think that with the work they've done there in open source, and just in general using Ibis, you're probably going to have a better time than trying to start from zero and
31:47 doing it all yourself. So we see those kinds of library developers. And then we also see people who like to mix and match SQL. There was an interesting quote from a user that was like, "I like that I can just run my SQL strings through Ibis and use that when I'm familiar with SQL, but if I
32:08 want to do something more complicated, or something that's better expressed in dataframe syntax, I can just use
32:15 the native Ibis dataframe code." So yeah, a few different use cases, but those are some of them. Yeah, no, it's really nice to have this. As you mentioned, I hadn't thought about those databases that want to have a dataframe interface, but that makes sense. I mean, BigQuery has had
32:38 that for years now, and I think as soon as you want to have more like ETL pipelines, a lot of people are using Python, and so a dataframe approach is easier. But this is something we haven't touched on, and after that we might go through the getting started, but
33:01 I'm curious what your view is. We mentioned dataframes versus SQL regarding adoption, but other than that,
33:14 why should someone use a dataframe approach for a data pipeline instead of SQL? I would personally recommend it all of the time. I'm not a huge fan of SQL, but there are of course pros and cons, and a lot of it will come down to, I think, just individual preference. What dataframes give you,
33:34 though, and really Python, is, you know, we kind of talked about dbt earlier lowering that bar to following software engineering best practices. Python kind of gives you that on another level, where you can just test your code, or unit test your code, with pytest. You can leverage a lot of the
33:51 frameworks that are out there already. So if you're trying to do something very complex or more repetitive, where you need to use for loops, you need to use these kinds of general programming constructs, that's I think where Python and a dataframe library is quite useful over just plain SQL. Yeah, so SQL is a structured query language;
34:14 it is not a general programming language, and as you mentioned, there are limitations to the language. And also all the testing: I think, with the adoption of SQL, and that's one of the focuses of dbt these days, by the way, dbt tests, people tend to skip tests because it's actually
34:37 harder, I believe, to write tests with plain SQL if you don't have a framework around it, like dbt for example. But yeah, exactly, as you mentioned, if you're using Python or another programming language for the dataframe approach, you can basically reuse your testing library, like pytest,
35:01 and that makes things easier, I feel. Some people would also argue readability. I like it; I think it's just because it reminds me of the times of Spark in Scala, but, you know, I'm just in the same blog, see this for example. So chaining those operations,
35:27 I feel it's easier than in SQL, but again, I think that's mostly a matter of
35:38 preference. But then again, if you are in an IDE, you also have autocomplete on specific functions and other information on the given method, which is harder, I'm not sure it's even possible, in SQL. You may have your AI friend, Copilot for example, but it's just a suggestion. But yeah, that's
36:03 definitely easier testing and just the workflow, but it's also a matter of
36:10 preference. And with that, I suggest, we have 20 minutes, that we go into getting
36:18 started. So I actually have a notebook here, a simple notebook, where I'm going to just
36:30 run the installation
36:35 again, because it's been a while. So you install Ibis like this, and I guess these are the optional extras for the backends and other things? Yeah, so there are a few options. Generally all the backends are their own option, so DuckDB there. "examples" allows you to use the built-in example data, so that installs a
36:58 couple of libraries that you use for that, and there's also an option for Delta Lake tables, and probably a few other things I'm forgetting as well. Generally each backend is its own optional install. Yeah, cool. So
37:13 depending on the backend that you want. Here we'll use DuckDB, because it's easier to set up, and the
37:22point is like this is what we we saw earlier here we create a table um we pick uh an example from it's a
37:32data set penguin data set yeah yeah we like showing that one a lot and uh what is the fetch I think it's because it's it's a lazy yeah um so fetch would actually get you the Ibis table um from that penguin's data set um and then two Pi arrow is just um I'm not sure if that's always needed um but I tend to add onto
37:57my create table statements and that just ensures that it's uh it works um yeah and that will create the Penguins table in the Penguins do. dvb uh file Theon
38:09dis there so now if I look uh my file here I have a penguin. ddb and so this is a file a database file for ddb uh which is
38:22persisted uh because dctb run in memory so if you don't persist uh they that to any file CLE your data is L after the process um so here is basically just a persisting this uh I think I really like that in one uh line you get uh you know sample data set so great job on that I
38:44think it's uh um to overlook sometimes when you have a getting started they show some data I think it's nice to have proper data set to to just start it so congrats to the team on that um then uh
38:58here we are listing the tables which are in there, right? Yeah, so you get the penguins table, which is the table on disk, and then the PyArrow memtable is, I believe, a temp table, or maybe a view. So if you were to reconnect, I don't think you would see
39:20the memtable persisted. Yeah, and so
39:26here I'm just calling the table
39:30and storing it, and then here we are displaying it and converting to pandas. Yep, so all Ibis tables have this to_pandas method, to_pyarrow, and also to_polars, which was a recent addition, so you can convert to any of those three and do whatever you want with them. So to_polars like this? Yep, assuming
40:00you're on the latest version. Are you on 9.0? Yeah,
40:09let's try it, why not. Ah, there is something missing on the backend, on the option. It should be there, although you would need to install Polars as well, but the method should be there regardless of whether Polars is
40:34installed. Okay, so basically here, well,
40:39we'll look at that, if we have to install it, but so
40:43pandas is definitely built in. pandas
40:47should definitely be there, yeah. And
40:52there, can you talk to us about that? What is it all about? Yep, so Ibis is lazy by default. If you run expressions, like if you were to run head up there without the to_pandas, nothing would happen, and you can chain your expressions and do all that stuff. Interactive mode basically kind
41:14of eagerly gets you the output of the expressions that you're running. So typically you'll turn on interactive, and yeah, you'll see stuff like this, which is representing what you're going to do with the data. Typically you'll turn on interactive mode to do EDA or any kind of notebook or local development, and then turn it
41:33off if you're putting stuff into production. Yeah, I want to insist on that because I've seen it a couple of times, no matter the framework. A lazy data processing framework only computes when you request it, and especially in distributed compute, if you fetch the data, you basically kind of
41:56go against the framework's goal of distributing the data and fetch everything onto a single node just to display it, so you need to be careful with that. And exactly as you said, if you're in debug mode or exploring in a notebook, that's totally fine; that's what we do here, basically, with this option. And then if I
42:20run this option and I run it again, it gives me the frame, right? So basically
42:28this is not lazy anymore and it's fetching every time, so be careful, because, you
42:38know, with great power comes great responsibility. Yeah, and then here there is just a simple transformation. Can you tell us a bit about how you decided on the different typical transformation functions for the Ibis
43:05API? Yeah, I did not make these decisions, but it's generally inspired by pandas, of course. Like I mentioned, Wes McKinney created pandas and then created Ibis, so some of it definitely comes from that and from learnings about what didn't work well in pandas. Then over the years a lot of inspiration is actually taken
43:26from R and the dplyr library.
43:31We find that the R people actually have very good ideas, and we like to steal them and bring them into Python so that more people can enjoy them. So there's a relocate method which will shift which columns are shown at the beginning of your table. That is taken directly
43:51from dplyr. And then SQL as well, so kind of pandas and SQL and R are the inspiration. This we have
44:01here. That's really impressive that you give credit back to the R people, I don't hear that often. And it's true, history has shown that the dataframe concept was coming from R, and I think it's a really good summary, saying that we steal their ideas to put them in Python so that more
44:24people, yes. That's an interesting message.
44:28So this is basically just a simple transformation; you get the result here after. Now what I want to do is go to, maybe, your getting started
44:44guide. You have, we've done
44:48this one quickly, right? And you have three flavors, why is that? So these are intended for if you're coming from one of those other frameworks. The dplyr one was written by somebody who, I'm not going to be able to recall their name, but they know R and dplyr very well, and
45:12I will get made fun of for not remembering this, but that's a very good guide if you're coming from dplyr. The pandas guide was actually written by, I'm not going to remember his name either, but he contributed the SingleStore backend. I don't know who wrote the SQL one, but
45:29yeah, these are intended for if you're coming from those and you're trying to learn Ibis for the first time. This should be a good reference for you on how to think about filtering, sorting, limit and offset, joins, and how to map all this. And I'm actually in the process of
45:45working on a PySpark and a Polars guide here as well, so look out for those soon. Yeah, and is it possible to some
45:56extent to put in some SQL query, or is there actually a method to directly send a SQL query? Yep, if you go to, I think, our
46:10homepage, you'll see it on there. Yeah, if you click over there and scroll down a bit, I think it's toward the bottom. Yeah, right there. So you can actually get the SQL out of Ibis expressions, and there's a table.sql method. The .sql method is also on the backend connection. You can
46:35just put in SQL strings, either read them in from files or write them directly, or use Python string formatting or even Jinja, of course, to get your SQL string and just put it right into your table. And then it's cool that you can chain that directly into dataframe code as well. Here you're basically kind of
46:54mixing the things, yeah: a select and a count, and then the order_by, which is called on the API. How do
47:07you see people usually working with this when they have a SQL code base and go to dataframe code? Do they do this and then refactor step by step, or what exactly do you see in the usage? Because there is kind of a refactoring step, right? There is. There is a highly
47:35experimental way of actually parsing SQL directly into Ibis expressions, so you can take this SQL and get back code. That's not working super well and I don't think we have it documented anywhere, which is probably good. But in the future, yeah, hopefully it's something that we can kind of automatically do for people who want to
47:54refactor. But yeah, this is always an option if you have some gnarly SQL string. The other way that it gets used is basically as an escape hatch: if there's really something Ibis doesn't support, or that you just don't know how to express, you can always run this SQL if you want and
48:12get your Ibis table out and then switch back to Ibis code. In general, you're going to have to refactor your code at some point, and the promise of Ibis is that it should be the last refactor you have to do, because it'll just work on every backend that you need. We have in the chat DP man, I hope I'm
48:36pronouncing that right, from Voltron, who mentioned that we do not have the latest version and I need to restart the session. So if I do an upgrade here, we should be able to do to_polars, which was part of the
48:55latest release, 9.0. Yep, the 9.0
49:01release. So thank you for that. I just put it there and we'll try it later. But to come back to, yeah, so this is great,
49:15and I'm curious, please ping me when you have something like that to turn your SQL code into dataframe code,
49:24that would be neat. I think this also makes it fairly easy to start refactoring step by step when coming from SQL.
49:35We have a couple of minutes left, and there is a bonus topic I want to talk about, which is
49:46Wasm and your, let me search for that for a moment,
49:55[Music] so your REPL in JupyterLite in the browser. Yep. So I'm running on Brave, so hopefully there won't be any bug, because those kinds of things on Brave don't always look good. Oh boy. So this is really heavily experimental, but I just want to kind of, you know,
50:24end with a bit of a dream. Can you give some context around Wasm
50:32and, you know, the Pyodide project, and how those
50:37came to life, basically having Python running in my browser? So this is not running on the server? Yep, this is running in your browser. I am far from a Wasm expert, but I will do my best. So WebAssembly is a way of basically running code in your browser. The Pyodide project is a project to run Python in your
51:00browser, and part of the Pyodide project is getting various packages working. So pandas is in here, Matplotlib is in here, what else? The biggest blocker for us was actually PyArrow in the browser. So we
51:16were really following along with a community member who was working to get that into Pyodide. It got added pretty recently, and that was the last thing keeping us from putting Ibis in the browser. So we did that, and yeah, it's highly experimental; this will crash on iOS. I'm very happy that it worked in your browser here. But yeah, it's
51:38just a really cool way to try out Ibis or other things for the first time. You don't need to have a Python environment, you don't need to pip install anything, you're just in here and you can play around with it. And ideally we want to put this side by side with the tutorial so
51:55that you can just copy and paste and follow along, and there's a bunch of cool stuff we can do with this. But yeah, so we've also been exploring exactly the same thing for DuckDB, and I think also having, you know, an
52:13easy way for people to get started. I mean, notebooks are nice, but you're still relying on the server, and we are not leveraging our expensive MacBooks enough, at least me, right? So this is really a nice thing that we start to see more of, and this is Python, so we could, you know, with the
52:34things growing, even train a model locally with just a URL and have your full environment ready, and this is just leveraging your local computer. And if you want to see other
52:50Wasm use cases, we have some on our YouTube channel, we have some type
53:03PL that train too so see that
53:12more. Now the project that you did, an end-to-end
53:16project regarding analytics of Ibis. So this is
53:22the app, and I'll share the
53:28repo in the chat so that
53:32you have it if you want to have a look. So let's look first at the app. Can you tell us a bit about what it is exactly? Yep, so this is a Streamlit
53:47dashboard. All of the ETL and code
53:52that's running here on the dashboard is basically Ibis code, and I, as the product manager for the Ibis team, kind of use this to keep an eye on a lot of our vanity metrics: how many stars do we have, how many downloads do we have, how many people are visiting the docs, how many backends do we have.
54:10Yeah, all this is running. There's basically a cron job that runs DuckDB on a server, all the data gets transformed and uploaded to MotherDuck, and this dashboard is running in real time off of MotherDuck. So if you change the number of days there, it's active, so you can select
54:28some things and the code will rerun and query stuff on MotherDuck through DuckDB and Ibis and update the dashboard. So yeah, this was a cool end-to-end project for me to learn a bunch of different tools, and yeah, it runs pretty
54:46well. That's pretty nice, I need to build the same. So just to say, when I'm running something here, what's happening is it's actually running a query against MotherDuck and getting it back. It's actually pretty fast. Yeah. We are looking to do a
55:07bunch of things; some of it's going to be a little boring. So basically working to stabilize the backend APIs for Ibis, really all the APIs, so that we can have a better foundation for more external contributors. Like I mentioned, the Ibis project is not only owned by Voltron Data, it's a
55:28self-governed open-source project, and we would love to be in a place where database vendors and whoever choose Ibis as their dataframe interface. So working to make the APIs stable, to where a company can just come in and maintain their own backend in their own repo, is going to be a
55:47great next step. But otherwise we're quite community-driven, so when people open issues or feature requests or anything else, we tend to address them pretty quickly. We're also working on IbisML, a framework which is for ML preprocessing of data. So, you know, pandas is kind of what people use, but
56:10what if you could just use DuckDB or Polars or DataFusion or Snowflake and get your data ready to be fed into machine learning models? Yeah, a lot's coming down the pipe. We're pretty active on our blog, which is a good place to follow along with new things. Yeah, exciting future ahead
56:30for the Ibis
56:35[Music] project this pragmatic and Technical so please keep it up and yeah I'm curious for the or anything people TS to be really
56:52depend for for LL pipeline so that that would be great. Thank you again, Cody, and I'll see you around online, because you're traveling. Yes, I think we're a little far away from each other, but yeah, thanks for having me on, and great to do this with you. [Music]