DuckDB & Mosaic: Interactive Insights on Large Datasets
2024/03/15 Featuring: Quack and Code with Dominik Moritz (Carnegie Mellon University / Apple) & Jeffrey Heer (University of Washington), academics researching and developing data visualization tools used by thousands of people around the world!
Transcript
4:29 Okay.
4:59 Hello everybody, how is it going? I'm super happy to have you here. We already have 120 people on
5:09 LinkedIn, so say hi in the comments and tell me where you are coming from. I'm Mehdi, a data engineer and developer advocate at MotherDuck, and I'm based in Berlin, so Europe, but I believe the two people here, at least one confirmed already, are currently based in the US. So yeah, we're going to talk about
5:35 data visualization, and data visualization is a
5:40 topic in data which is a bit tricky. I feel that there are two worlds. There is basically the BI world, where you have those dashboarding tools, Tableau, Power BI, whatsoever, and then you have those library-based, more coding-oriented data visualization tools. There is a bridge appearing now, with "BI as code", but it's still
6:04 two different worlds, and today we're going to talk more about the latter one: data visualization libraries that help you build, basically, data apps with dashboards and visualizations. I'm super happy to have the two authors of an amazing project that we're going to discover today, which is called Mosaic. We have Dominik and Jeffrey, who are right
6:31 there. Let me bring them on stage. Dominik, Jeff, how are you doing? Good, thanks for having us. Yeah, doing great. So Dominik, you're also in the US, right? Yes, I'm in Pittsburgh, but I'm from Germany. I grew up in Hamburg and just
6:54 west of Berlin for most of my life. Wow, oh okay, so quite close to where I am. How is your German
7:07 now? That's good. Yeah, I think it's like that with time. For me it's French, because I actually come from Belgium. It's not that you get better at English, you get worse at the two languages, I feel. It's a bit like you just lose what you acquired when you
7:31 were a kid. We have a couple of people in the chat, as you can see. We have people from Boston, also
7:40 California, so yeah, a few of
7:44 them. So yeah, let's start
7:48 with a simple introduction. Jeffrey, do
7:53 you want to start and talk a bit about your background and what you're doing? Sure. So my day job is as a professor of computer science. I currently work at the University of Washington in Seattle. Prior to that, I was a computer science faculty member at Stanford University for five years, and throughout that time I've been deeply
8:13 involved in a variety of data science projects, but especially systems and tools for creating data visualizations. Throughout the years I got the opportunity to work with a number of amazing students and collaborators. That includes folks like Mike Bostock, the creator of D3, and then later on really amazing students, now professors, like Dominik. We did lots
8:34 of work together and we continue to collaborate, which will be the topic of today's discussion on Mosaic. Oh, that's amazing. I was referring in the intro to the data viz library world, and the first thing I'm thinking of is D3.js. I think it was initially the first standard that came through,
8:55 if I'm correct, right? You're free to interrupt me, you're the experts, but in terms of timeline, I think it was one of the first libraries that popped up, no? There's a history of data visualization tools that goes back decades, but I think in terms of popular open-source stuff there's a number of things
9:13 like in Java and Flash, and then I think first Protovis and then D3 were really well timed with the rise of JavaScript and the graphics capabilities of the browser. But it's alongside a number of other libraries that also came out around the same time. Before diving in already, I
9:32 have so many questions already in my mind. For you, Dominik, would you like to introduce yourself, and how did you come into the data world? Yeah, so I did my PhD at the University of Washington with Jeff, and learned from him, and together with him, about data visualization tools. I developed Vega-Lite during my PhD
9:55 together with my collaborator Ham, and as of 2020 I'm faculty at Carnegie
10:02 Mellon University, where I continue doing research on data visualization tools
10:09 and also more on accessibility, so figuring out ways to make visualizations, which are traditionally impossible for people to understand when they're blind, for instance, still possible to understand. In my other job I'm also managing a visualization group at Apple, but today we're talking about the work that we're doing at
10:34 CMU. Yeah, you're a busy person, I get that. Again, thank you for your time here, it's an amazing journey. I had a question
10:47 popping up. You were mentioning Vega-Lite. Can you introduce the Vega project a bit, for people who are not familiar with it? Yeah, maybe Jeff, there's I guess a history. Jeff, you can go ahead. Yeah, for sure. So Vega actually started, I think, back in 2012,
11:14 and initially the original idea was just to simplify prototyping for D3 visualizations, having a declarative language, in this case using JavaScript Object Notation, or JSON, as a file format for describing visualizations. Then as it went deeper it sort of took on its own life. It actually uses a number of libraries, but
11:34 particularly D3, underneath the hood. We built it out as an open source project, and then it took on a new life a couple of years later when PhD students got involved, including our collaborator Arvind, who's also a co-creator of Vega-Lite and used it in his PhD thesis to explore how to add interaction. So instead of just writing event handlers,
11:53 like event callbacks, how do you actually have a higher level of abstraction for talking about interaction techniques? We were able to create a pretty wide range of visualizations within this declarative format in Vega, but it can also get very unwieldy and verbose, so we also wanted to have more targeted languages with a much
12:10 smaller surface area, so that they're easier to write for the really common types of charts that you want to make. That was the impetus for Vega-Lite, which is sort of a smaller language that compiles to Vega. I'll hand it to Dominik to say more about what Vega-Lite brings to the table.
12:27 Vega-Lite actually originally started as a project for another project called Voyager, where, together with Ham, we were looking at visualization recommendation. The idea was that when you're doing data exploration, rather than having to make every single chart and define all the details of it, what if you could just
12:48 browse a gallery of visualizations that you can maybe steer a little bit in the direction that you want to explore, while all the tedious work of specifying the visualization is taken care of by a recommendation engine? And
13:03 to build this we needed a library to create visualizations, something that we could programmatically generate, especially in a browser, and there wasn't anything that fit the bill. So together with Jeff and Arvind and Ham, we created Vega-Lite as a high-level language that then compiles to Vega and can render charts. And
13:28 then with Vega-Lite we noticed that it is useful by itself, not just as a piece of Voyager, and continued to work on it. With Arvind we also added interactions to it, so the idea was that we could, again through high-level concepts, support making interactive charts. And yeah, that was
13:52 almost 10 years ago that we started Vega-Lite, and it's...
14:00 It sure does. So, just to be clear, you mentioned that your
14:11 abstraction for declaring things is a
14:15 specification file. It's JSON or YAML, I believe, correct me if I'm wrong, either works? It's JSON, but you can always parse YAML to JSON. Yeah, I'm just asking because those are the two common standards and I was just figuring out which one it is, but I know JSON because I used it a long time ago, so,
14:37 Vega-Lite. And so what did you see in terms of usage? Because it's still a library. Is it mostly a library destined for JavaScript users, or... because people
14:53 can still use it in their notebooks, right, in Python, I believe, in a widget, is that correct? Yeah, so Vega-Lite is a compiler that takes a Vega-Lite specification and compiles it to a Vega specification, and then the Vega library can parse it, do some transformations on it, and render it into a web canvas or SVG. We also have
15:16 some experimental WebGPU rendering, and there could be other ones in the future. One of the cool things in all of
15:25 the stack is that because it's declarative, and it's designed for programmatic generation, you can actually generate the specifications from other programming languages. For instance, in Python there's an API called Altair that has become quite popular, where you can programmatically build up a specification that then
15:53 is a dictionary that can be represented as a JSON object, and that can then be rendered by Vega-Lite, Vega, and so on. So because of this declarative design, you don't have to call specific functions from a programming language or from a particular standard library, but you can generate these specifications from basically any programming language.
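To make the point about programmatic generation concrete, here is a minimal Python sketch that is not taken from the talk. It builds a Vega-Lite bar chart specification as a plain dictionary and serializes it to JSON with the standard library only. The helper name `bar_chart_spec`, the field names, and the sample rows are made up for illustration; Altair offers a far richer API for the same job.

```python
import json


def bar_chart_spec(rows, x_field, y_field):
    """Build a Vega-Lite bar chart specification as a plain dict (hypothetical helper)."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},  # inline data, one dict per row
        "mark": "bar",
        "encoding": {
            "x": {"field": x_field, "type": "nominal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


# Made-up sample rows, only there to give the spec something to describe.
rows = [
    {"category": "A", "amount": 28},
    {"category": "B", "amount": 55},
    {"category": "C", "amount": 43},
]

spec = bar_chart_spec(rows, "category", "amount")

# The spec is just data: it can be serialized, stored, and loaded again
# without executing any code.
spec_json = json.dumps(spec, indent=2)
restored = json.loads(spec_json)

print(spec_json)
```

The resulting JSON text could be handed to a Vega-Lite renderer or stored as-is. Because it is plain data, loading it back only needs a JSON parser.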
16:17 And so that design made it very easy to build different wrappers, like Altair in Python. Jeff actually made one, the Vega-Lite API in JavaScript. There are other ones in Julia, in R, and others. I would say, yeah, a variety of languages. I think Scala, somewhere. Oh okay, and which one is predominant? Do you
16:41 know roughly what the distribution is over there? I think there are two main ones. One is just using it directly in JavaScript, either in a JavaScript environment like an Observable notebook or elsewhere for exploratory analysis, but also in a lot of web development, if you want to add charts to a website, if you want to do
17:01 visualization or analytics apps, reporting. Lots of software engineers are using it directly in JavaScript. But beyond that, I think by far the biggest user base is the Altair user base, so folks using the Altair API in Python to generate Vega-Lite visualizations for a variety of reasons, typically exploratory analysis, but also
17:20 to generate plots to share results and so on. Yeah, and at the same time the original goal of Vega-Lite as a target for programmatic generation I think still holds, in that these declarative specifications can also be an exchange format. So if some application wants to support visualizations but also wants
17:41 users to provide their own visualizations, then Vega and Vega-Lite are popular targets, because somebody could just put in the JSON specification, or YAML, or something that is generated maybe through a UI builder, and you can store the visualization itself in your database because it's declarative, which would be much harder with code, because running, I guess,
18:04 eval on code is usually something you want to avoid, but you can do that with declarative specifications. That's definitely interesting, and I've been losing some hair just trying to make graphs work with pure code rather than declaratively. I think we should probably have more opinionated types of graphs. Do
18:30 you see... I have a question regarding this, and just a small parenthesis because you mentioned Observable, Jeffrey. For people who don't know it, I'll put the link in the comments. It's basically a notebook service to run JavaScript, so it's usually the place where data viz nerds go,
18:54 because you can use basically any JavaScript data viz library and kind of tell a story around it. So yeah,
19:03 my question was regarding the trends around those charts.
19:10 Do you see something going beyond just a simple specification happening in the data viz world? Just to give you an example, if you're not really connecting the dots: I see some BI tools basically
19:28 directly saying, this is my dataset, and this is the graph that you should look at to get insights from it, right? So I'm not even picking. By default there are already a couple of graphs being displayed based on just the dataset. What do you think about those trends? I'd say we might even be
19:52 partially responsible for those trends. The Voyager project that Dominik was describing earlier was the start of another, related line of research work around how we not just make tools for expressing visualizations, but also systems for reasoning about visualization design and providing guidance. That can include, you know, given a dataset and some other
20:14 metadata, generate one or more charts that might be reasonable for it, or even, given a partial specification, complete it. These are some things that people are of course doing with LLMs now, with varying degrees of success. We've done more of a classic AI kind of structured knowledge representation system, where we actually take as
20:35 input experimental data from people doing tasks with charts, and use it to try and learn weights within a system to make design trade-offs. This was actually a big piece of Dominik's thesis, so maybe I should let him describe it. But the idea is that there are various ways to either provide critique
20:51 or assist in creating visualizations, and there are still lots of interesting tools coming out, and it's still a very active research area within visualization research. Yeah, Dominik, do you want to comment on that? I think Jeff described it very well. One thing to add: also with LLMs now, I think declarative
21:13 languages like Vega-Lite are also a very interesting target for generation, again because these declarative visualizations describe what you want, not exactly how it's being evaluated or executed. That makes them more conducive to automatic reasoning and generation, also from LLMs, and you can maybe trust the code that gets generated a
21:39 little bit more, because again you don't have to eval it. That's also one of the reasons why Vega was, for instance, adopted in Wikipedia as, I think, the only visualization language for interactive visualizations, so you can actually add Vega visualizations to Wikipedia. Oh, that I didn't know, that's pretty cool. Since when
22:04 is it there? It's been there for a while, I guess. It's been there for a long time, yeah. I think they're still on an older version of Vega, too, but yeah, it's been there for years now. Can you give a bit of the timeline, and then we'll switch over to
22:21 Mosaic? When was the first release of Vega, do you have rough timelines? Sure. The first version of Vega, I don't remember the exact date, but it's either 2012 or 2013, and then we started to work on interactive Vega and then also Vega-Lite. Along the way we did
22:41 a complete rewrite of Vega starting with version three, and that must have been around the 2015 time period. And then the really exciting versions of Vega-Lite came out with support for multi-view visualizations, where you can build full dashboards, as well as a wide variety of interaction techniques,
23:02 for example panning, zooming, brushing and linking, so selecting elements in one chart and having them highlight or filter in another chart. That all released around the same time, I think really in the 2016, 2017 time period. Yeah, and so now we have those periods in this heritage, let's say. So what is Mosaic? Do
23:26 you want to do a quick intro? And I suggest we go over the documentation website and see what the core components are. Yeah, right. So the true story behind Mosaic is that, in addition to all the things we've already been talking about today, Dominik and I have also
23:45 been really interested in how we make visualizations more scalable. We've been working on techniques and thinking through possible architectures for a number of years, and then DuckDB came on the scene and was getting popular. I was getting interested in it, so I started kicking the tires and playing with it,
24:03 and then sort of lost track of time. Seeing the kind of query performance I was able to get, I mean, really quite impressive in a single-node server configuration, but also surprisingly good just in the browser alone, using technologies like WebAssembly, I kept
24:21 playing with that and then really came up with an idea. Visualizations have different needs: they take in data, they transform it to some degree, they render it, they provide interactions. But you also want to, if you can, offload as much of that computation as possible to a scalable backend. That's
24:40 something we thought about for many years, and something that we talked about in the context of Vega, but never really implemented in a way that worked in, I'd say, a truly scalable fashion. So really it's just thinking about how, instead, you take a visualization component and basically have it publish or broadcast what
24:59 data it needs, but do that in the form of a query. That way you can leverage a database to do a lot of that data processing. At the same time, you can have this central coordinator that can see all of the queries. It can also just wait a couple of milliseconds and gather a set of queries, so that if
25:16 they're all very related, you can condense them into a single query and get more efficient, or you can recognize some patterns in those queries and automatically build indexes to make interactions over massive datasets really fast. Those are the kinds of things that we wanted to do that we weren't yet able to do in
25:33 our tools, which then led to us building Mosaic. Just to wrap that up, it's really two things. Mosaic at its core is really an architecture, and the idea is that any visualization library should be able to play in this architecture. The goal of Mosaic is to provide access
25:51 to that database and to provide a way to share selections, so basically what are my filtering queries or my selected ranges, and share that across views to get that kind of linking. Then on top of that you can build components and libraries that use that. A lot of the examples we'll see today
26:07 use a library we built called vgplot, which builds on top of Observable Plot for rendering. But you could add your own custom components written in D3 or something else, and as long as you wrote them to work with the Mosaic infrastructure, you can then get linking between
26:24 multiple different visualization tools, even, for free. Yeah, so that's a lot of information to wrap up. But if I understand correctly, what was really, I would say,
26:41 innovative here is that you basically use a database in the layer to link different graphs, is that correct? Not exactly. We use the database to do the data processing and also to build indexes that can support certain types of linking, but it's the Mosaic architecture that's actually doing the linking. It has the
27:05 information of how the views relate, and then it coordinates, as intelligently as we can, the access to the database to support those visualizations and those interactive updates. Yeah, so just to wrap it up on
27:20 the database itself: you mentioned DuckDB, and what was really interesting is that you mentioned milliseconds, right? Because I feel like in the data visualization world, at least in the BI world, that was the other world, we are not used to milliseconds of
27:41 reaction time. So basically Mosaic is running DuckDB in the browser, right, is that correct? And that's using Wasm. Dominik, can you explain to people who are not familiar with the Wasm technology what enables you to have it run in the
28:06 browser? Yeah, so Wasm, or WebAssembly,
28:10 is a kind of binary format
28:16 for the browser, where you can take programs written in different programming languages and compile them to a binary format. That's as opposed to how you usually run code, which is with JavaScript. In JavaScript you have your programs as text, and then they get interpreted, optimized, and executed by a
28:39 JavaScript engine that comes with your browser. In WebAssembly you get a binary that has instructions which are somewhat similar to the machine code that you would have if you compiled a binary for your x86 architecture or your ARM architecture. But there's a special instruction set that was designed for it, that's a little
29:02 bit more high level, but not very much. What that means is you can take programs that are written in Rust or in C++ and compile them for WebAssembly, but the way they run is in a somewhat isolated context. So in order to communicate with the website, for instance the
29:23 thing that you see in the browser, or with JavaScript, you need to call specific APIs for WebAssembly. The reason why WebAssembly is awesome is because now you can take programs that were not written for the browser and run them in the browser, and also the performance can be
29:41 quite good, because you don't have the parsing and interpretation overhead that you typically have in JavaScript. Yeah. Do you want to add DuckDB? Go ahead with DuckDB then. Yeah, so DuckDB, probably
29:57 people on the call know DuckDB, but you can think of it as a SQLite, but for analytics. One of the
30:08 main selling points of DuckDB and SQLite is that it's an embedded database, an in-process database, meaning that you don't have to have a server that you call, but the database runs in the same process as your program. And also
30:23 the program, or the binary,
30:27 is quite compact, and that made it a good target to compile to WebAssembly, so that you're able to run it entirely in the browser. Originally we started this with a project from André, who was a PhD student at TU Munich. He started this experiment
30:52 to take DuckDB and compile it to WebAssembly, run it in a separate worker, and communicate between the WebAssembly context and the JavaScript context through an API that we designed. That then evolved into DuckDB-Wasm, which is the library that you can use today. Cool, that's pretty...
31:15 Well, I put you on the spot, and that's pretty well summarized, because it's not easy. And just to understand, because you mentioned we have a few people that understand what DuckDB is: you can actually go to shell.duckdb.org
31:32 if you want to experiment with, let's say, pure DuckDB-Wasm. And so what I
31:41 mean is, because sometimes it's not so clear for people who have less of a software engineering background, that DuckDB is running in your browser. The compute is happening in your browser, on the client side. There is nothing running on the server. If I shut down my internet
32:00 connection here, I can still run DuckDB. And so if you, for example,
32:08 generate, let's say, simple datasets, I just call dbgen, which generates some tables, and so now I have some tables here to query. So everything is happening in the browser, thanks
32:27 to DuckDB, and as you could see, it's
32:31 already quite a sizable dataset and everything was super fast, because there is no traffic going to any server. So now going back to
32:46 Mosaic, maybe what we could do is go to the main components. So here I see Core, SQL, I would say maybe
32:58 it's back... Could you walk me through the main parts? Sure. I think since you're on this page, maybe a good thing to do is first just bring up that chart that you can see and start mousing around. You see, this is showing stock data. That data is being loaded into DuckDB, and then we're running
33:16 it through a library that's creating the visualization. But then you can see, as you're moving your mouse around, it is selecting a particular date, and then the data is being renormalized. So basically what you're seeing is: had I invested on that day, what would my returns be? What's neat about this is not only are we pulling the data from the database,
33:35 but as you select that date, it has SQL expressions that are being re-evaluated. So on every little movement, every little frame, you're actually kicking off additional queries to the database that are recomputing what you see and then bringing that back into the browser. Or, I should say, it's
33:56 still, in this case, DuckDB in the browser, but it's going between the worker thread and what you're seeing on the screen here. And similarly here, go ahead and start selecting in that scatter plot. This is a Seattle weather dashboard that we first wrote for Vega-Lite, and here you can see
34:10 different types of weather. The color shows the weather type, you see the temperature in the chart above, and everything is richly linked. So here you've selected a date range, but now you can go down and, say, click any bars in the chart below, and that will filter to a specific type. You'll
34:26 notice it's linked to the legend: if you go up and toggle things in the legend on the right, you'll get that same behavior. Each one of these is similarly a visualization that's pulling its data from DuckDB, but then every interaction here is generating new queries that are determining either
34:44 the filtering or the highlighting state of that visualization. And because the database is so fast, we can do this very quickly, and we can also then, as we'll see later, scale this to much, much larger datasets. This is a great page for this, actually, so if you go down we can
35:01 just do two more quick examples. This is what's called an overview plus detail, where you have the overview of some time series and then you can zoom in based on filtering. As before, we select and we filter, but what's neat is that we can also apply other types of optimization techniques, in that this time series actually has 50,000 points
35:21 but we have less than a thousand pixels in which to draw it. So there are techniques that are smart about sampling the values within a pixel column, so that you're only sending a certain number of values per pixel, much, much less than the actual dataset, but in a way where perceptually
35:37 you shouldn't be able to tell the difference in how the visualization looks. The technique is known as M4, and there's a nice variant that we use that Dominik helped create. We automatically apply these query optimizations as part of the process of what we send to DuckDB. And then the last example, if we go down, is
35:57 cross-filtering. In this case it's flight delay data. This first chart is a histogram showing how early or late a flight was, and now if you select those late flights, you'll notice in the chart below, which is showing what time of day the plane left, that, maybe unsurprisingly, late planes are more
36:15 likely to leave later in the day, because if a plane gets delayed, it might have multiple routes that day, and so those delays are going to cascade throughout. What's neat here is that with every movement we are both re-filtering and re-aggregating the data. But we don't even have to do that over
36:32 the source data. By analyzing the setup here, by looking at the queries that are generated, we automatically create pre-aggregated data, not aggregated all the way to what you see on the screen, but aggregated just enough to support the interactions. So we actually get queries over smaller data summaries that are much faster, which means that, well, in this case it's only
36:53 200,000 data points. That's still enough to break a lot of web-based visualization tools, but we can go to 10 million, 100 million, even a billion data points, and the size of those summaries doesn't really change in comparison. So we're still able to support real-time querying over these charts, using DuckDB as the engine alongside these optimized indexes
37:14 to reduce the query load of what we're doing on any particular interactive update. Amazing. So now, we'll go a bit...
37:24 We'll do a walkthrough just afterwards, but what I'm hearing is that there is a lot of optimization being done on what you're sending to DuckDB, right? And from a user point of view, is this abstracted, or is this still the responsibility of
37:44 the user? It depends. If you are developing a new visualization component and there are optimizations that are specific to what you're doing, you can implement those. But a lot of this is being done transparently, and if you're an end user just using the visualization libraries that already exist, it's all transparent to you. You don't have
38:07 to worry about it. The only time you'll really notice it is that there are some things that we can't optimize, and your performance will get slow when that happens, and then you'll notice. But otherwise, as a user, you're not doing the fine-tuning or making decisions about what to optimize. That's being done by the system for you.
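The pre-aggregation idea described for the flight-delay cross-filter can be illustrated with a small Python sketch. This is an assumption-laden toy, not Mosaic's implementation: in the talk the summaries are created in DuckDB from the queries the views generate, whereas here a `Counter` stands in for the summary table, the data is random, and the bin sizes, function names, and the brush snapped to bin edges are all invented for illustration.

```python
import random
from collections import Counter

DELAY_LO, DELAY_STEP = -60, 5   # assumed: 5-minute delay bins starting at -60
HOUR_STEP = 1                   # assumed: 1-hour departure-time bins


def bin_index(value, lo, step):
    """Return the index of the fixed-width bin that contains value."""
    return int((value - lo) // step)


def build_summary(rows):
    """Pre-aggregate raw (delay, hour) rows into counts per (delay bin, hour bin)."""
    summary = Counter()
    for delay, hour in rows:
        key = (bin_index(delay, DELAY_LO, DELAY_STEP), bin_index(hour, 0, HOUR_STEP))
        summary[key] += 1
    return summary


def histogram_from_summary(summary, first_bin, last_bin):
    """Departure-hour histogram for a brush covering delay bins first_bin..last_bin."""
    hist = Counter()
    for (delay_bin, hour_bin), count in summary.items():
        if first_bin <= delay_bin <= last_bin:
            hist[hour_bin] += count
    return hist


def histogram_from_rows(rows, delay_min, delay_max):
    """Reference answer: filter and re-aggregate all raw rows (delay_min <= delay < delay_max)."""
    hist = Counter()
    for delay, hour in rows:
        if delay_min <= delay < delay_max:
            hist[bin_index(hour, 0, HOUR_STEP)] += 1
    return hist


# Synthetic stand-in for the flight data: random (delay in minutes, departure hour) rows.
random.seed(0)
rows = [(random.randrange(-60, 180), random.random() * 24) for _ in range(200_000)]

summary = build_summary(rows)  # built once, before any interaction

# A brush over delays from 30 to 90 minutes, snapped to bin edges.
first_bin = bin_index(30, DELAY_LO, DELAY_STEP)
last_bin = bin_index(90, DELAY_LO, DELAY_STEP) - 1
fast = histogram_from_summary(summary, first_bin, last_bin)
slow = histogram_from_rows(rows, 30, 90)

print(len(rows), "rows summarized into", len(summary), "cells")
print("same histogram:", fast == slow)
```

Each brush movement then touches at most 48 × 24 summary cells instead of every source row, and that cell count depends on the bin resolution rather than on the number of rows, which matches the observation that the summaries barely grow as the data does.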
38:26 Okay. So yeah, one thing though to maybe think about is, when you say user, is it the person actually looking at the chart and interacting with it, or is that a developer? That's a really good question. Yeah, that's the developer, I mean, building the chart. Yeah, because if you go maybe to the
38:46 Mosaic... well, I guess if you think of Mosaic as really the architecture, then that's this
38:54 Mosaic Core in this guide here. You can actually build custom clients, because Mosaic is this architecture that sits between the database and interactive clients, which could be a widget like a dropdown, or a table, or one of these charts. But you can also add your own, and then you have to implement an
39:16 interface, this client interface that Mosaic provides. Then, as a developer of a new client, you have to think about, well, what are the queries that I need, what is the data I need? Mosaic will optimize some of them transparently for you, but not all
39:34 of them. So for instance, the optimization that Jeff showed in the focus-and-detail chart with the two lines, that is something that the client itself implements. So in this case we have the viz library that has line and area marks, and they know when to apply this optimization. Similarly, if you were implementing your
39:56own client there are optimizations that you could choose to apply yourself and basically it's by what queries do you generate you might do that but a cross views for some of the linking including caching consolidation and some other types of indexing that's done in the core infrastructure and and can be done across um you know a variety of
40:16different types of clients so long as they generate queries that that coordinator at the center of it all um knows how to optimize that's uh that's great to you I think that's that's that's a pretty common problem actually when you start to give uh tools to developer to build chart they're not often like SQL expert or they're not familiar with the
40:36optimization that's possible with the target database right they're they're quering it like that's it's two different word and uh and at the end if you if it's if the query is not optimized then your your graph is slow for for multiple different reason that's sometime hard to debbag um but it's it's really it's really great that there is
40:56as I understand like a canvas that people can use and then they can always uh create their their custom client if they need to is that correct cool um regarding uh so we we talked about the core a lot um and time is flying um so maybe you can give like uh a quick words on maybe I can maybe
41:20can do a one sentence version for each of these exactly Mosaic SQL is just helper libraries for writing SQL queries and the big thing that they provide is that the core can analyze those queries without having to parse them because they're already in a structured representation yeah this is nice
41:40for people who don't like SQL in text strings in their code base I really like that it does remind me of the Scala vibe but it's more than that because it actually constructs an object that has all these different properties in it that's how Mosaic can then reason about it if you had a string we would have to parse it and construct
42:02an object to then reason about it so actually in Mosaic it's better if you use this structured generation rather than giving a string if you just give a string then the Mosaic coordinator will not know how to optimize it we'll send it to DuckDB but we won't optimize it in that case yeah at least not yet maybe we'll be able to do
42:21parsing in the future but currently we rely on having this structured representation for queries all right so that's interesting what else Mosaic inputs I guess that's for whatever input you can use these are yeah sliders dropdowns search boxes but ones that can be powered by the database so it
42:44could be just a standalone dropdown where you chose the options or it can query a column and then give the unique values in that column as the options so all of these can be populated by a database and obviously participate in these linked selections between the views all right vgplot is basically
43:04the graph render engine it's this
43:11one is based on Observable and actually we had a question it might be an opinionated question but wait I'm going to grab it why build Mosaic plot on top of Observable Plot instead of Vega yes exactly gotcha this one I think either could work I mean the idea of Mosaic again is
43:36primarily as an architecture and if you go to the Mosaic GitHub repo you'll actually find a proof of concept Vega example as well that Dominik wrote but why I used Plot in the beginning was mostly because as an author of Vega-Lite I wanted a chance to learn Observable Plot because it
43:54was something different so I thought it would help stretch my thinking and make sure we could do things right the other thing that was just really convenient about Plot is that unlike Vega which is designed to be interactive and has a lot of internals to support interactivity Plot is really designed to just take a
44:13specification in and produce SVG out and it's designed to do that really efficiently without a lot of middleware and since the way we were initially just testing and prototyping Mosaic it was really helpful to basically get something that just goes from data in to graphics out in a really tight loop
44:32and so it worked really well as sort of a plugin for prototyping a visualization language it was never really a strong commitment at the beginning but it's been working out quite well and so we've been able to do a lot with what initially was just supposed to be a proof of concept prototype language that again has
44:50actually proven rather robust so we've been quite happy with that cool and so then we have
45:00Mosaic spec that's basically using
45:04descriptive YAML or JSON to define a graph
45:08is that correct yeah and so vgplot is a JavaScript API and when you click the JavaScript tab you would see that but you can also describe Mosaic dashboards and visualizations similar to what you might do with Vega-Lite in a JSON or YAML specification that then gets parsed into calls to that underlying API and why this is really valuable is that
45:31you might prefer the declarative one to basically store it like a file format and be able to reason about it but also similar again to Vega-Lite you can use it across other languages and so when we start talking about how you use this in Python or Jupyter you actually generate a JSON specification as the description and then pass that
45:50over to a browser say running in an output cell of a notebook so one thing to think about here is if you think about Vega-Lite and the things we learned from it or from other projects a lot of that influenced the design and features of Mosaic so having a declarative specification we noticed is quite useful because other
46:12APIs can generate it and so that's why you see YAML and JSON support here somebody else also asked about the Falcon project yeah so do you want to answer this one I can answer it yeah so the Falcon project was a project to make cross filtering really really fast by
46:36pre-computing an index that supports all the interactions with one chart at a time and the way I implemented that was that either I would pre-compute these indices kind of manually as an aggregation over an Arrow data frame or actually using a SQL query and that would basically
47:01then build an index which is a multi-dimensional tensor a multidimensional array and then read out of that as somebody's interacting with one chart and that idea was what then Jeff took in
47:17Mosaic but rather than having these indices as a dense matrix it's actually a temporary table in DuckDB and so that provides a number of benefits namely it's sparse so it scales better to much larger data especially if there's sparsity and also he extended it to support
47:45categorical data the original Falcon implementation only supported continuous data because that's what we wanted to demonstrate we wanted to demonstrate that idea and the Falcon application demonstrates that but Falcon was never intended to be really used as a library we actually subsequently built a library called Falcon Vis or a student built that
48:06and that also supports some more so besides continuous data it also supports categorical data but then Mosaic also supports other aggregates
48:19the original Falcon implementation only supports count aggregates but then Mosaic extended it to support count and sum and averages and some others because you're relying on DuckDB and it's way easier to implement is that correct yeah we made some changes in how we represent the indices that make it easier to do a variety of queries and so
48:42then having them represented as DuckDB tables made it much easier to implement those queries and so to me the two biggest differences other than the stuff Dominik mentioned are one it does more than the original Falcon as Dominik described in terms of the
49:00different aggregates and types of data it can handle but the other is it gets applied automatically the previous one was like a library where you basically set it up like you knew you were using this particular optimization and everything was organized around it now instead you author visualizations and when this technique can be applied it gets applied
49:19automatically and so it really I think changes the burden on developers to make it much easier and much nicer to take advantage of some of these highly scalable interaction techniques for data vis and I remember Jeff actually at some point you said something like oh I built this other visualization I think it was the athletes
49:40dashboard and then all of a sudden it was way faster than you expected it to be because Mosaic created an index and you were like huh oh yeah yeah it wasn't that example but there were others where I didn't even think about it as an optimization opportunity but because it was automatic
49:57all of a sudden it recognized that the configuration fit and generated indexes and I was like oh that's nice I didn't even think of that so I surprised myself which is always a good sign as a tool builder when the tool does something that you haven't envisioned and does it well hopefully that's always a nice surprise to have
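The index idea discussed here (pre-aggregate once per brushed chart, store the result as a sparse temporary table, then answer each brush from that table) can be sketched roughly as follows. This is an illustration under stated assumptions, not Mosaic's or Falcon's actual implementation: Python's built-in `sqlite3` stands in for DuckDB so the sketch runs with only the standard library, and the table names (`flights`, `brush_index`), column names, sample rows, and bin widths are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (delay REAL, distance REAL)")
rows = [(5, 100), (7, 120), (12, 150), (18, 900), (14, 160),
        (25, 950), (27, 930), (31, 120), (38, 980), (33, 140)]
con.executemany("INSERT INTO flights VALUES (?, ?)", rows)

# Build the index once: one row per NON-EMPTY (active bin, passive bin) pair.
# delay is the brushed (active) dimension, binned in steps of 10;
# distance is the linked (passive) histogram, binned in steps of 500.
con.execute("""
    CREATE TEMP TABLE brush_index AS
    SELECT CAST(delay / 10 AS INTEGER)     AS delay_bin,
           CAST(distance / 500 AS INTEGER) AS dist_bin,
           COUNT(*)                        AS n
    FROM flights
    GROUP BY delay_bin, dist_bin
""")


def brushed_distance_histogram(lo_bin, hi_bin):
    """Distance histogram for rows whose delay bin lies in [lo_bin, hi_bin].

    Reads only the small index table, never the raw flights table.
    """
    cur = con.execute(
        "SELECT dist_bin, SUM(n) FROM brush_index "
        "WHERE delay_bin BETWEEN ? AND ? "
        "GROUP BY dist_bin ORDER BY dist_bin",
        (lo_bin, hi_bin),
    )
    return dict(cur.fetchall())


index_rows = con.execute("SELECT COUNT(*) FROM brush_index").fetchone()[0]
print(index_rows, brushed_distance_histogram(1, 2))
```

Because empty bin combinations simply have no row, the table stays sparse, and adding another aggregate such as a sum would only mean another column in the same `GROUP BY`.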
50:16no that's great I think regarding speed and what we saw so far in the examples is that you basically get addicted to that I would say reactivity right that you don't have in some and it's a bit like when you get a new phone or a
50:36new laptop it's always super fast and then two or three years later it's super slow well the problem is maybe sometimes between the chair and the keyboard right but aside from that it's also just that your brain got used to that speed so the bottom line I feel is that if you start to use
50:55Mosaic it's difficult to go back to other kinds of architecture that work differently because you don't have the same interactivity speed we also had another side question is Mosaic currently a research project or do they plan to develop it further yes and yes so it's actively maintained and developed but it's also just like Vega and Vega-Lite have
51:20been out there and used by companies and others while also being a platform for us to do visualization research I think we've long looked at our projects as both and I would say there are some rough edges still I would say
51:39this is at best still beta software but I think very promising beta software and so we certainly welcome contributions whether it's just bug reports or we've had others dive in and submit pull requests etc very welcoming we'd love to have you join our developer community cool we have
52:02a few minutes to go over a working demo example just to show that you can also run it locally if you want and if we have time we'll go to the Jupyter widget let me grab
52:16the demo for a minute just to hide
52:20sensitive data you have a deck P while I'm grabbing this and I'm going to see so the repository is actually
52:33[Music] um up missing is your
52:40thing available on GitHub so if
52:45people watching live want to follow along I'm just grabbing the link here
52:52because I didn't prepare it wasm client
53:00so I'll put it on the LinkedIn chat
53:04demo we're going to run a Mosaic
53:11example with MotherDuck and the thing is that we talked about DuckDB-Wasm and basically as DuckDB-Wasm can communicate with MotherDuck we can pull the data directly from MotherDuck then basically all the execution and the rest of the magic that we talked about for almost an
53:32hour is happening on the client side
53:37so let me show maybe I'll pin also
53:41the link here for people on YouTube just for a while and we can go so this is
53:54just a Node.js project so if you go to
53:58that repo link and just go to Mosaic integration you can work from there it's a really simple app and that app is actually also available if you go to the README
54:17which is over there we have a live
54:21demo here which is hosted what you need here is to pass your MotherDuck token how do you get your MotherDuck token you just go to MotherDuck you can sign up for free and you have it basically on your settings page and you can copy your token here so instead of running the
54:43the the live day just want to run it locally and do a work through so we have here a couple of visualization and so this is maybe you can help me true um this is what we talked before we have the VIS plot library right and here we are calling uh specific data set we that's the temporary table that you were mentioning
55:10to load the data yep so it doesn't have to be a temporary table but since it's usually only session based it can be this is then loading in um earthquakes data um it looks like this is even an older version of of mosaics you'll see there's a bunch of stuff we probably don't need to talk about um that are using the Topo
55:30Json library in the client which processes Geographic data the good news is that um duck DB spatial extension um is out and that can actually process um Geographic data types and so now we actually do all of that in the database as well so some of the boiler plate you see here to like get you know countries
55:48or states or counties to show up on a map you can actually do purely through duct DB and Mosaic without any um addition libraries as you might see in this example yeah and so here we have a the defination uh of uh of the graph so we have another one uh with uh the from
56:10the flights uh delay uh data set and and
56:15those thing basically as we saw earlier we can Define it also either through gson or yamal if you want to prefer the declarative way um cool uh actually I have it already running uh but just so you know you just do uh an npm uh install if you want to play around for people following up let me just remove
56:36that command now that you're
56:40seeing it and then you can do an
56:45npm run dev and open it here I'm just going
56:50to put it over there and so once
56:55you're there because we are relying on the data set that is on MotherDuck you pass your MotherDuck token I'm just going to copy to my clipboard my MotherDuck
57:07token and then it's going to
57:15happen so let me bring you
57:20back so I'm connecting and now basically of course it's not working because it's a demo so I'm going to just double check my MotherDuck token one minute because it's probably I know I did I actually played with it last night just out of curiosity I built just completely from scratch a Mosaic app with MotherDuck and it
57:43was basically Mosaic and I just had to build a custom query connector and it was
57:51just in an hour or so I had the app working so it was pretty cool yeah so here we go we have
58:04the different data sets and so for example if I take this one so this is already 10 million rows right
58:13so it's quite extensive so it takes just a bit of time sometimes to load it from MotherDuck but once it's loaded then you have basically the interactivity we saw on the website right and you
58:30have multiple views that we saw up
58:35here which is basically the definition so those are the different data visualizations we have Gaia stars the NYPD complaints the Seattle weather that's basically just stolen from what we
58:50saw earlier on your website so in
58:54the Seattle weather here it's the same so the difference here is that it's coming from MotherDuck and you can also create shares so the data set is basically shared with others to build visualizations but you get that same snappy thing yeah there's a couple of things to highlight here so I think by the way
59:17I think the original example was I think Jake made it Jake VanderPlas made it for Altair and we took it into Vega-Lite and then it kind of made its way into lots of demos and eventually Mosaic but one thing to highlight here that I think is cool about using MotherDuck rather than for instance the web
59:36assembly connector or sorry the web assembly version so in the web assembly version you would take the data set that sits somewhere on a server and when you load Mosaic into a web page it would load the full data set into DuckDB create these
59:59indices that Jeff talked about that make interactions really fast it would only pull the columns that you need but it would still pull in all of the data you can also use Mosaic with a
60:10server where you have DuckDB running on a server and all requests go to that server what that means is you have to have a server running somewhere but your client can be super lightweight because it only has to be able to send the request to the server and get
60:29the responses back with MotherDuck you could get kind of a hybrid version where you don't have to load the full data set in but you also don't have to run your own server because MotherDuck already runs it and the indices can stay local
60:48because you're still running a web assembly version of Mosaic which means that even if your server or the MotherDuck servers are a little bit further away from you and there is some latency you really don't want to have to go to the server for every single interaction every single movement of the brush here for instance and not for your cloud
61:05bill either
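A rough sketch of the trade-off described here, under simplifying assumptions: a stand-in "remote database" counts round trips instead of adding real network latency, and a local index is fetched with one expensive aggregate query and then answers every brush movement on the client. The class names `RemoteDatabase` and `LocalIndex` are hypothetical and are not MotherDuck, DuckDB-Wasm, or Mosaic APIs.

```python
class RemoteDatabase:
    """Stand-in for a database far from the user; counts round trips."""

    def __init__(self, values):
        self.values = values
        self.round_trips = 0

    def count_between(self, lo, hi):
        # One request per call: scans the full data on the "server".
        self.round_trips += 1
        return sum(1 for v in self.values if lo <= v < hi)

    def histogram(self, bin_width):
        # One expensive request that returns a small per-bin summary.
        self.round_trips += 1
        hist = {}
        for v in self.values:
            b = v // bin_width
            hist[b] = hist.get(b, 0) + 1
        return hist


class LocalIndex:
    """Index kept on the client; built from a single remote query."""

    def __init__(self, remote, bin_width):
        self.bin_width = bin_width
        self.hist = remote.histogram(bin_width)

    def count_between(self, lo, hi):
        # Assumes brush edges are aligned to bin boundaries.
        first, last = lo // self.bin_width, hi // self.bin_width
        return sum(self.hist.get(b, 0) for b in range(first, last))


values = list(range(100))
# Eight brush positions, each 20 units wide, sliding in steps of 10.
brushes = [(start, start + 20) for start in range(0, 80, 10)]

naive_db = RemoteDatabase(values)
naive = [naive_db.count_between(lo, hi) for lo, hi in brushes]

hybrid_db = RemoteDatabase(values)
index = LocalIndex(hybrid_db, bin_width=10)
hybrid = [index.count_between(lo, hi) for lo, hi in brushes]

print(naive_db.round_trips, hybrid_db.round_trips)
```

Both approaches return the same counts; the difference is that the naive client pays one round trip per brush movement while the hybrid client pays one round trip in total.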
61:09right so what you can do with this hybrid execution here is that you can have your data somewhere else you can as you said share it with other people it could be a very large data set you run the expensive queries the queries that have to be over the full data run them closer to the
61:24data and then move the indices which are really important for real-time interactions move them as close as possible to the user the browser so
61:35into the client and so that I think is something exciting that's happening here yeah so thanks for the callout I'm happy that you called that out and so to wrap up we basically in a smart way leverage local compute and cloud compute to
61:57offload certain things to the cloud and still keep your local compute to
62:03improve the interactivity of the visualization thanks to DuckDB-Wasm and Mosaic yeah it's not quite that by the way yet I brushed over some details but it's mostly because of some implementation things which we're talking about on Slack so yeah of course so that's good to wrap up that hybrid execution is at the moment
62:25still under development and it's stated actually in the repo in the wasm client for MotherDuck by the way this wasm client is also a helper if you're just starting with DuckDB-Wasm that's also a nice way to start and you have a connection directly ready for you for MotherDuck but yeah there is a
62:49couple of things that we can do it's still in details at the moment mostly loading the data locally but we can be smarter as Dom mentioned yeah but it's already pretty cool so the demo I built yesterday evening was using the Gaia data set and I was able to pretty comfortably
63:07explore I think like 18 million data points yeah I think the sample that we use on the Mosaic website is a five million point sample and that's just because that is as close as possible to 100 megabytes which is what GitHub Pages allows us to serve I see a lot of the limits currently come
63:30from that we use GitHub Pages for hosting here so the MotherDuck version is already like four times larger than that so that was pretty yeah no that's cool we are a bit over time but I want to go quickly to understand a bit how
63:50things work in the Python world because I'm sure we have a lot of Python users watching the show and so you have a library a Mosaic widget can you walk me through that library and what it does behind the scenes yeah so the core idea is that Mosaic
64:12sits between the visualizations and some database and the database can run in web assembly or it could run on a server connected via a websocket connection or an HTTP connection or it could run in a Jupyter kernel and we can use the Jupyter communication interfaces to let you use Mosaic in Jupyter and so
64:36what we built is a fairly lightweight widget on top of anywidget from Trevor who's actually in the chat and what it does is it lets you give it a Mosaic spec so this is using this YAML or JSON spec and then
64:56creates a Mosaic application that
65:01uses the DuckDB that's running in the kernel to run the queries and the cool thing is the DuckDB that's running in the Jupyter kernel can actually access pandas data frames that are in the kernel and so that's what's happening here so we're
65:22loading this with pandas reading this weather data frame and then we have here the specification of a chart and you can see it's hard for me to read it's kind of blurry but somewhere in there it should use the data set I think it just says from weather maybe yeah
65:43yeah here it's this object right no it says from weather in the spec if you scroll down again a little bit and so that name has to be resolved somewhere okay and the way we're resolving it is that we're telling the widget here that for this use the pandas data frame that's
66:04weather yeah and then that's
66:10rendering all right and so here if I'm
66:16selecting things there is a little delay added yeah is it because it's going to the Python kernel instead of the wasm or is this more because it's Colab and it's on Colab well this is because you're not running locally you're sending this across to that's true a
66:39Google data center somewhere yeah so the way we originally built this Mosaic widget was for Jupyter running locally so you would only query localhost but what you're doing here is you're using the same widget in Colab and so every request now has to go to the Google server that's running Colab and be resolved so here is
67:02actually also an opportunity for this kind of hybrid execution where maybe we could cache or have the indices locally maybe in a wasm version but still run the expensive queries over
67:16the kernel so that's definitely an opportunity for optimization here it was actually very funny the first time I demoed this I was in Germany and connecting to a server in the US so that demo was quite laggy yeah I guess that's what's happening to you you're in Germany yeah if you were on
67:34the west coast it might be slightly faster I mean I guess Colab is at least a bit more smart to Paran we do have data center in you don't tell like there is nothing I don't know that's a good wrap up and by the way you can access this notebook if you go to
67:57the documentation website on the Mosaic Jupyter page you have here the link to this Colab I just ran if you want to play around and if you want also to just download the notebook and run it locally and see if the performance is better it should be better we're
68:18going to close because we are already over time and thank you for being patient with all my questions I had way more questions I think we could run for two hours but I think it's a good wrap on basically we
68:39talked a bit about the heritage of Mosaic right through Vega and Vega-Lite and what led to the development of this how it works with DuckDB and how DuckDB-Wasm is powering Mosaic and then we did two walkthroughs the one using just the JavaScript client with the
69:04link I gave you there is the wasm client from MotherDuck with a Mosaic example and then we finished with the notebook are there any closing thoughts you would like to add Jeffrey and Dominik we can start with you Jeffrey if you want sure I would just first say Mosaic builds on a lot of other
69:27software packages we've mentioned anywidget Observable Plot obviously it borrows a lot from Vega-Lite DuckDB and so on so we just want to also thank the community I think this is really an open source success I couldn't imagine piecing together something like this even five years ago let alone 10 years ago and so it's really amazing
69:46to have all these tools come together and then also just encourage people who are excited about this to get involved again share feedback bug reports if you want to get involved in helping with Mosaic or sharing examples just reach out we have both issues and discussions on GitHub and
70:04we'd welcome your engagement thanks cool thank you Dominik any last words yeah so I guess going from the past there's a lot we build on and going into the future there are lots of places where this can go Mosaic as it is right now is not the end I think it's a middle point
70:23towards many more things be it these optimizations we just talked about or Trevor actually had an interesting idea about being able to register new clients in the language or instead of having to have the YAML in Python could you maybe have a Python API so there's lots of
70:44cool other things we could build on top of this so we really think of Mosaic as an infrastructure that other ideas can build on that's what I'm hoping for in the future cool thank you very much for joining me this was Quack and Code the show is happening every other week so if you
71:07want to be the one to ask the questions Dominik or Jeffrey you're welcome to join in two weeks we're actually going to talk about I think the next one is geo data in DuckDB so we
71:20mentioned this with the optimization you've been doing so I guess you've been using that extension as well and I think after that it will be around the CSV sniffer in DuckDB you can see all the events on MotherDuck events it's over there and the next one is going to be announced soon thank
71:40you again and have a great day or have a great lunch or have a great breakfast I don't know from where you were watching us and see you soon happy
71:51hour