Spatial data management with DuckDB ft. @MattForrest
2024/03/28Featuring:Mehdi and his guest Matt Forrest quack and code - diving into Geo data with DuckDB and explore the spatial extension.
Transcript
0:00[Music]
0:14hello everybody and welcome to another episode of quack and Cat and today we're going to discuss uh geod data it's all about uh uh geod data it's going to be
0:29really f because I have uh basically an expert with me uh which knows a lot about uh geod dat has has a YouTube channel we'll come to that later and so um Matt welcome you're
0:46already there I believe yeah hey how you
0:50doing so I'm good I'm good yeah happy to be here and um you know I think duck DB is is just so exciting uh for geospatial and and I've just been using it more and more so I'm excited to kind of share some of this fun stuff I've been up to and yeah talk more about geospatial so yeah that's great so I'm a bit uh a
1:13newbie around uh geospatial data and uh
1:19I want a bit to I think to take the the session into two times where we dive a bit into your background dive uh into the basic of GEOS spatial data for people that are completely newbie in that and maybe a bit highlight some some point for some expert because I I guess there is people uh more interesting
1:40familiar with geometric data and so on um and then dive into approximal example how does that sound that's perfect yep uh good um so yeah tell us a bit more about your your background and how how did you get into Geo dat yeah um well it it's kind of logical I went to school and got a degree in
2:06geography which which sort of makes sense that's kind of how I ended up doing it um but I didn't have any of like the programming or technical background um you know a geography program you know has a few different components you know sometimes there's human geography and physical geography kind of like geology and and all these different things but you're always
2:26studying something about the Earth um and then there's you know kind of the practice of you know what's called sometimes GIS or geographic information systems or you know spatial analytics is is kind of where I think things are going now but we we'll talk more about that um uses geospatial data to kind of test these different you know hypotheses
2:46or understand something about the world so you know that's where are people moving where are people living um how are you know different Trends taking place on the earth VI it climate change or Transportation or anything like that um so you know I honestly my my programming background was pretty limited I left school uh with basically knowing action script which became very
3:06defunct very quickly yeah um so you know I kind of found myself having to learn and pick things up and um so just kind of teaching myself as as I went along and and um you know lo and behold I I found and met some folks at cardo um Javier who was the the founder and um and kind of just started working with
3:27them and they were building a lot of things on on on post CIS which is the spatial extension to postgress um so that's kind of where I got into the SQL side and then I started you know learning those tools building applications learning some different things over there some of the data science side so you know it's kind of
3:43just picking it up by default but I I had that background and kind of knowing okay here's what I want to understand from a you know from a geography or from the the thing I want to study and then here's I need to get the data to that place and that's usually how I would apply these different tools so that's
3:58that's always where I've been interested in is kind of how can you apply different tools to get that outcome to you know find something more in depth scale to more data whatever that might be you know I can kind of use that and and duck DB even just within the past year you know I think we're pretty close to when
4:13the spatial extension became a core part of the duct to be release is like one of those big things so I've just kind of picked those things up as as I've gone through my career and then um I I started talking more about them probably 2021 just kind of posting things and apparently there's a lot of people
4:29interested in that which is cool so I've just been continuing to do that over time it's a it's a niche topic right it's basically like your YouTube channel is mostly about this right yeah yeah so
4:42I tend to focus on that it's it's sort of the what I know I mean we bring in a lot of elements from you know the modern Data stock or from you know cloud or local I mean so there's there's a whole kind of other world of you know anything that's kind of you know data analytics there's spatial analytics if you think
5:01of uh you know data engineering there's geospatial data engineering and there's funky data types you know both in Vector data which is what we would call a geometry which is what we'll talk about today but things like imagery are actually stored in you know like a tiff file so you have to analyze these images and then turn them into different things
5:20so there's this whole kind of other piece of of you know how to do that and especially when you know the volume of data has just been increasing like just as it has anywhere else you need some different ways to do that and you know the traditional world is kind of you have like servers and you know on Prem
5:37and you know they can kind of only do so much um and then there's kind of been a shift to some of these Cloud tools that you know will you know kind of scale but of course there's sort of a level of expense to them um you know so you know a lot of people work to find these
5:52different pockets of compute where they can like Google Earth engine is a really good example of that where they allow you to do that you know for the most part for the academic Community for free um and then you know I think what's really compelling about you know what duck DB and mother duck are up to is to
6:06say you know you can use really take advantage of what's already on your computer or when you dis scale it you can go elsewhere too so you know it's it's we it it is super Niche like you know it's you know kind of this this other piece but it's accessible to anyone too like the geometry is just
6:21another data type like you know varar and float and then you have geometry and that's you know it's just there it's what you do with that that's a little different than most so yeah and it I think you mentioned something like there there is um basically specific data format you mentioned div uh which is so shape file that's what we call it I'm
6:45completely newb so you have to educate me on this session right yeah so we have yeah go ahead yes so you
6:53have um there's this Library called gal which is like the Rosetta Stone of like geospatial data types and I think there's almost 200 different individual file formats that you can store vector and raster data in um each with their different advantages some of them are like hypers specific to like something someone developed in like one there's like a Norwegian specific file type you
7:13know there's just so many different things um so you know I think you know ultimately that's one difficult part and there's some file types that are easier to work with than others um you know so I'll since we're talking about Vector today I'll stick to that that's kind of you know we're talking about points lines and and polygons effectively that
7:31you store in that geometry so point is a you know infinitely small exact location on the surface of the Earth a line is one or more points connected in an order and then a polygon is that in a ring or multiple rings that compose of a space so if you think of like a polygon could be like a zip code or a a circle around
7:50an area and things like that um so you know there's there's all these file types and what's what's great about the spatial extension with induct DB is that it basically ports those same readers in to one single function called St read so I can have my files I can read nearly the there's a lot of the ones that are
8:09mostly the the most you know currently used or most popular you can read in there um and then I think the the big thing that's happening right now is pushing towards like more modern file formats um so geop parque is kind of a big thing that's been pushing on which is adding geospatial support to the you know paret file type system which you
8:27know makes things a lot smaller toore more and faster to read so yeah cool so yeah when like uh that was one of my question I however you have a lot of question first you mentioned uh so uh giop par um is it
8:45like as when this done this kind of a standard is popping up or is still really early how do you see that Trends coming in I mean it's still it's still early um
8:57earlyish um there's there's two kind of of parallel things happening which is geop parque and then Geo Arrow which is a you know an arrow specification for geospatial data and there's a there's a really kind of growing ecosystem around those tools so geop par just had its 1.0 release I think in the fall so it it was
9:17in a you know kind of pre-release candidate um as of last year and I think into 2022 um but it's it's active now more and more people are using it um you know there's different like you know data repositories that are building off of this uh Source Cooperative is probably the most notable one um Chris Holmes who's from the is a as a governing body
9:39of geospatial data called the open geospatial Consortium um so he's leading the researcher on that within the ogc and uh he's put up Source Cooperative which is sort of like this open data platform for these pretty large data sets that you really in older formats you couldn't just Store and download but since you can do things like partitioning and you know even
10:00spatial partitions you know grouping them by clusters of data uh it makes it so much easier to access them and they're just hosted online so you know with you know it it's there's a couple great tutorials he's published about grabbing some data from from Source cooperative and these are like open data sets of like every building in the world
10:16or you know every point of Interest or stuff like that so um you can actually just download those or not even download just read them as the file URLs into something like Duck DB and then just start you know reading your your ad your analysis around that so it makes it super effective to do that you mentioned
10:34like open data sets uh what what is the common repository today like you go to for you know GE data and what what is the reference and what is do you see growing yeah there's there's I mean the governments publish a lot of data so like you know the US Census publishes their their own data and uh you know
10:57every time they release a census there's a lot of private companies that publish their own data uh and then you know more and more you see like the you know the Google and the Microsoft's the world anyone's kind of building like an underlying base map Apple Maps Google there a couple other companies doing this they're all kind of behind the
11:15scenes contributing to a project called open Street which is uh you can go to openstreetmap.org it's been around for many years it's kind of like Wikipedia for maps so if I like go in and draw a little boundary and like call it my house and add some attribute data that's now in the database so it's very easy to
11:31actually go and get anything that's on that base map as data um and once again duck DB has functions to read in OBM sort of Base files which makes it super effective to grab that data and just start using it so there's data sets like that and uh Overture Maps is another really interesting one that's that's hosted on Source Cooperative so there's
11:51there's a ton happening in that space right now the governments keep adding more and more data and then on the flip side you have all this satellite imagery which is a whole other thing to unpack one have time to today but yeah I mean there's there's just there's data everywhere I mean if you think of it there's probably something like I was
12:07looking for Wildfire risk indices and there's a there's a satellite that publishes this every single day that you know the EU and multiple governments pay for that says here's the Wildfire risk and you're time to have to go get it and once you do you can make really great use of it yeah so the open street map is
12:25definitely something uh I work in the past but I think like it's it's always uh a bit hard to treat the source data like if you it can be roll back a bit uh like before ddb and like you have you made a presentation uh with a nice slide I'm stealing with the modern g stack can we
12:49can we uh go to to that one let me uh share my screen yeah so yeah so this one
12:57basically um which is yeah can you can you work a bit through based on what we discussed so far and uh and the point I think you mentioned that uh bges which is an extension for pogress right is still heavily at the center so I'm curious to hear like how just a bit of
13:19the history for people not familiar how come it's it's still at the center and how do you see it evolving in the future yeah I I think we're you know postgis is is really Central is that it has the most spatial functionality in terms of the functions and if you look in the you know the function definitions
13:39any function that leads with st it's a spatial type it's just kind of the common you know the way you call that you know different databases will support varying amounts of those between even the data warehouses um and things like that but post just has thousands of them and I mean there's there's one to do kind of
13:57everything you need um you know like you know there's Center of mass you know there's you know kind of you know there's you can create one that's like the least accessible point in any given polygon um you know and then there's even extensions on top of post gist we'll talk about one today but there's you know like PG routing which is a
14:16routing engine to you know kind of do what you would in Google maps to draw you know uh you know your here's where I would drive to and but you can do it on any network so what I you know we'll we'll do today is pull in just bike routes and kind of you know rate those so there's all these different things
14:31you can do with that so it it be just with the sheer amount of spatial functionality um and then on top of that the support for projections so of course you know the Earth is not flat so you have to you know basically have some mathematical calculation to turn that data into a flat surface and there's hundreds of these different projections
14:50for various means you know you can use them to preserve different attributes but there's a few common ones that are used and and it specifically relates to things like measurements you know if you want to measure it you want to be in meters some show up in decimal degrees and stuff like that um so if you've ever
15:07there's you know if you look at Google Maps like things look weird and out of shape like Greenland looks like this massive thing that's just because of the projection that they chose and like some of the engineering choices they made but you know there's the the actual world of course looks looks different in scale but the measurements are preserved in
15:22something like that so that's why post just kind of remains this core but it has so much other connectivity into different places like you know it's it's easy to read in files both vector and master into into a post database and then you can um you know you can do a lot of things there but of course the
15:40the big one is kind of like when you need to go scale right and you need you have either larger data volumes in terms of you know number of rows or things like that and then you have to start to kind of explore okay what else can we do here and I think there's this interesting crosswalk where you know
15:57duct DB can kind of provide that olap functionality that you know can help you go to larger volumes and one of the specific challenges I see a lot is like where there's maybe some spatial data where it's not necessarily large you know like what we'll look at today is like the New York City city bike system there's about 1,700 stations um there's
16:16there's more in the data set because there's they use temporary locations sometimes but that's not by any means large data that's 1700 points you know nothing big but there's millions and millions of trip records that you want to be able to interact with that so you don't necessarily need a a transactional database to be able to query that you
16:34need something else to do that bit and then you can use the spatial functionality to do other things like maybe there's spatial clustering you want to do maybe you want to you know create sort of ride territories or you know grouping or whatever you know you can kind of use both tools for for what they're really good for so that's kind
16:50of the the back and forth between that is is kind of how I see these things evolving which is which is really good I think there's a healthy kind of balance between how you can work with all those together yeah no that that makes uh total sense and I think I mean POG in analytics in general has been there for
17:07you know a long time and I feel it's a for data team it's always like the first
17:13step for analysis if you build your you know your SAS business or whatsoever on pest it's like you start creting that and pushing this to some extent and then when you start to run you know larger query you you should the up that and use a proper all app system but um but then it's it's basically uh one one fit the one use
17:38case that fit them all right as you mentioned uh exactly in data and so on this on this image we have we mentioned geop par which is you know kind of the new format you mentioned a couple of government uh website which is man uh maintaining uh Geo data and uh yeah
18:00yeah and and I mean it's if you're like Curious to look around for this stuff there's like a good chance that your you know country your Province your city maintains some sort of geospatial data like there's you know there's lots of open data out there and and geodata tends to be a pretty predominant one so there's there's there's tons of stuff
18:19that's available to to play around with so and so we mentioned already open street map uh that's also a big reference and uh gal which is the library use um you know for for injetion and projection also I guess m yeah gal
18:38allows you to move between file sources you can move from raster to vector and vice versa so there's a ton you can do and it's kind of like ubiquitous within the geospatial ecosystem like nearly any tool you can find it's kind of like it's it's that's what duck DB uses that's what geopandas uses that's what Hostess uses it's it's everywhere you know you
18:58can't so it's kind of like the arrow for geo dat it's like the man behind everything yeah and there's Arrow coming now too which makes things even better as there's you know if you can read Arrow then you know even even better yeah yeah yeah I guess um it's so because Arrow also has a has a lot of
19:17adoption for um communication between data uh processing engine and um just libraries and ter probability so one is case for example just for people not aware is that that arrow is eily used within ddb to convert the panda data frame um easily so you can create your Ponda data frame or po that a frame directly from Doug DB thanks to Arrow so
19:41regarding storage we can talk about it uh that I think pretty obvious I am I'm a bit more curious about um yes the
19:50visualization part can you can you touch down a bit on on those two lane yeah yeah there's a few um I I I just published another update to this as well so there's a few kind of additions there um but I kind of I I bucket it out as there's kind of when you look at like analytical platforms that allow you to
20:08do some orchestration uh the company I work for cardo is is one of those where we kind of do a lot of that orchestration on top of like a cloud data warehouse ecosystem so you know you can kind of do all that and then easily pass that into to visualization um you know so that that allows you to create
20:24you know map visualizations with lots and lots of data and it's it's um you know that's one piece there there's also other tools um whereabouts is a company that's built off on top of Apache Sedona um so that's kind of uh it's it's like the spatial it's like spatial spark if you kind of think of it that way oh um
20:42you know so it it allows you to you know do that you know and of course it requires spark and all those different things but it you know kind of has those pieces and there's some visualization built into that what I think is really interesting is is kind of there's there's a few things that are coming out
20:54right now that are kind of being built um so uh in just ter terms of like creating Maps has a cool technology called felt which like allows you it's kind of like a feels like notion for maps it's very easy to like collaborate and visualize and do some stuff like that um and then another company that I
21:09think it is it's like you know a couple weeks old but um it's called fuse that allows you to build like these little function building blocks called udfs within you know python code which you could easily you know wrap duct DB and Python and then kind of scales that from a serverless perspective but also how allows you to scale the
21:25visualization um and then uh there's a really amazing Library if you're if you're a python user uh called lawn board L NB o a d uh it's built by a company called development seed and that allows you to basically take and visualize data but it does it all with Geo Arrow so it does it within uh you
21:43know that that Arrow transfer protocol and you can visualize you know millions and millions of Records uh right in your notebook uh just with a few lines of code so that's an amazing process and I think there's this interesting interplay between you know geop par you know manipulating your data with duck DB passing that out and in a data
22:03frame and then really writing that right up into you know a map you can do that in just a few steps now which before this took a lot so much effort to just go from data to map and and without getting into the weeds of how the the map visualization technology Works um it used to be really tricky to do you have
22:21to you know generate these static map tiles as they're called um you know and and you would have to have either some library or some server to do that now it's just you know file you know manipulate your data map and I think that's like the the really cool part so yeah you're showing lawn board right now which is like just if
22:40you want to get started showing a bunch of data that's a cool place to start uh just wait I'm really sorry but
22:48we had the crash and I'm not sure I think we're I think we're back we're back that's fine just a small like um no but that's uh that's great and yeah so this uh longboard uh uh longboard thing is uh is over here so what I suggest is that we have oh yeah we already half time so we can go
23:17uh to uh the code Source uh that you
23:21shared uh with me uh I need to find it actually back wait a minute and we're going to just uh go over that so you can go um for people
23:36following along
23:48um can you still see my screen some weird uh eup but let's hope
23:55uh let's hope it's running um so yeah yeah uh we have uh basically this repository you can go um for uh I'll put the link also in in the in the description and um and basically we're going to go to a couple of example that just SQL file and Doug dbcli I guess
24:19yeah yeah so what I'm what I'm doing is I mean uh basically building off of your many tutorials kind of just going duct TB nvs code and then you know writing and and you know interacting with data but we'll interact both with mother duck um and then also with uh post so I have um you know that that part of the code
24:39um you know is there's a few extra steps there um you know but I can I can cover that it's uh you know to talk about the what you would need to to get there but just to actually do some routing um so yeah do you want me to share my screen or uh yeah I think we do we do have an
24:55issue with the live stream uh because I do received some uh some comment so let me uh check this out one
25:20minute all right we are back hopefully sorry for that um so we were uh just
25:27about uh diving into uh the
25:32codes so what we GNA do uh basically here is uh just go
25:41over uh the code which is
25:45um on gup and we're going to f uh a couple of uh of example in GE data let
25:53me just uh fix uh this quickly
26:24there we go so we're back oh yeah so
26:29sorry for that so we uh have uh we're going to just run the the CLI I guess here and as you mentioned you're following the same uh setup uh than me where basically I'm sending uh queries uh to the terminal right and so it's nice because that's enable you to write to go into a single SQL file right and we have a we are
26:57using vs code here here and so I can just uh send this and uh Des send it to my terminal over here I want to try something so I'm launching relaunching Doug DB because I think there is a couple of like load which is not needed anymore um because there is a automated ook uh internally from dgdb which is
27:19detecting and installing and loading an extension directly for you um so yeah
27:25what you want to talk maybe quickly about like those those two why why those two are needed yeah so when you when you load in the spatial extension that um basically allows you access to all those like St functions that I was mentioning as well as some of the special ones to kind of read in that that data so you know you
27:44have that St read function um so if you if you go to the documentation um you know you you have that function you those those libraries now and that allows you to basically access that geometry data type so I can both convert geometries from different types if there's text or binary or geojson which are different formats to
28:04form that and then you can create geometries from values like a really common one is St point which basically allows you to take like a latitude longitude and turn that into an actual point right so that's one you can do there so let's say you have a point and then you want to say Okay I want to make
28:20that a buffer so show me like turn that into a circle with you know like 100 meters around it then you use St buffer to do that so it takes the geometry as an argument and then then the radius that you want to create around that so there's all these there's lots and lots of functions that you can see you can do
28:35these different things like I can measure if there's a polygon in area or I can pull peel off the lat long you know columns from that and things like that so this gives you access to all of these and then the special functions that are within dctb is this St read so that you know makes it so I can just
28:50read that raw you know file that's on my
28:54you know on my computer to just read that as a table you know so be that in you know in your python be it in a CLI wherever you want to work with and it supports all these different file types so I mentioned there's a lot of these different types of files this is just a sample of the things that you can use
29:11right um and then the osm data you can actually pull down these files are called sm. PBF and those are just big batches of of open street map data there's a few other really cool tools that people are building around this there's a nice Library called quack osm um and you can actually pull that and read some data there just using you know
29:30there's a lot of filtering there's there's so many tags within this like if you think about it there's you know it's kind of like the node and ID geometry um things like that and then this tags which could be basically hundreds of different types of like nested Json which is kind of crazy but quack kind of helps you filter that to say okay I
29:50really only care about roads or maybe roads that bikes are safe on and things like that so that's mention conversion I just wanted to highlight like from the the DB extension that you have multiple way to do uh conversion from other format that's what you were mentioning right because there is multi multiple of them and so if we go back to uh to the codes
30:16we have uh so we mentioned also you can do all those command of course you can uh you don't need to use the ddb CLI you can use just the notebook and the python uh python client and run this into to into a notebook um so we have the special extension https for connecting this also is not in need it anymore I
30:36think it's because uh before we did it specific the region but I suspect let's try actually uh is going to work uh
30:46without so you see yeah so I think you uh it's detecting the region now automatically based on the on based on the so we have write some data here so
31:00more than three million row right and so
31:04can you talk to us a bit to those those spark sets yeah so those
31:11are all coming from Source Cooperative um which is it's B called beta. source. Coop and this is basically that really massive data set that I mentioned um that's looking at it's like a combination of your Google and Microsoft open buildings data set set so basically Microsoft trained this super computer to go out and detect building polygons uh
31:34from satellite imagery and then they just publish that so there's like two billion Footprints you know Nationwide but this is all put into geop par that allows you to you know do things like this and query it at scale really really quickly and you know go from there so allows you to scale that up quite a bit
31:49to go you know and actually use that and then Source Cooperative basically gives you access to the partitioned you know data so it partitions it by usually like country or this indexing method called S2 um and there's a few different ones of that you know so it's actually like a spatial partition more or less so you can actually just you know I only care
32:08about Norway in this case I can just grab Norway and then be on my way so it allows me to do that really much more effectively cool so yeah um so here we we can describe the the schema so just is to it's just to describe and then the URL uh from a stre so this is a public
32:25bucket so I don't need to do anything and so we see we have uh geometry uh blob and uh different different things
32:36like country ISO what's the confidence that is the basically the confidence in terms of that thing you know existing as a actual building so of course you know you're you're training basically these like almost like a deep learning model to detect that as you know this is a building so it gives you some level of confidence like we think
32:56this is definitely a building or maybe not not so yeah yeah so I see so um so
33:02yeah let's let's uh go for it and create uh that table and I'm wondering so I haven't load the special extension uh so I think it's going to I might there might be a hook here or I might get an error based on that so uh let's friend I want to check it okay so here I have
33:22actually a a specific warning which is nice um so my point was that when I'm doing uh a query on a verse Tre is actually using the HTTP FS extension be the scene and I don't need to install and load those because there is specific hook but for this as you see is detecting that I'm calling a function
33:42which is part of the SP shell extension so I'm just going to install it uh and load and uh just create again this so
33:53what are we doing with this we taking just the data from llo is that it yeah so so we're doing is we're creating um you know like a uh basically finding around a centroid in Oslo so what you know I'm creating a table because after that I'm going to basically export that out of geojson and then once we swap
34:14over I can actually show you there's a really nice extension in vso code to just make a map really quickly so I just can select my gojon and throw it in a map so as you can see here I'm using this this function STG from wkb and what that's reading it's reading the well-known binary which is the how
34:30geop par represents the geometry and then turning that into a geometry that duct DB can can interpret um and then I'm reading that directly from that file type and then the the last bit is you can see there's a wear clause which is this STD within um so that basically represents St distance within um and what I'm doing it's kind of funky
34:49because the projection there's the it's in a decimal degrees so it look like this oblong oval which is odd but it's just a simple way to filter it so what I'm doing you provide it two three arguments first is the geometry that you want to filter this is from the table the other one is why you want to that
35:05you want to measure it from which is this point is a lat long representation of you know approximately the centroid of Oslo and then 0. one is basically the distance that I want to measure from so that's in decimal degrees so it's it's not perfect but it gives us a nice way to just kind of grab a chunk of data to
35:21see and then we can do that so it allows you to just query it there and then so you can select that and then you have a table of I think it's like 50 some thousand buildings you know within Osa do I just question how do you differentiate a baret from a standard geop parket just by the data type which
35:39is inside yeah just just the geometry you'll have that geometry and and wellknown binary and that that's pretty much it um and then you know there's optimizations made on that column you know for indexing and and things like that great uh so now we have our table
35:55uh which is created uh in memory ddb because I haven't uh passed anything uh when launching dgdb so it's everything is still in memory so if I quit everything is gone um and so now
36:11you uh want to copy to G goon right yeah
36:16uh what what kind of format is used uh usually G gon can you work us through that geojson can be used uh just as any other standard Json I think for a while it was the web standard you know to actually kind of either put that on on top of a map so as libraries you can just put that on top of a map um and
36:36then uh it also allows you to you know it's it's it's fairly effective to share and store so if you loaded that into GitHub peray um you know it would actually just if you clicked on the file you could actually see a little map that pops up so it knows that it's geojson and creates a little map there what uh
36:52what you can do is that within vs code there's a an extension called um me get the exact name because I always forget the exact name of it but um me go by my ID G Jon sneit viewer it's called actually um I
37:11think it's called geodata viewer I see it so it has 150k uh I have it uh already so I guess
37:21I need to click here yep so if you all you have to do is create the geojson itself yeah and then on the left if you right click on the file that's created you'll see a little option that says map and then it will just create a nice out for you it's done yeah it's done here yeah
37:40so yeah that's that's that's pretty cool so that's like really how you can quickly interact like process data and visualize this how this like you've been you're like you've been in the domain for a couple of years how this like this kind of workflow change for you compared to to before like preu I would say the number one thing that's like making a
38:01difference is that you know before you always had to like load data somewhere you know so either there's like you have to pull the file into a notebook read it you know and that operation could take some time you know but if you just wanted to see what's in there or quickly visualize a chunk of that data this
38:20makes it so much faster just to do that so I can go from you know without really you know like loading a data file you know like you have to you know take the file and put it into your postgress or do something like that that kind of like IO operation is like significantly decreased plus the ability to do that
38:35exploratory work especially on much larger data sets is is way more efficient and that's kind of a big chunk of what you do is to say you have these much larger data sets and you just want to do some joins or Aggregates or you know like search by time or whatever that's been significantly decreased so your time from you know file to
38:53exploration to you know some sort of analysis goes much faster and then
39:14[Music]
39:58the schema with just one line right without any different dependencies in the CLI so I'm using this actually all the time uh and I would recommend like it's not because like we it's a duct TB team uh session but any data engineer that's like even if you're not actively using ddb it's just a toolkit uh for uh
40:19quick exploration um that's pretty Andy so that's cool so we just did a geog Jon based on uh spal data did a quick filtering and display it on vs code so that's was the first one we can maybe try um uh this one I haven't done any
40:38maybe I can um uh this I have the second one that is in the folder 2024 0327 I
40:46have it on my on my machine so I can run that there's a few other dependencies I can walk through if that if that helps yeah yeah yeah let's uh let's do that uh
40:58so let me uh so I got I'm sharing the screen I don't know if that's coming through or not yeah it's not coming through I think let me double
41:15check uh but so in the meantime um maybe
41:20you can walk us so it is this this one you mentioned that you were saying there is extra dependencies yeah yeah yeah 0327 um yeah yeah so what what the other dependencies just just for for my understanding yeah so I already I loaded a bunch of data into into mother duck um to actually because it was really kind
41:42of longer form so you know like you know it's nearly uh 100ish million rows um so I just wanted that like it just gives that that big boost for you know like that that serverless side you know when you want to do some of those things really quickly so the data that's here is is from the city bikes data set um which is the Bike
42:02Share system in New York city so like I kind of mentioned it earlier is that you know there's 2300 stations both current and historical um that ex have existed at some period in time um and for every trip that's taken um you know there's basically a log of that in this open data so it has all these files and of
42:25course it's not clean data they switch schemas twice that's I think that's one big problem with the with the open source yeah they the I mean the open data sorry from government is that I think they still need education and you know EV evolving around those things I mean if they move already from xlight to CSV or
42:49parket that's already a great win I think like yeah New York's uh data is has been moving you know to the to parket file couple years back but there's still a lot like it's just Excel she and I think like yeah asking for uh stability in schema is I think another level of maturity that's it's not there
43:09yet yeah yeah yeah it's so it's it's weird because I think it shifted providers a few times like now Lyft owns the system so I think they're doing some of that and then in the past there was one scheme of switch around 2015 or 16 the system launched in 2013 and then in 2021 there's another switch because they
43:28added new bike types like a category because there's ebikes there's like kind of in between ebikes and then there's your regular bike so they can you can see that and there's some data about you know like the you know if it's if they have information about your gender and stuff like that uh if it's a member like
43:45you can have a membership or casual ride and then the weirdest thing that I found is that there used to be a consistent lat long for every Row for this end and start station but now it actually does it based on where you end it in the app so sometimes will log to the station and sometimes it will log like slightly off
44:00to like the left or the right it's it's very there's all these intricacies but what what is great is I was able to you know I I couldn't read them directly because they're zip csvs and I didn't use the same the right zipping standard which was another annoying thing but I I pulled all the DAT down created a little
44:16you know code snippet and then just started uploading you know these the parket files to mother duck and then I kind of pulled that into one giant table which is that um uh that City bike um
44:29you know it's I think it said City bike 2021 to present so within that there's like 98 million you know rides across the city which is which is you know pretty good I think that's you know having that is kind of going there um let me see I don't know if um but if it we I'll I'll suggest we go to to the to
44:46the code true uh because I have a slightly Technic to show your screen uh it's been bunky R today it's it's like that you know it's that that's show you that it's live there's nothing recorded but we'll definitely come back because I think what's nice is that with mod that you can share uh you know database so you
45:07could uh you could share this one and we can uh come back with a proper blog for
45:13that I'll get it cleaned up and yeah what what's great about is like it's effectively doing the same things where I can kind of see okay where's the top station you know like and there's this one station for whatever reason that 20 21st stre in Sixth Avenue that's like the consistently always the top um and then you know you can kind of see okay
45:33where are all those people going to and it's like really short neighborhood trips so for this this reason Chelsea is like the hot spot for City bikes and people riding just in the neighborhood um and then the last little bit was actually kind of attaching a postrest database um so you know it's I I just finished up a book last year on like you
45:52know spatial squl and doing that in postrest but actually working with PG routing so I'm taking that same open street map data and creating a network uh and prioritizing bike safe routes and then within that I can write this this there's this very long query on like line 62 to actually create the routes for the top 10 locations that that
46:12station goes to and I can actually end up and see you know here's the you know the the routes or the bike safe roads that are most used within that so you can actually go from this you know what I think is great is using all these in conjunction I can get some actual ins out there like where you know where
46:28should I prioritize you know bike lane investment or bike safety in the city or anything like that just based off all this data of you have millions of people traveling on bike you know dayto day and and a lot of this data exists for you know bike shares in different cities around the globe so you know it's it's a
46:43very cool data set that you can take and then have some practical something come out of it to say you know here's some recommendations you can make if you going through this this data so it just shows like this really cool and and I'm doing this all within the command line you know which is great I'm conting to
46:57my postgress over here I'm connecting to mother duck over here I get all the power that I need between the two so it's just this really nice Union of bringing all those together what what is missing for you today because so you mentioned like like the dependency to uh to BU so maybe let's let's go over uh
47:14the the code so here you're showing the top station the top neighborhood so this is just working with the Geo dat um and then from a specific uh neighborhood and so I guess okay so there is directly station name so you don't need GE data for that it's just yeah um the station name then uh here you what what is this explain maybe
47:40a bit to the audience which line are you on because I'm not seeing it on yeah the um the um uh and no you should see it yeah I'm following along so you tell me what line you're looking at I can yeah yeah for sure so install post um line yeah where so what I'm doing there is I basically took that that table with
48:05the top destinations from that station at West 21st 6th Avenue and what's great is I can just attach my postgress database um yeah and then um pull that over to write a table from Duck DB which is drive from data in mother duck and push that in my into my postest I can create a table or view whatever I want
48:25yeah and then what I'm doing here is writing a query for PG routing to actually draw those routes right so I'm saying from that station and to the top I think I picked the top 10 stations in the in you know that that uh people ride to there create those actual routes and using that prioritized bike Network that
48:46I've created so you can actually go from like one to many to to visualize all of those and you could go from you could go for every single one across the city if you wanted to and um just see that but what that's really nice is that like I said you know post just has that like Heavy Deep spatial capability that you
49:04know you may or may not want an noap right that that there's like you know it's kind of choosing the right tool for the right battle um but the fact that I can do all my data prep here and then still connect with really out leaving my my IDE in doing back back and forth is really really nice and you can just have
49:20that little bit over there so I think that that's like an incredibly like powerful connection and then I'm just SP out those routes I can create a map and see what's there so you know I don't have to you know go back and forth from you know PG admin here or what whatever you want you know it's all in one place
49:38which is great that's yeah so that the the in probability with buos and Doug DB is really nice and it work both sides so we have a lot of people that do their first analytics in Doug DB so they run the compute in D DB from uh from pest so they it's it's the other way around than
49:56here right uh where here you uh Federate the query from dgdb and you push it to pogress uh but you can do the other way around uh to offload an etical workload to DB and as you mentioned um yeah no context switch without different CLI you do everything here and talk to your P I want to come back uh and maybe because
50:21we're already running um closing to to the end and a bit of closing F so we you mention on this use case so PG routing which is you know another uh extension
50:35uh to basically for for routing uh functionality yeah so here my question is like where do you see the ddb special extension evolving because it seems like there is so much available from POS um in my opinion it seems unreasonable to think that the GB can you know cover all those things but where do you see where what is the one
50:58thing that you're still missing today in dgdb that you would like to see in geod dat world yeah I think you're spot on in saying that you know it is unreasonable to say that it will you know get to that parody of of postgis or anything like that um I think if like if I have my wish list of things I would love to see
51:16um you know there's geop par I'm kind of reading in Via hack like I have to take the binary and then do that St as wkb that we we kind of showed um so having some of that native support and now Geo par is supporting a Geo Arrow within that so being able to go from you know
51:32file Arrow duct DB you know kind of streamlining that will definitely kind of add that extra level of things there and then the last is probably like supporting you know like maybe if there's and I don't know if there's a good way to do this but projections because a lot of data is stored in that it's it's a projection number called
51:484326 so like that you know in the first one it's a decimal system right but it's not meters and usually you measure things in meters not decimal degrees you know across the Earth so you know there's little things like that that you know can be addressed by other things but just having like a spatial olap you
52:06know that can that can scale up in in different ways you know there's you know you have it locally you can go serverless with mother duck you can go into python I think that's just a massive Leap Forward and then you know there's just those little pieces to kind of you know get you know connectivity all across the line but I think it has a
52:25lot and I think one thing also really interested on is the native geometry type so there's a geometry type data type support in duct DB and then there's another one that's called like Point 2D line 3D those those are super fast and and there hasn't been like a lot of work on the data type of geometry for a long
52:44time so it's very exciting to see where that would go and I'm I'm curious to see how that pushes forward and progresses you know in in addition to some of the other file things so that's the two pieces I would say is like you know geop par and then seeing where the the native geometry types go yeah um so I I don't
53:03know if Max is uh is uh is watching this because it's one of uh you know the big contributor of the special extension but you heard the wish list but to uh recap better support for ingestion of you know modern file like geop Parke without a small work around and then then projections uh but then at the end you
53:26still see as as we talked earlier on this session buz as a you know a key element for for other count of of things like you mentioned routing for example does that make sense like as a trend for you yeah yeah I mean I think I think there's you know like post post gists and and post crust are always going to
53:46kind of be there in some form especially for the things that they do well with but you know like I said I think you know having that that easy coexistence like you said either from Duck DB into post crust vice versa you know as a duck DB you know foreign data rapper that that is so you know each thing can do
54:03what it does really well you know ducks and elephants can be can be really good friends so so max just reply if you can see the comment that is in the chat so I hear you um great um thank you so much I
54:19think we we went a bit fast on the second part of the coding but don't worry don't worry we'll be back uh we proper blog post probably and and yeah I like to do a dedicated video on spal data because I learned a couple of things and I was pretty amazed like how easy it was because the last thing last
54:38time I checked like I would say five years ago uh we trying to do things with leflet and so on the process like was much harder was involving still pogress and so on and I think here there is a couple of things that's being streamlined that's I think the the conclusion of this session through thanks to ddb and the special extension
55:00so that's great and the interoperability with post make it really uh nice to fit directly into that ecosystem uh Matt where where can we found you uh and probably I should uh
55:14mention your YouTube channel but what else do you want to to promote this that's your pit time because you spend a lot of time with us today yeah so um yeah I'm on I'm on YouTube at M for so I I I post videos regularly there on topics around geospatial and and do some live streams with some different coding
55:33webinars I'm trying to do one like a kind of a weekly thing on different topics and then the other place is on LinkedIn so um you just find me if you go to my YouTube you can find the link there um and then I I post stuff pretty regularly there as well um so you can find me in in those places um you know
55:49uh you know like I said I work for Ricardo so you know we do we're doing a lot kind of in the spatial you know scale and and spatial analytics there and then if you're really interested in this you know I got a book here so if you want to guide deep into spatial SQL um you can check that out and there's
56:04links on my different places on the internet that you can find that too so cool uh I'll make sure to share all the links uh in the description of the YouTube video but uh thanks again uh for
56:18everybody else we do uh Quack and code session every other week on the wedness day and in two weeks I have actually finally someone from the dub team uh talking about uh CSV and I know it's a nightmare uh for a lot of data Engineers so we're going to see how Doug DB actually uh do a lot of things under the
56:43hood uh to make the parsing of CSV a
56:48blessing uh I'll see you around and have a good evening cheers
Related Videos

1:00:53
2024-09-24
A new paradigm for data visualization with just SQL + Markdown
Come to this Quack&Code where Mehdi will discuss data visualization with DuckDB/MotherDuck, specifically focusing on Evidence! Archie, who is building evidence.dev, will join us to share his wisdom on charts ;-)
BI & Visualization
YouTube
Quack & Code

50:57
2024-08-21
Moving Forward from Ad Hoc Reports with DuckDB and MotherDuck
Learn about ad hoc reporting with DuckDB from experts Mehdi Ouazza and Jacob Matson, creator of the MDS-in-a-Box.
YouTube
BI & Visualization
Quack & Code

1:02:10
2024-07-24
Leveraging DuckDB and Delta Lake Together
Quack & Code on table formats and Delta Lake with Holly Smith from Databricks. Can you use DuckDB and MotherDuck to query your data lakehouse based on Delta Lake? Yes! And Delta Lake provides a better format than Parquet.
YouTube
Sources
Quack & Code

