Panel: Fixing the Data Engineering Lifecycle; Coalesce 2023
2023/10/27 — This panel explores the MDS-in-a-box pattern, which has been a game changer for applying software engineering principles to local data development.
Transcript
0:01 Good morning, everyone — I think we'll get started on this panel. Thank you for joining us this morning for the panel on the data engineering lifecycle. I'm Louise de Leyritz; I'm at Coalesce with CastorDoc, a data catalog that is disrupting data management through AI, and I'm also the host of the Data Couch
0:22 podcast, a podcast that tries to simplify a lot of the data conversations. I'll start by introducing our three great speakers. First, Matthew, who's the CTO of Ternary Data and also co-author of the book Fundamentals of Data Engineering, which you've probably all heard of or read. We also
0:47 have Mehdi, who's a developer advocate at MotherDuck, a serverless analytics platform powered by DuckDB — you might have heard of MotherDuck, it made a lot of noise in the data community in the past few months. And then we have Sung, who's a solutions engineer at Datafold; Datafold is an automated testing platform for data
1:07 engineers. So today the goal is to dive deep into the data engineering lifecycle, but before we do that, I'd like to set the context and set the stage so we are all on the same page for this discussion. As defined in Fundamentals of Data Engineering, which is co-authored by Matthew and Joe
1:29 Reis, the data engineering lifecycle is a structured journey that takes raw data to valuable end products. The different stages of the lifecycle include data generation, storage, ingestion, transformation, and serving, and it's also tightly intertwined with important elements like security, data management, and orchestration. Matt, you co-authored the book, so you are the best person to elaborate a little
2:00 bit on this definition — could you provide some more context for our audience this morning? Yeah — can everyone hear me? Yeah. So there are a few things here. When Joe and I were writing the book, one thing we really tried to emphasize is concepts that had stood the test of time already and that would stand the test of time in the
2:22 future. And so I think there had been a tendency in the era of Hadoop, maybe ending four or five years ago, to say that data engineering was a particular technology — so data engineering is Spark, or data engineering is the Hadoop file system. And then what we realized is that technology evolved very, very fast after that, and we saw a proliferation and a
2:44 diversity of new cloud services and open-source frameworks emerge. And so when you're defining data engineering, it's really important to focus on the job of data engineering, on what you're trying to accomplish. You need to figure out what the job is first and then figure out what the technologies are, and in the past we'd kind of gotten
3:01 that cycle backwards. So that was our real emphasis, and hopefully — I mean, there are things that I would probably change in the book in a second edition now, but I think the overall concepts are still there and hopefully will hold for another decade at least, as things evolve, as the cloud changes, and as other trends come and go. Okay — thank you,
3:21 thank you so much, Matt. This conversation stems from the fact that the data engineering lifecycle is at an exciting crossroads. We've been increasingly incorporating software engineering principles into data engineering practices, which was necessary to navigate the complex data landscape that we evolve in. But now, if we take a moment to rewind a little
3:49 bit — what are the challenges and the gaps in the data engineering lifecycle that brought the industry to this juncture, where we started incorporating software engineering principles into data engineering practice? So why are we here today discussing the solutions and enhancements in the engineering workflow? Matt, for this one I'd like to turn to you again — can
4:15 you please share your take on what is wrong today with the lifecycle of data engineering? What I'd say in terms of what's wrong — there's actually a positive side of this and then a negative side. The positive side is that as we moved out of what I would call the era of big data engineering, we actually
4:34 didn't lose the big data part; it's just that big data became the default. You could do big data in almost any tool that was off the shelf, and we saw a big democratization in data engineering, and suddenly it became much, much easier. So now I could just go to the cloud console and have BigQuery or Snowflake
4:51 or a number of different tools running in my browser in seconds, and I could run queries, and that was fantastic — it meant that the uses of data could spread to many, many more organizations, and data became much more accessible. But the downside is that with democratization you need a lot more responsibility to go with it, and we're
5:09 still working on that part. And so I think fundamentally that's what conferences like this are about — training and best practices and getting better at our jobs. We've seen data stacks deployed in many, many more organizations, and so we're now in that process of catching up on best practices and training, and doing things like
5:28 managing costs, and then maybe maturing into these amazing tools that we have at our disposal now. That's kind of my take on it. I have a hot take on this — I feel the general
5:41 delivery pace of data teams has been slowing down recently, and I think it's because of the increased complexity of the modern data stack and the demands on it. And so we've reached a point — I think that's why we're asking the question now — where this slowdown is starting to be the new norm. I guess many of you have felt it: you put tickets in
6:05 Jira for a data team and wait a week, sometimes a month, and it's just normal to wait that long for
6:15 feedback from the data teams. And I think there is nothing to blame there, because it's kind of a normal way of doing things — we build tools and we increase complexity with them. So I think one big challenge we have to tackle now is to simplify the stack, but also to come back to the developer experience, because
6:39 that's taken a bit of a backseat and become an afterthought — and we got that in the keynote too, right, with the developer experience. So I think that's something we need to take more care of. What do you think? Yeah, I think, to your points, big data has become such a
6:56 default. I remember starting off my career in 2014, and I thought fast looked like taking a five-hour query and turning it into one hour, and I was so proud to brag about that. And now we're at a place where I think our collective imaginations have some breathing room, because things are fast enough and dbt has made things ergonomic enough. We're
7:15 answering these anecdotal questions that are still really tough to answer well — some of them I'm going to list off: hey, what's the historical performance on this, and am I beating it? Hey, how often does this fail in production? Who uses this model, and how often? I'm adding all these new models — is it worth the money to run this
7:30 in production every day? And as you can probably imagine, for all those questions you're in a dialogue like, oh, that's about two to three clicks and a couple of tabs, and you maybe do that ten times in a row if you're gung-ho and/or type A like me. But after a while you grow numb to it and you're just like, I'm
7:44 going to quiet those questions down, because it's not worth the friction. And I think we're getting to a point where it's like, all right, it is worth the friction to do some of these things — or maybe someone will create a solution to just melt away all the toil in answering those questions.
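Some of those anecdotal questions — how long did this model take, did it fail — can already be scraped out of dbt's artifacts instead of clicked through: dbt writes a `target/run_results.json` after every invocation. A minimal, standard-library-only sketch; the payload below is fabricated for illustration, with made-up model names:

```python
import json

# dbt writes target/run_results.json after each run; each entry in "results"
# carries a unique_id, a status, and an execution_time in seconds.
# This sample payload is fabricated for illustration:
run_results = json.loads("""
{
  "results": [
    {"unique_id": "model.shop.orders",  "status": "success", "execution_time": 42.7},
    {"unique_id": "model.shop.refunds", "status": "error",   "execution_time": 3.1}
  ]
}
""")

# Answer "what failed?" and "what's slow?" without clicking through tabs.
failed = [r["unique_id"] for r in run_results["results"] if r["status"] == "error"]
slowest = max(run_results["results"], key=lambda r: r["execution_time"])

print(failed)                 # models that errored on the last run
print(slowest["unique_id"])  # the model worth optimizing first
```

In a real project you would `json.load(open("target/run_results.json"))` and compare against previous runs to get the "am I beating historical performance?" answer.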
7:58 And then I just get to be the author of my work. And so, yeah, to your point, I think it's going to be a lot more about developer experience and just making me feel like the hero of my own data story. So yeah, that's my take. Thank you for these three hot takes —
8:18 scorching, even. But enough of problems — maybe we can talk about the solutions to the frictions in the data engineering lifecycle. Matt, you mentioned the technologies that are supporting us in reducing this friction; now we'd love to explore the technical landscape that is bridging the gaps in the lifecycle of
8:41 data management. The data engineering field is evolving super quickly — we have new tools popping up every year. The concept of the modern data stack in a box, especially, has recently caught attention as a game changer that would enable more efficient, localized data development and
9:06 management. So I'd like to ask: what are the current solutions or emerging technologies that have been introduced to improve the lifecycle of data engineering? Sung, I'd like to get your take on that — what are the solutions and emerging technologies that you see starting to transform the lifecycle? Beautiful. All
9:26 right, I'm going to ease you into my answer before I get a little woo-woo at the end. I think what I love about this question is that it comes at such a juicy time, where I think our collective dialogue is saying: it's 2023, things should be so much faster and cheaper and more ergonomic than
9:42 they feel right now as the standard. Just because, like, a lot of us have played at least to some degree with things like DuckDB, or some data scientists with Polars and other tools, and I've seen YouTube tech influencers come into your TikTok feeds going, look how ergonomic my setup
9:57 is, and you're just like, hey, how come that's not my reality right now? In addressing a lot of these things, I categorize them into three themes: speed, predictability, and interoperability. Some of the current and emerging things I'm seeing are DuckDB and Polars — and notice, everything I name
10:14 out loud is open source and you can literally use it today — that's intentional. I think a big part of it is that our laptops are a lot more powerful than we think. I still remember running my first DuckDB SQL query, and it broke this anchoring point within me, because I thought fast looked like taking
10:30 a dbt run from five minutes to one minute and bragging about that. But now — I literally proved this out — you can run 28 SQL operations with DuckDB and dbt in .88 seconds; I think that's 880 milliseconds. And I'm just like, oh my gosh, why isn't this more normal to how I work? Because it's faster
10:49 and just free — not even just cheaper, it's just free on my machine; maybe I pay two cents on my electric utility bill for that. I think the other thing is Polars, which is like pandas built in Rust, where I think it's opened the imagination of data scientists and people who want to live
11:04 in that dataframe-centric kind of workflow and go, oh, my Jupyter notebook can do a lot more than I thought it could at this point. And I think it's creating this collective momentum of, oh, any time I work in dev it should just be for free, full stop, just because there's so much
11:23 battle-testing out there — so many random blogs, so many YouTube influencers showing us that this is possible, that this wasn't just some one-off niche POC. Okay, so there's predictability. Two things I want to mention are data-diff and Apache Iceberg. I used to work at dbt Labs for two years before
11:39 I joined Datafold. A big reason I chose to work at Datafold is that they were solving a problem that deeply matters to me and that 100% matters to all of you, and that's less boring, toilsome ad hoc SQL in my life. I know all of you have put some filthy ad hoc joins together to, you
11:58 know, differentiate between dev and prod; some of you have just run a simple SELECT COUNT(*) — I just want to eyeball some row counts here — and you probably do that twenty times over. It's the open dirty secret all of us share, but none of us is really proud to
12:13 brag about it. I think what's so charming about why I joined Datafold is that we open-sourced a free utility called data-diff, where essentially we just melt away all that ad hoc SQL: hey, literally compare your dbt models in dev to prod — here are the mismatched rows, here are the data type changes.
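The toil being described — eyeballing row counts and hand-rolling dev-versus-prod comparisons — looks roughly like the sketch below (using the `duckdb` package; the schemas and the planted discrepancy are made up). Tools like data-diff automate exactly this class of comparison:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE SCHEMA dev")
con.execute("CREATE SCHEMA prod")

# The "same" model built in two environments, with one sneaky changed row.
con.execute("CREATE TABLE prod.orders AS SELECT range AS id, range * 2 AS total FROM range(100)")
con.execute("CREATE TABLE dev.orders  AS SELECT range AS id, range * 2 AS total FROM range(100)")
con.execute("UPDATE dev.orders SET total = total + 1 WHERE id = 42")

# The classic eyeball check: row counts match, so everything "looks fine"...
dev_n = con.execute("SELECT COUNT(*) FROM dev.orders").fetchone()[0]
prod_n = con.execute("SELECT COUNT(*) FROM prod.orders").fetchone()[0]

# ...but a symmetric EXCEPT catches what COUNT(*) can't see.
drift = con.execute("""
    (SELECT * FROM dev.orders EXCEPT SELECT * FROM prod.orders)
    UNION ALL
    (SELECT * FROM prod.orders EXCEPT SELECT * FROM dev.orders)
""").fetchall()

print(dev_n == prod_n)  # counts agree
print(len(drift))       # yet differing row versions exist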
12:27 Move on with your life; you don't need to set up dbt audit_helper — I know that's a pain in the butt for a lot of you who want to set that up. The second piece is Apache Iceberg, which is an open table format, and honestly I'm sure you folks have seen the
12:44 rap sheet there, but just one thing: schema evolution. I know all of us have felt devastated when you're like, damn, I have to run dbt run --full-refresh because I messed up something in how I track the schema evolution of this thing — or it's like, I thought append-
13:00 only would work and I could just numb myself to the other edge cases. I want to live in a world where we don't have to care about schema evolution, because some tech is doing the accounting behind the scenes, right? And now this is the woo-woo part, all right — and this is mechanically
13:14 possible today, so hear me out before you write it off completely. I saw murmurs of this in the dbt public Slack and talking with other practitioners: imagine running dbt against DuckDB, for free, in your development environment, and then, when you need to push to prod on something like Snowflake or Databricks, you use
13:33 something called SQLGlot to transpile that DuckDB-flavored SQL into that cloud's respective flavor of SQL. And then, if you want a sanity check, run data-diff to compare the two, and maybe throw Apache Iceberg into the mix if you want to have a little fun and test out schema evolution for your incremental runs. And what's so
13:50 powerful about this is — what does this sound like? It sounds spiritually like Docker and Kubernetes, to a degree, right? And I think what's so beautiful about this is that there's already a battle-tested blueprint in our hands, and I think we get to copy and paste a lot of cool stuff from it. All right, that's my spiel.
14:10 I'm a big fan — I'm a big fan of DuckDB. I'm a bit biased, because MotherDuck is DuckDB in the cloud and I joined them in February. But one thing I really strongly agree with you on: we think in terms of minutes of job duration, and that's not normal — your
14:28 smartphone has the specs a computer had, you know, five or ten years ago. So we should think in terms of seconds, right? That's like the example you gave. And if you think in terms of seconds, then we need to rethink the full
14:49 development cycle — so it's not just about dbt, it's really everything, end to end. And what does that look like? There are solutions, as you explained, with DuckDB, but how do you leverage local compute and not just push things to the cloud? I think the next step is that instead of having a local
15:10 client, you have a client which acts as a node of your cluster in the cloud, and so it works in concert with both your local compute and your cloud compute. So one example — there are two technologies I'm really looking at. One is Wasm. Wasm is
15:31 — how many people know Wasm, have heard about Wasm? Okay, not so many; it's a data audience. But in software engineering it's really a hot topic, because basically it enables you to run different kinds of code — Rust, Python, whatever — in the browser, really efficiently, really close to a desktop app. So you can have really
15:53 impressive browser app experiences without installing anything. Okay, so that's one thing. The other thing is WebGPU — who has heard about WebGPU? Okay, a few people, same as for
16:08 Wasm. So WebGPU is a web standard that is popping up, supported earlier this year by Chrome, and what it enables is basically a protocol to tap directly into the GPU of your machine from the browser. So your browser can leverage your local GPU — great, we're going to have amazing games directly in the
16:31 browser without installing anything. Yes — but coming back to data: where do we need GPUs? Come on, where do we need GPUs? Machine learning, yeah — model training. So you can imagine an experience where you're opening a browser, you have your whole development setup in Python or whatever, you're training your model locally, and then, when you need to push things out to the cloud —
16:57 you know, it acts in concert, where maybe you've done some part of the training locally and the rest is going to run in the cloud. And sure, there are questions about data movement, egress and ingress, but I think those are less of a concern, because network bandwidth has been getting much better. So there is much more to the
17:16 experience, basically, than just acting as a client that sends SQL to a cloud service. And coming back to DuckDB — that's what we're aiming for with MotherDuck: you have DuckDB, which can process locally, and you can do hybrid queries where you join data locally and in the cloud, and it
17:36 leverages your local compute. But DuckDB can also run in the browser with Wasm, so you can have your analytics environment right in the browser — and it's running locally, right? It's not just a web server, like a Jupyter notebook in the cloud; it leverages local compute, so less cost in the cloud — and we finally get to use
17:59 our expensive MacBooks properly. So that's my take — what's your take on this, Matthew? So I think the other thing I'm seeing, which fits in with the themes you've both brought up, is real maturity in orchestration, and dbt is part of that. Airflow itself, which was
18:18 kind of the first modern Python orchestration platform, has gotten much more mature and much more powerful, and you have these various other competitors like Prefect and Mage and Dagster that offer all kinds of new features for orchestration. And when you combine that with development environments that use things like Polars or DuckDB, you now
18:43 have a development environment where you can seamlessly move from your local machine into a containerized environment and basically process data the same way. You can take these containers that you develop on locally, scale them up vertically, and actually process quite a bit of data without going to a cluster environment, and I think that's quite exciting from a data engineering
19:02 perspective. I think for the vast majority of workloads, that type of fairly seamless deployment environment will meet almost all our needs, with us occasionally having to go into big data and use true cluster services like Spark. But I think we're going to see this type of deployment model more and more in the near future, with just the
19:22 maturation of multiple technologies in that regard. Yeah — thank you so much for covering that; there are a lot of tools here that can really help bridge the gaps in the data engineering lifecycle, so thank you for going over so many of them — a good breadth of tools here. But now,
19:42 now that we've talked about tools, let's talk about the human side of the data engineering lifecycle. Recently we've seen changes not just in the tooling and methods, as we just discussed, but also in the roles that people play on our data teams and our analytics teams. When you think about
20:04 it, the role of analytics engineer is super new, and it's also new that data engineers are getting more comfortable with and embracing software engineering techniques. So what does it all mean for how we work together, how our
20:25 data teams function, and the skills we need to make our data teams work well? And for the data engineers — how are their roles set to change as they embrace more of the software engineering practice? What does it all mean for the orchestration of the data engineering lifecycle? Mehdi, for this
20:46 one I'd like to turn to you: how will this shift in roles create a ripple effect on data teams and business teams altogether? Yeah — I've touched on this topic quite a lot in my content on YouTube; I felt a bit attacked by that YouTube-influencer comment. Yeah, you're one of them —
21:06 number one, baby. Oh man. But anyway, first we need to acknowledge — to recognize — that there is an issue with role definitions in data. It's a big mess at the moment, and the reason is that our job titles remain constant — data engineer has
21:27 been around for, like, the past seven or eight years; before that it was called something else — but our responsibilities and tasks have been changing over those years, and have even exploded if you're a data
21:41 engineer specifically, because you're at the center. And the other side of it is that companies adopt new technologies at different paces, and they also have different interpretations of what a data engineer should be doing in their company. So just search for data engineer job offers and you're going to see it's the Wild West
22:07 out there — it's a bit scary. You can see some people asking for dashboards, others for, like, backend Java stuff. And that's because our responsibilities have evolved while the job title has mostly remained constant. So now that we know that, and
22:27 we take the term data engineer with a grain of salt — one of my blogs and videos is literally "stop using the term data engineer" — so what else can we use? Because we still need some kind of role title to quickly communicate the responsibilities we carry. There is the data platform engineer, which we've been seeing emerge.
22:46 But before actually going to this role — one easy way I see to
22:54 understand how the role definition is evolving, and where it stands today, is to look at what has become a commodity in the data engineering workspace. What are the tasks we used to do seven or eight years ago that are now available off the shelf? The first thing — also mentioned in the
23:14 keynote — is cloud and server management. You don't have to set up your own cluster or your own cloud for Hadoop yourself; you can just create that cluster with a click or an API call. So server management goes away — but then what kind of responsibility takes its place? Because you now have
23:34 more free time. You could reduce your working hours — that's also an option — but you can basically go to a higher level: instead of managing servers, you manage infrastructure, and you manage framework infrastructure to give other people the ability to launch their cluster with a click, right? And that's where the data platform engineer has been evolving: it's about enabling, not
23:59 just managing one internal data stack, but also giving people a framework so they can spin up their own data stack, right? So that's where infrastructure as code comes in, etc. Those data platform engineers are going strong, especially in scale-ups and bigger companies, enabling other people to own their infrastructure
24:22 as well. And the other side that has been evolving is everything related to developing pipelines — you mentioned that, with SQL and the cloud data warehouse. Like, ten years ago, if you wanted to write a big data job you had to know Java and write a MapReduce job, and that was really painful — and even Spark back then
24:42 was really painful. Now the technical barrier to entry has been lowered, so anyone can write a big data pipeline in SQL. Okay, you still need to know the internals and some fundamentals — that's really important — but it's not a requirement to build the pipeline, right? So that was, again, a task a data
25:04 engineer used to do — develop in code and understand distributed compute — and now that goes away, and what's left in its place is understanding more of the business side of things. Because when I was working on a big data cluster, I mostly got the requirements from the business and I mostly implemented the pipeline, because I didn't
25:25 have time to build that business knowledge — there was so much going on around the distributed compute. But now there's the emergence of the analytics engineer, which is closer to the business. So these are basically the two roles I can see it splitting into. It's not that the traditional data engineer is gone — it still
25:43 exists, again, because it depends on how fast your company adopts new technologies and processes; that doesn't mean the traditional data engineer is disappearing. But on average it's splitting into data platform engineer and analytics engineer. And again, if you want to see where it's going to split next, you just need to watch
26:03 what's going to become a commodity in the data engineering landscape. And what we've been seeing a lot lately is writing SQL — we don't even need to write SQL anymore. Databricks announced the English SDK, where basically you say what you need and it's translated into a SQL query, and so many partners and cloud providers — including MotherDuck,
26:25 actually — are jumping on that train. So again, you can think: okay, if we're not spending time writing SQL, where are we going to spend our time next? What's our next responsibility in order to be more productive? Let me build on that to say that I think the human skills probably need to be
26:46 our top focus as data engineers, whether on the platform side or the analytics side — I think that's really where there's an opportunity to improve what we contribute to the business. In the DevOps world there's this idea of moving left, and the fundamental idea is that if you were to draw a diagram of DevOps, on the left you
27:04 have software development and on the right you have the ops side, and traditionally those were very separated. The idea of the whole DevOps movement is that the ops side is supposed to get much more involved on the left side of the diagram — they're supposed to connect with the people building the software and get involved very early, rather than treating it as, okay, I
27:21 write the code and throw it over the fence and now it's someone else's problem; and if I'm ops, I tell the software developers what the problem is but I don't help them fix it. We're supposed to move toward collaborating and working together to fix things, and I think we have very similar opportunities in data engineering. Specifically, the
27:40 traditional division of labor — especially in the days when we called data engineers ETL developers or database administrators, which were basically types of data engineer — was the throw-it-over-the-wall approach, right? Developers would write code, they'd generate data in some crazy schema, and then it was the data engineer's or ETL developer's job to try to
28:00 untangle the mess they got. And I think now, as data engineers, we have the opportunity to move left and actually collaborate with the developers who are generating the data, so that we can build analytics applications right into the code from the get-go. We can have a vision of what analytics we want inside our SaaS platforms
28:21 and our applications, and of what the business side of the house might want to do with that data right away, rather than trying to untangle a big mess. I think there's another layer where we can think about moving left, and that's from the perspective of individual contributors — analysts, data
28:40 scientists, machine learning engineers. They're not all ICs; they work in various larger teams, but often they're treated as simply downstream of data engineering, and the data engineers just sort of throw the data over the wall to those teams. So as an analyst or a data scientist, you can move left by communicating more with data engineering
29:01 and being involved in the pipeline development process early, so you get exactly what you need and can realize new opportunities with data, rather than simply consuming what's given to you. So that's kind of my perspective — I think there are many other skill-development opportunities as well, but if you can build those human skills and then build
29:20 your technical skills around them, that's where you're really going to move your career forward. Cool — I think those are pretty good. I think it speaks to something where it's less about, oh, I'm a data engineer and I know this particular tool; it's just, I am a problem solver and I happen to use data as that attack vector. And I
29:39 think we're seeing that more and more, because never has it felt more mainstream that you can make money selling and licensing data. We have clear examples of that, with Reddit monetizing its API, with authors suing OpenAI for using their books in its training data sets. We're seeing it
30:01 so front and center in the mainstream zeitgeist that I think it increases the surface area of what ownership and pressure look like for data engineers. In particular, it takes a lot of the nice-to-haves — oh, I guess testing is nice to have; oh, I guess observability is nice to have; oh,
30:18 I guess SLAs are nice to have — to, oh no, if we don't have these things, we don't make money as a company, right? There's that ratchet up. And in addition, it's curious to see what the derived incentives are afterwards: okay, who are the people I need to talk to —
30:33 speaking to the human-skills part — where I can't just say, oh, look how cool my code and data are. I have to tell a story: oh, is this a clean data set when I talk to the accountant or the CFO? Oh, this will make you money; this will save you money;
30:46 here's the direct causal impact of how this data helps you tell that story — in addition to talking to software engineers, because you need APIs and a UI on top of that data in order to show it. And I think what's even more exciting is that we all get paid a lot more as a result. I've
31:02 seen data engineering job postings — specifically at companies that sell proprietary data or something similar — go all the way to FAANG levels, like $200–300K, and that's just base. It's so exciting to see our field taken that much more seriously, and people starting to see us go from cost centers to
31:21 revenue drivers. So I'm excited for that world, and for seeing what incentives it draws out of people. Yeah — one comment on what you said: the human side of things is super important — it was always important — but I think it's getting better because we're lowering the technical barrier. As I was saying, we enable
31:41 other people to do more complex tasks, because before, when I was writing a Java pipeline and the business said, yeah, but actually can you change that filter? — yeah, good luck, go fix my Java code; it's impossible to read if you're not a Java developer, right? So now you only need to know SQL, and
32:00 as I said, tomorrow maybe you won't need to know SQL. So I think we're in exciting times, because as this technical barrier to entry gets lower and the language requirements get lighter, it's getting easier for the technical side and the business to meet each other — as I can see myself, because SQL is my interface with the business
32:21 more than it used to be. Wow, you three are making me want to transition to data engineering now. We can make that happen! You talked about a lower technical barrier and a higher salary — I'm in for that wombo combo,
32:40baby um okay well a good panel um always
32:45has some controversy to it, so I wanted to ask your thoughts on a statement that we've heard whispered in the corridors of Coalesce: should all data engineers become software engineers? should we merge the skill sets, should we blur the difference between the two roles? Sung, I know you made a YouTube video on the
33:10topic. I didn't watch it, but...
33:14okay, I appreciate the hate, wow, being controversial on your side too. but if you could give us your thoughts on that controversial topic, I think we'd all love it. yeah, I think it's already existed for a couple of years, whether we like it or not, particularly in FAANG companies or
33:34big tech, where you've probably seen on LinkedIn: software engineer, comma, data; software engineer, comma, data platform; or, if they don't want to say data engineer because they're afraid of being pigeonholed, data infrastructure engineer, data platform engineer. and I think one of the big distinctions between data engineer
33:54and software engineer, from the front lines and anecdotally, is just whether you make money
34:03or sell a customer-facing product; that feels like the tone of it at times. but I think data engineers by default have to, in order to own and live up to the responsibilities of actually selling data to everyday people like you and me, where you essentially have to build tooling. you've
34:19noticed some people building random data tools in Rust; part of it is for resume hype, but another part is that they want things to go super duper fast, right? and I've seen some companies, I remember talking with Netflix a couple of years ago, they forked their own versions
34:34of Spark and Parquet, and I was like, good God, that sounds like so much work. but it's because they see the need: we need to build our own tools in order to be better data engineers. and I think a lot of us are going to organically see some of that. a lot of you have probably
34:48built dbt packages that are internal or public, a lot of you have built little command-line utilities in order to do your job, and those are the design behaviors and incentive structures that software engineers live and breathe every day. so they already are, is my
35:07answer. so my take on this question is: it depends. I guess any controversial answer comes down to it depends, but one of the common issues I've seen for software engineers transitioning to data engineering is that if their background is in building fairly simple services, just call-and-response API services, then they'll tend
35:27to try to deploy the same approach to data, and they'll tend to process one record at a time, which really doesn't work, as anyone who's worked with data knows. you've got to think in a totally different, parallel mindset, which SQL is actually extremely good for; SQL is a set-theoretic language. but I think at the same time, with a
35:48strong background in software engineering and a little bit of retraining, you can become a very strong data engineer, and you can move up that chain of abstraction to think about data in a different way, rather than single records: as something you process in parallel, using these tools that mostly do a lot of the work for you behind the
36:05scenes, so you can think about the data itself. now there's another angle on this, which is that we're seeing more and more frameworks which take away a lot of the work of moving data out of applications into analytics, where you basically can just hook your application right into a tool in the cloud and you immediately get a dashboard, or you
36:25immediately get a table off the stream of events coming out of your application. and I think that's the really powerful trend for merging these professions, where I as a software engineer will just be able to think about my application. of course I want to think about how I'm designing the data events I'm generating, so they're
36:42good for the type of analytics I want to create, but frankly, I hate to say it, I think that will eliminate a lot of data engineering jobs. if I can just run a CREATE TABLE right on top of my application events and immediately have a dashboard in the SaaS product I'm building as a brand-new
36:58startup... of course big companies will still need data engineering, but that's going to make the job a lot easier, where I can immediately have customer value coming out of analytics right off this event stream. yeah, I think this is already happening, right? we have seen some responsibilities, as I was saying earlier, shifting away because new
37:17products are arriving. but for me, it's more that data engineering is a kind of specialty of software engineering, so they need the same foundation. and we've seen that; there is a reason why, again I'm bringing it up, but at the keynote there was a software development life cycle, right? so we need to get inspiration
37:39from that. Joe Reis actually posts sometimes on LinkedIn: if you want to know what's happening next in data, look at software engineering. and I think there is a lot to take, and we are taking a lot, like versioning. dbt brought out the software engineering best practice of versioning SQL, and that's
37:59basic and normal for any software engineer, but versioning SQL is still pretty new for data teams, right? before, it was just, well, we run it there, and sometimes some backend people, for maintenance or other reasons, have scheduled queries. anyway, so there is that. the CI/CD stuff, same thing, that's a foundation
38:21coming from software engineering. but as you said, not everything is the same working in data. still, I think we need to take inspiration from the development life cycle. if you look at how you can build a website today, it's pretty easy; there are standards popping up. you start
38:41a React application and you get really quick feedback in your development cycle, you don't need any cloud dependency, you write something in your JavaScript and you see the page in the website directly. building a data pipeline is far from responsive like that, and I think we need to get to that
39:02experience. but again, we have the challenge of working with data, and we cannot just work with mock data. a front-end developer doesn't care about what data they're showing on their website, but for a data person, yeah, we need data, and actually working with synthetic data is pretty hard, so you sometimes need production data
39:21because you don't have good staging data. so there are a lot of challenges around this, but I feel the foundation remains the same. and we can also see it on the vendor side: vendors like dbt bringing versioning of SQL and CI/CD pipelines for production SQL pipelines. but we've also seen it on the business intelligence tools,
39:42so now we have dashboards as code, and your dashboard is a software asset. I think people need to realize that: if you have an executive looking at a dashboard, it should be versioned, you should have different environments, you should be able to quickly roll back to a certain version. it's not just a UI thing
40:06where someone updates or refreshes a connection. that's just software engineering: working with different environments, versioning, and so on. and some dashboarding tools now provide the ability to have your dashboard as code and embrace the software engineering foundation. so again, I think we still have a long way to go to get there,
40:28but for me, we're just taking the inspiration of software engineering as a foundation and applying it specifically to data engineering. yeah, so we have our answer: it depends. but there clearly is something to take from software engineering principles and bring into the data engineering life cycle. I have one
40:54last question for you. I'd like to take the time to think about the future, to dream a little bit, and ask you: if you could change one thing about the data engineering life cycle, what would it be? how would it make your life easier, and what do you think the future holds for the field?
41:15Matt, could I start by having your take on that? oh, let's see, one thing that I could change... I'll go back to what I said in my first answer, and that is, I think just more training on best practices, and the establishment of more best practices in our industry that are general and not tied to specific tools,
41:37will make the discipline much, much better. it will solve a lot of the problems we're seeing right now, like exploding costs and issues with privacy. just having more general training: if your colleagues have more training, your job as a data engineer is going to be better, and if you have more training, your life
41:54is also going to be way better. okay, Sung, any thoughts? yeah, I think for me, I hope it just becomes so absurdly normal that we're all selling data. and I say that because of all the deep incentives that we'll derive as a result, and whether we like it or not, there'll be this inner momentum that things
42:16should be a lot faster and cheaper and more ergonomic, and I'd be excited to see how all of your imaginations manifest that. yeah, for me, I would just cut the fat in terms of tooling. I agree that the human side of things is really important, but I think we are just struggling with the complexity of all the
42:38integrations that we need to maintain, and so on. and I want, as I said, a better development experience. as a web developer, I'm building a website locally, I push it to the cloud, done; whereas here it's far more complicated. and I think the general
42:57mindset is going in that direction, and I'm really glad I saw that at the keynote today. but we should really think in terms of seconds, not hours and minutes, for the development life cycle; and for a data task, for the business, that should translate into hours or days,
43:17not weeks or months. yeah, I like all these aspirations. on this note, I think it's time to close out this panel and open the floor for some questions. if you have any questions, it's time to ask them to our
43:37speakers.
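Mehdi's "seconds, not hours" development loop can be sketched locally with nothing but Python's standard library. This is a toy stand-in for that idea, not anything shown at the panel: an in-memory SQLite database plays the role of the warehouse, and the table names and sample rows are invented for illustration.

```python
import sqlite3


def run_pipeline(conn):
    """A tiny local loop: ingest raw events, run a set-based
    transformation, and return the serving-layer rows.
    Everything happens in-process, so the feedback is instant."""
    cur = conn.cursor()
    # Ingestion: land some raw events (invented sample data).
    cur.execute(
        "CREATE TABLE raw_events (user_id INTEGER, amount REAL, status TEXT)"
    )
    cur.executemany(
        "INSERT INTO raw_events VALUES (?, ?, ?)",
        [(1, 10.0, "paid"), (1, 5.0, "refunded"), (2, 7.5, "paid"), (2, 2.5, "paid")],
    )
    # Transformation: one declarative, set-based SQL model --
    # the kind of statement a tool like dbt would version and test.
    cur.execute(
        """
        SELECT user_id, SUM(amount) AS revenue
        FROM raw_events
        WHERE status = 'paid'
        GROUP BY user_id
        ORDER BY user_id
        """
    )
    return cur.fetchall()


if __name__ == "__main__":
    with sqlite3.connect(":memory:") as conn:
        # Edit the SQL above, rerun, see results in milliseconds:
        # no cloud round-trip in the inner development loop.
        print(run_pipeline(conn))
```

The point is not the tool (a real project would use a proper local engine and a versioned transformation framework); it is that the whole generation-storage-ingestion-transformation-serving loop can run on a laptop fast enough to iterate in seconds.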
43:42yes, hi. so one thing that's been a theme of your discussion is that more and more of the technical pieces are being eroded away by developments in technology; something that came to mind was how the lost art of normalization has been largely displaced by expanding data sources. do you feel like there
43:58needs to be an active concern about protecting or maintaining the value of these lost arts over time, or do you think it's not really an issue for the future of the industry that some of these skills are just completely unnecessary anymore? I'll speak to normalization specifically. I think part of the problem we've run into with
44:16normalization is that we're still waiting for the next, maybe, great American data book, or something that tells us about data modeling. and what I mean by that... (your book?) no, no, not claiming that. what I mean by this is that in the early days of data warehousing we
44:37had Kimball and other methodologies that really carved things out and said, not only here are the normal forms, but this is how you should actually normalize your data in practice to give it meaning; here are facts and here are dimensions. and then we got to the era of columnar databases, and Hadoop, and with the data lake we just threw all that out
44:55without a replacement. I remember going through some Google training for BigQuery a few years back, and the slides literally just said, denormalize your data. and I'm like, what does that even mean? are we saying one big table, or what? I know, roughly speaking, what normalization is; it has a specific definition. but what does
45:15it mean to denormalize? I don't think one big table is the answer. I think we need to find a good hybrid approach that works with these new tools but still defines some degree of normalization that also adds meaning to the data. and so when I talk about best practices, part of it is training, and part of it is
45:32putting more stakes in the ground to say: if you're using a columnar database, if you're using Parquet or BigQuery or Snowflake, these are best practices; they're not absolute, but work within these frameworks to define how you're going to model your data and do a degree of normalization, without the extreme normalization that we had in the
45:52era of row-based databases. it's a perfect answer, I have nothing to add to that. I don't think I answered the whole question, though; there were the lost arts, and we didn't talk about the lost arts. thank you. my question is, I think you talked about, if it's fair to say, issues with bureaucracy, communication, that sort of stuff,
46:17and the benefits of new technology, things like that. I think Mehdi maybe just touched a little bit on the problem of the explosion of our stack, so could you talk more about the technical anti-patterns, maybe one level lower, about what y'all see as concerning? yeah, so I think you were mentioning that we need to build our
46:40own tools to speed things up, and a lot of data teams built dbt-like tools before dbt was a thing. I did it in Scala; I know Twitter and Reddit did it too. and I think the
46:56problem there is that afterwards you end up with complexity in certain companies, where you have isolation rather than common standards that do the same thing, right? and the challenge is that if you build that internally, as I did, now it's so big that it's almost impossible to get it out and get on the
47:17dbt train, even if we would get more features coming from the community and dbt. so one anti-pattern I can see is that people tend to build tooling too fast internally, without looking at, or contributing to, what's available in open source. it's kind of a normal pattern where you have ten
47:41different people building the same framework, a VC invests, and only one will survive, and ten people jump onto that train. but I think that's a bad side effect, where we tend to build too many things internally without contributing. so I think one key solution there
48:01is to go more through open source and have this mindset of contributing. and it shouldn't be on your free time, because sometimes that's how it's interpreted: oh, I contributed to open source for fun. no, that should be part of your work. if you manage to get a technology approved within your
48:21company, you can spend time contributing to that open source project as part of your work, on paid time. so I think this is something we need to see as a trend. I don't know if you have a take? yeah, I think the big thing is it's so easy to pre-optimize: look at a couple of Medium
48:38tutorials and go, I guess this is my tech stack now, right? I guess this is how I'm supposed to do my job. versus when you take in that more human element, starting from first principles of: what are the fundamental problems I'm solving for? is it simply, hey, what's revenue by month? okay, I probably don't need Spark
48:55Streaming in my tech stack, right? hey, my team, we're only going to hire analytics engineers, we don't want to maintain a roster of data engineers or data analysts; that changes how you make your decisions. and so trying to find that right level of challenge, to skill set, to how much outcome you
49:17need to drive, that is a better way to run your decision-making process, to go: okay, this makes sense, to go with dbt Core versus Cloud; oh, this makes sense, to choose Airflow, Dagster, Prefect, what have you, right? because what matters at the end of the day is that tech is not our salvation,
49:34people are, right? and so what's important is: do you enjoy working with these tools every day, or are you just doing this for a resume? are you just doing this because you saw a blog, and you're insecure, and you're afraid to carve your own path? and it's okay, because things are fast enough
49:50and cheap enough that you can fail forward and not feel like you have to be attached to a particular doctrinal tech stack. so,
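Sung's "revenue by month" example is also a nice test of the set-based mindset Matt described earlier: rather than looping record by record, one declarative statement does the whole aggregation. A minimal sketch, with invented column names and sample data, using only Python's standard library to contrast the two styles:

```python
import sqlite3

# Hypothetical orders; in practice these rows would live in a warehouse.
ORDERS = [
    ("2023-01-05", 100.0),
    ("2023-01-20", 50.0),
    ("2023-02-02", 75.0),
]


def revenue_by_month_loop(rows):
    """The record-at-a-time style a backend engineer might reach for."""
    totals = {}
    for order_date, amount in rows:
        month = order_date[:7]  # 'YYYY-MM'
        totals[month] = totals.get(month, 0.0) + amount
    return sorted(totals.items())


def revenue_by_month_sql(rows):
    """The same answer, set-based: one declarative SQL statement
    that the engine is free to parallelize behind the scenes."""
    with sqlite3.connect(":memory:") as conn:
        conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT substr(order_date, 1, 7) AS month, SUM(amount) "
            "FROM orders GROUP BY month ORDER BY month"
        )
        return cur.fetchall()


if __name__ == "__main__":
    # Both styles agree on this small input; only one of them scales
    # the way the panelists describe.
    print(revenue_by_month_loop(ORDERS))
    print(revenue_by_month_sql(ORDERS))
```

For a question this simple, neither Spark Streaming nor a custom tool is needed; the design question is which mindset scales, and the set-based query is the one a warehouse can optimize.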
50:02yeah. oh, okay, so I guess I was just wondering, talking about data engineering standards and normalization and all this stuff: how would you measure the performance of your existing data engineering standards, and how often would you revise them? that's a good question. that's a hard question; do you want the McKinsey answer? I'm just kidding. I think what I've seen,
50:31and what I've been pushing internally too, is some kind of community or tribe at different levels. it could be a data engineering tribe, or, even more niche, an analytics engineering tribe. and rather than saying we revisit, or we do a training, every quarter, having it more weekly, as part of
50:54your work, basically: learning is part of your work, and keeping up to date with what's coming is also part of your work. and it's a lot, given how fast things are changing, right? so what I've seen internally in companies is a rotation of people
51:13presenting something, and so on, and then the discussion triggers, and from there people sometimes say, oh, we figured out we'd like training on modeling, maybe we should bring that up. but what I've seen is that the trigger is always a kind of weekly ritual of sharing things, or watching
51:34what's going on, and then that triggers the need to do a
51:41specific training to fill the gap. yeah, that's great. and what I'll add to that is, I think part of what we need to do is look at where our profession is being criticized for failing right now: things like out-of-control costs, poor management of private data, exploding model complexity. and then if we can evaluate
52:04those things and target our training standards toward those particular issues that are top of mind right now, that's the starting point. and then keep those standards up to date as new technology changes come out, and just keep evaluating. I don't know, that's off the top of my head, that's what I'm thinking of, but that's a very good
52:22question, I'll have to think about it more. hey guys, Lindsay from Sakota. you were speakers at MDS Fest, so thank you very much. I want to follow on from that question a little bit: if you were to think about building a training program for data engineers, what would you include in
52:44that, what do you all think? okay, I would say, if you
52:52look at the modern data stack, there are specific layers that are important: ingestion, transformation, and activation, which is the data warehouse and BI; and then you have a baseline layer with DevOps, CI/CD, and orchestration. so I think if you orient your training around those four layers, that's the most important.
53:14that basically covers it end to end. now, what are you going to put into those layers? that's a big question, because there are so many tools. I think we shouldn't forget about foundations, so I would start with the basics, being technology-agnostic. if you talk about SQL,
53:35you pick a technology for that, and then next, I think it doesn't matter if you take a fresh new tool or something that's already present in a lot of companies, as long as those foundations are there. so for example, if you use a
53:55SQL tool, basically, to run your
53:59SQL pipelines, like dbt or a competitor, I think it doesn't matter, as long as you understand what dbt does behind the scenes and what the goal of it is, right? I think that's the most important part of the training, and then to say, hey, by the way, we're going to use that tool because we are not going to
54:14build it ourselves, and there are other tools available there, you know. so I mentor a couple of people at a time; I was mentoring eight people at the same time, and I learned something: especially in this job market, portfolios and those training programs only go so far, because hiring managers sniff out really quick, oh, you
54:36have good academic theory but you don't know what it's actually like to do anything. and so I've pivoted my approach to training people to be more like: find a problem that you can use data to solve at your job, whether you're a data analyst or a project manager, or a manager that's been out of the
54:57technical loop and wants to be an IC again. and then, okay, that's the problem; now let's reverse-engineer how to solve it. oh, you probably need a dbt in there, or you need GitHub in... I mean, with one of my mentees, my first thing was, convince your boss to buy GitHub; that's step one, right? and
55:13I think what's powerful about that is they realize it's more than just clicking sign-up, getting my username, making a commit, and seeing my README for the first time; it's talking with the manager and figuring out what it takes to convince someone with political power that it's worth the paperwork, the time, the
55:28energy, the training, and the emotional labor to do that in the first place, right? and I think that understanding helps you see, oh, it's more than just, hey, is Airflow or some other orchestrator worth the technical lift: is it worth the emotional labor, do I have the social capital to even make this happen, and if I do, am I
55:46willing to spend the time role-modeling this for the company, gathering some random YouTube videos and regurgitating that content for the rest of my team, right? and I think that serves as a much better element for actually learning, because you feel viscerally what is possible. and then two, when you're
56:04interviewing for jobs, people can feel it in your body language: oh, this person has lived through things, I see their battle scars, and I know this is more than just academic theory or a Medium tutorial that they read, right? and what I'll add to that is, these are great answers, but I think there
56:21still is a place for the kind of back-end academic theory. one thing I see a lot these days is that because the tools are so easy, because Snowflake is pretty much plug-and-play, you just drop your SQL query in there, we've forgotten a lot about how databases work. and most of the time it doesn't matter; most of the time
56:39you don't have to think about how the database works, you just throw your SQL query in there and it runs. but then when your Snowflake costs start exploding, it's quite often because you don't know how a columnar database works, and you're running a query that was designed for an indexed database on a columnar database, and maybe not
56:55clustering properly. and so that's where there still really is a place for some theory, and I think that theory should be presented in a fairly general way, not tied to a specific tool: theory about how indexes work, how object storage works, how storage systems work, columnar databases, all these things that underpin our technology. those
57:13really come in handy when things start to go wrong. yeah, thank you for your answers. we don't have time for any more questions, and I see some people leaving already, so I will wrap up, but you can find all of us right after if you have more questions. yeah, if you have any more
57:29questions, you can find all of them right after. and if you want to continue the conversation, I encourage checking out our three speakers' resources: Sung made a video about whether software engineers should become data engineers, there is Matt's book on the fundamentals of data engineering, and I think Mehdi, you also have a video
57:51on the topic. plenty of resources. thank you so much for coming, and we'll wrap this up. thanks
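As a postscript to the normalization answer from the Q&A: the hybrid approach argued for there, some degree of Kimball-style modeling with a denormalized serving layer derived on top, can be pictured in a few lines. All table names and rows below are invented for illustration, and SQLite stands in for the warehouse; it is a sketch of the idea, not a prescribed implementation.

```python
import sqlite3


def build_star_schema(conn):
    """A tiny Kimball-style model: one fact table plus one dimension,
    with a denormalized 'one big table' derived as a view for serving."""
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'AMER');
        INSERT INTO fact_orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 40.0);
        -- The wide serving table is derived, not hand-maintained: the
        -- normalized tables stay the single source of meaning, and the
        -- denormalized shape is just a view over them.
        CREATE VIEW obt_orders AS
        SELECT f.order_id, f.amount, c.region
        FROM fact_orders f JOIN dim_customer c USING (customer_id);
        """
    )


def revenue_by_region(conn):
    """BI-style query against the denormalized serving view."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM obt_orders GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


if __name__ == "__main__":
    with sqlite3.connect(":memory:") as conn:
        build_star_schema(conn)
        print(revenue_by_region(conn))
```

The design choice this illustrates is the one from the panel: keep facts and dimensions normalized enough to carry meaning, and let the one-big-table shape be a cheap derivation for the tools that want it, rather than the model itself.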