Fixing SQL Spaghetti: Effective Refactoring Techniques
2024/06/13
Featuring: Lindsay Murphy
Lindsay Murphy and Mehdi tackle a big challenge for data professionals: dealing with old SQL scripts. How can we make messy SQL code easier to read and maintain? We'll discuss pragmatic things you can do, and we'll tackle an example.
Transcript
4:41 Hello everybody, how are you doing? My name is Mehdi, and welcome to another episode of Quack & Code, where we do quack and then spend a bit more time typing on the keyboard, coding. Today we have an interesting topic I want to dive into: SQL spaghetti. I'm sure a lot of data
5:06 folks have had issues with
5:10 it, and the goal today is basically to see what is possible in terms of
5:17 fixing SQL, not really the language, but long SQL scripts. And I have a special guest with me today, as always, because I always need a copilot; Copilot is not enough, I need a human copilot. And so I have
5:39 Lindsay, who is with me right there. Welcome, Lindsay! Hello, thanks for having me on the show. Yeah, we're super happy to have you. We already have a few comments from
5:52 people saying that their worst nightmare is actually recursive joins in SQL, but we
6:02 are going to dive into other things, maybe not recursive joins; it's a deep topic. But first, would you like to introduce yourself quickly? Yeah, so I'm Lindsay Murphy, I'm the Head of Data at a company called Secoda, and I'm pretty active on LinkedIn, so you might see me posting on there quite a bit. I also have a weekly podcast
6:23 called Women Lead Data. I've been working in data for about 13 years now; I was an individual contributor for many years and moved into leadership roles over the past few years. In my Head of Data role I'm a data team of one, so I set up our internal infrastructure and manage things all
6:41 by myself. I have some help from another team member at Secoda as well, but yeah, I manage our dbt project, build out our infrastructure, our BI tool, all that good stuff. I like that, 'data team of one'. Can you elaborate a bit? Because you use that naming, and I think it really gives
7:00 a good grasp of what kind of tasks you are doing given the size of the team, right? But can you maybe elaborate a bit more for people who can't really picture it, or people who have never been in a team of one, who have always been with other people? Yeah, I've actually done it a few times now, and
7:18 I've done that first data hire role a few times. Usually, if you are a team of one, you probably are the first hire. I think it's a big opportunity; it's definitely a good role for people to grow into and show your value. Don't make the mistake I did and go
7:34 into that role too early. I joined a company where I was the first data hire, and I would say I was mid-career at that point, and I found it a little hard, because you're kind of alone, and if you don't have community you're learning things on your own, and it can be a bit challenging.
7:47 It's definitely a really exciting and interesting role: from one day to the next you could be doing something totally different, like choosing which tools you want to use and building your infrastructure, versus actually building models and doing insights and all kinds of different things. So it's really good for honing your skills in a full-stack
8:05 role. But you have to get really good at prioritization, you have to get really good at pushing back, and you also have to know when you hit that point where you're not really seeing the forest for the trees anymore, and you're so in the weeds that you can't really plan the next
8:17 three to six months; that's when you need to start hiring a team, I think. Yeah, and I'm just curious there, it's not the main topic, but how do you manage prioritization on a team of one? Because I think you said something that resonates with me,
8:38 learning to say no. Maybe can you give your thoughts a bit on that? Yeah, it's definitely hard, and I don't claim to have it all figured out by any means, but we do two-week sprint cycles at Secoda. I try to save longer-term planning for quarterly, and then we break
8:56 things down into cycles, and within that I try to stay on a cycle as much as I can. Stuff always comes up, especially at a fast-paced startup, so you kind of just have to get good at understanding what's urgent and important versus the stuff that just feels really urgent and important. And
9:11 then I think the biggest thing I would say is: don't let back-burner issues fester for too long, and if Cam from Secoda is listening, these are the things that we want to get to but haven't gotten to, and then you kind of wait for them to become a really big problem. I think those are the things
9:28 that, when you're a team of one, can be hard to get your time into. But as much as possible, try not to let those issues go for too long, because they'll become bigger problems for you later on. Yeah, it's true. Another thing I wanted to ask you, just because
9:45 you went to the Snowflake Summit, before we get into the SQL refactoring topic: what are your main highlights? And did you go to the Snowflake Summit last year?
9:59 Yeah, I was at the one last year. I'm curious to hear your highlights of this year and what drastic change you've seen compared to last year. Yeah, that's a good question. I should caveat it by saying that both last year and this year I didn't make it to any sessions, so I feel like my experience is totally on
10:17 the vendor floor and just talking to people. I find Snowflake Summit is actually a little harder for folks who haven't been: it's kind of harder to get into sessions, you have to sign up beforehand, and they fill up really quickly, so it's a little bit of a different conference format. I feel like last
10:33 year I didn't really learn that until too late, so I think I made it to one session, and then this year I didn't even make it to any. But I've been noticing a bit of this conversation online, and I kind of noticed this as well: there's this divide in the world
10:47 of AI. There are a lot of vendors and tools out there that are going full force ahead into adding AI into their product, trying to get as much value out of it, pushing it and thinking it's really important, versus maybe the tools that don't feel it's as valuable,
11:02 or the data people who are more skeptical about it. I feel like there's this weird polarity to it, where some people are really bought in and others maybe aren't. So yeah, I think we're still seeing that play out at a lot of conferences.
11:18 Yeah, and then I think... yeah, sorry, go ahead. No, no, I totally agree. I feel like next year will probably be the make-or-break year for AI, and we'll see. But I think the challenge is that the AI race came during a downturn in the economy, I mean not really for
11:40 everyone, but data teams were pressed on budget, and so I think it kind of created a mess. I was at a conference this week, Berlin Buzzwords, which is a search and data conference in Berlin, and a lot of people said their roadmaps have been completely changed by AI, but
12:03 they're not sure it was really worth it. I don't know if you have the same feeling. Yeah, I do. Everybody knows the 'hype cycle' term that we use a lot, but it kind of feels like we're in the frenzy of it, where everybody's reallocating budget
12:18 and prioritization to AI projects. And there were a lot more people that I spoke to, especially people looking for jobs, or who were in a transition between roles or something like that, which I think there was a little bit less of last year, or maybe
12:33 we hadn't quite gotten there yet. So yeah, I think it's definitely a strange time in the industry, but I'm sure we'll see it come out in the wash in the next 12 months; I'm sure next year it will be very different. Yeah. And I'm curious, if your roadmap has been changed because of AI, please leave
12:49 a message and tell us whether you're convinced or not, because I think we are not the only ones. So, moving now to the main topic,
13:03 which is SQL spaghetti. I like to use this expression; do you use it too, or how do you refer to bad, long SQL scripts? Oh yeah, I would probably just call it a mess. Sometimes it's like a model where only you and God know
13:26 what it does, and you don't really understand how it works anymore; when you go in there you just kind of want to shut the door and get back out. I don't know if there's a word for that, but it's a bad feeling. And so, have you had the experience of refactoring
13:41 a significant chunk of SQL in
13:45 the past, and how did that happen? Yeah, when I joined Secoda we went through a bit of a migration: we were on a Postgres warehouse previously and we moved over to Snowflake, mostly because we just wanted to practice using Snowflake, not because I think we necessarily needed to at our
14:02 stage yet, but many of our customers are Snowflake users, so we wanted to really get a good handle on the tool ourselves. Moving some of our legacy code over into dbt Core and into Snowflake, there was some refactoring there. And then there was a project at my last company where
14:22 one person on my team was doing the refactoring, I was kind of overseeing them and doing the code review, so we were both up in this refactoring nightmare, and that was a pretty challenging project: log files that were supposed to be log files, and then it
14:36 turns out they weren't really logs, they were a little bit different, and yeah, a lot of conversations back and forth with the engineering team to understand that. Yeah, I think that's an interesting point. We have a couple of tips to discuss for pragmatically refactoring SQL, but in the end you're
14:56 also kind of playing detective, because probably the person who wrote this SQL is not there anymore, something like that. What is your feeling about that, have you experienced it already? Yeah, there were a couple of things where we learned, in a past role, that someone who had written the
15:16 code had left the company, and it was like, okay, we thought this was well documented, but then when you really start to go into it, and that person's not there anymore and you can't ask questions, you start to realize we don't actually know what this person was thinking or what they were doing at this point. That's, I think, the hard
15:29 lesson you might learn from not documenting your code very well. And even for myself, I've written code that I come back and look at six months later, and I'm like, I don't understand what that does anymore, I don't remember what I was thinking. It's a good reflection for yourself: don't
15:47 just document for other people; because people say 'I don't like documentation', think about yourself, you know, coming back, and it's the documentation, the setup, how to reproduce. But yeah, it's true that part of refactoring a pipeline, and here we're basically going to talk about SQL itself, but part of refactoring a pipeline is understanding how you
16:15 touch down on the source of the data, whether it's the most appropriate way to do it, and for that you need, as you mentioned, to really put on your detective hat and go talk to people. Yeah.
16:30 I see a comment from Colleen: 'the code is the documentation'. Yeah, so true. But
16:39 I think this is actually even harder when column names are not explicit. Looking at a source database, if a few folks in the audience work with SAP databases, I'm not sure what the state is now, but I did a couple of years back, and
17:00 column names are purely technical, they don't mean anything, so if you don't have any documentation on that, it's gibberish. Yeah, for sure. So, we said talking to teams about how
17:22 the data is being fed to the pipeline is important as part of refactoring, but now, zooming in on SQL itself: when do you feel it's time to refactor? What kind of signs give you the itch to refactor a SQL
17:45 pipeline? Yeah, there can definitely be a couple of different scenarios. The one that I mentioned before would be that we were migrating to a different tool: previously someone was building in Postgres, and they were just rerunning views whenever they needed a query, but we wanted to move to dbt
18:05 and then move to Snowflake. In that case we could have just lifted and shifted the queries and probably been relatively successful, but you want to adopt best practices for dbt and start to break the models down into component pieces. So that's one example: if you
18:22 are doing a database or data warehouse migration and also moving to something like dbt, that would definitely be a scenario. Another one could be if it's a performance thing, and we talked a little bit about this beforehand; it's not necessarily that you're trying to get a model
18:40 totally optimized, but it might be that the model is a bottleneck in your whole pipeline. Oftentimes these bottlenecks tend to be the core concept of the business, so at my last company we had a model called 'consults', and it seemed like basically every single model in the company went
18:58 into that model, or every single data source, not every single one but a lot of them, and then all of our core reporting came out of that model. It was very, very important to the business, so for that model to not function well just slowed everything down and made everything
19:13 perform not as well. So that was one we had to think about refactoring, and it was also pretty high stakes. And then the last one is probably more when your code isn't really readable, or your formatting is off, or things are just not easily
19:30 usable for your team. Maybe you've got the same concepts in multiple places in your dbt project; you might want to think about refactoring some of that to bring some efficiency. You can improve the performance of the whole pipeline, but you can also just make it easier to read, so that future you or the
19:48 rest of your team is not afraid to go in and touch anything, instead of it being very difficult to actually work with the project. Yeah, that's a point I can relate to, and I can make an analogy, because I'm moving apartments, and when you move apartments or houses it's the best time
20:07 to clean up, right? You sort things, like 'I don't need this' or 'maybe I can optimize this', and it's the same when you migrate to a different database or use another tool: you have to do some work of moving things around anyway, so it's the best opportunity to revisit things. And then the second
20:29 sign, as mentioned, I think is losing velocity. You were mentioning being a team of one and hiring; I've also been in a team of one, a data team, briefly. But also, when I have a new hire on my team, the first thing I do is go through the documentation to
20:52 see how this person can get up to speed and help me out as fast as possible.
21:03 So I go back to the documentation, and it's all a question of velocity: people contributing to the pipeline, or yourself, but also new people joining your team, and at that point it's better for them to be able to jump in. We have a few comments here, Eva saying we're getting more
21:26 people applying for analytics jobs, and with some prompt courses they can actually do analytics. We haven't talked about that, or maybe we can talk about it right now: do you actually need to know how to do refactoring now that we have ChatGPT and the others? Yeah, I had a friend message me
21:47 when they saw that we were doing the session, and they were like, 'AI can just fix your spaghetti SQL for you', and I was like, listen, sure, it probably can do a lot of that, but you still, as a person... I think it's the same with any type of coding: if someone doesn't really know what they're
22:02 doing and they just start asking AI to do all of this for them, you're probably going to have a bad time; it's probably going to turn out to either be inaccurate, or things aren't necessarily going to be set up exactly the way you want. That's not to say that AI can't help you and speed things up;
22:16 I think with some very rigorous prompt engineering and some good conventions and guidelines on what you want to do, it can probably help out with a lot of that, but you still want to have a basic framework of how you want to think about it, because at this point right now, and I'm sure
22:31 this will change over the next 12 months and five years or whatever, AI still needs a lot of guidance from humans to be able to do things properly and do a good job of it. So if you don't really know what you're doing and you're trying to get AI to help you, then you're probably going to
22:45 come out poorly on the other side. My answer might be different in a couple of years, but we're not there yet. Yeah, and I think something that AI will still miss for a long, long time is enough context, business context, right? You can say, 'oh, but what if AI knows the business?' Yeah, but who feeds the business
23:06 context, who feeds that? What's the definition of a customer? That's a human thing, because a customer has a different definition in different companies, right? So I think, because of that, there is also a good opportunity for people to move a bit more toward the business side, and some of
23:28 the technical part, like 'make that code more readable' or 'split that SQL query into multiple queries',
23:38 could be handed off to an AI, but again, you need to know what you're doing, and you need the business context. I'm curious, what's the worst situation you've had with business understanding when writing SQL? Well, there are probably so many
24:02 situations that I've probably blocked out of my memory, to be honest. But I think it's more about understanding the business, but then also understanding the nuances of the data in your business, because a lot of times how the data is collected, and who's collecting it, matters a lot. A lot
24:19 of times engineering teams are obviously responsible for building production databases, so they're thinking about how to build the data to service a product, not to service analytics, most of the time. The analytics team is then just going and taking the data that's available and using it however we see fit for downstream purposes. But that goes
24:37 back to that conversation about data contracts and data producers and consumers: if they don't know how we're using the data, and we can't really communicate to them that 'this is what we need, and we can't have these things change'... I think that's a lot of business understanding, and then also data
24:55 collection understanding, and then agreements with data producers, which often get overlooked. I think those have probably been the biggest headaches for me, and I'm sure for most people on the call. Yeah, share with us: what are your challenges in understanding source data when writing your SQL queries, and if
25:19 you have had some pain with understanding business concepts, how did you solve it? Because I feel like in the end it's always just talking to people; that's what I find horrible, because by default I don't really like people, you know, so if talking to them
25:40 is the default solution, that's a bit scary. But yeah, I'm getting PTSD from this conversation. Yeah. I don't know what you
25:53 feel. So, let's go. We
25:57 talked about when to refactor, and
26:02 now the question is basically what to refactor in the SQL query. You prepared a couple of slides; do you want to jump into those? Sure, share my screen, or I can share directly. Oh, you've got just me, yeah. I'm going to share it also
26:28 with the chat if you want to follow along. So, just a couple of notes and practices, and after that we'll dive into the code. Let
26:43 me share, and I'm not sure which screen I'm sharing, because I've been traveling, so my setup for live streaming might have changed. Let me just... I was on
26:55 vacation for two weeks, and I was surprised that I remembered the password to my laptop. It's always the worst when you get back; you know you had a good vacation when you need to reset your password. Yeah, I know what you mean. So let me
27:15 just display this. It's a live fix, you see, that's a live fix right there. Everything is live, no cheating; now you know it's not pre-recorded, right? We are being honest with you. All right.
27:44 So, why refactor code? We've been talking about it, but Lindsay, you talked about it. Yeah, we touched on this already, but I think, like you said, the biggest one is, just like when you're moving your house or your apartment: if you're moving to a new tool or you're moving to a new
27:57 warehouse, that's a really good time to think about refactoring your code, because in some cases you might have to. If you're moving from Postgres SQL to Snowflake, some of the functions don't work, so you're going to have to rewrite some SQL anyway. But then also, maybe you're going from stored
28:13 procedures to dbt; that would be a pretty big migration, a pretty big change, but you want to be thinking about adopting best practices as you're moving to these new tools, and using them in the ways they're designed for. The second one we talked about was code performance, and I guess that's more like...
28:31 it could be a specific model that's degraded, the 'model-neck', I think Jeremy Cohen from dbt came up with that term, but it's a model in your whole dbt project that becomes a very big bottleneck, and that can slow down the whole system. But it also could
28:47 be that maybe something worked well before, but now there's a lot more source data flowing into a model and you need to think about improving the performance; refactoring could help with that, or it could be other methods, like moving to an incremental model or something else like that if you're using dbt. And then the spaghetti DAG, which is
29:03 really more that your project has become a bit of a mess: it's a little hard to navigate, maybe the documentation has gotten out of date, maybe there are models that nobody really knows how they work anymore, and you want to just go in and fix those up. Yeah, and that's a good point that we
29:18 mentioned earlier: sometimes we discuss SQL optimization, I mean optimization of queries in general, and refactoring, and those two can be done together, but for me, refactoring in my mind is making the code more readable and maintainable, which increases productivity and velocity, no matter
29:45 how efficiently it hits the database, which joins you are using, whether you are filtering on an index or whatever. And you can also refactor code just for optimization, that's completely true, if your performance is degrading. But that's basically how I see it, at least in my mind. I'm
30:09 curious to hear from the audience if you have a different definition of optimization versus refactoring. Refactoring can be done for optimization, but in my mind the easy way is often just to start refactoring some piece and making it more readable and maintainable. I don't know what you think. Yeah, I agree with that. I do kind of
30:29 feel like there's a tricky side to it, where you want to separate those two things sometimes as well, because if you start to refactor something and then you go down into the performance issues or all these other things, then you get sidetracked from
30:44 your original goal. Yeah, and then your PR is going to get really big, and it's going to get very hard to figure out what you're doing. So definitely, I think, know your motivations for refactoring, and determine: are you working on performance, or are you working more on formatting, readability, and maintainability?
30:59 Those are different things, for sure. So, you have a couple of best practices here? Yeah, and I will say some of these I have definitely borrowed from working more with dbt, so these are at least the best practices that dbt recommends, and ones that I've found have worked really well in my experience. The first one is
31:18 what dbt calls an 'alongside' strategy, rather than 'in place', and simply what that means is that you duplicate. In the previous example, where I mentioned we were moving from Postgres to Snowflake and bringing things over into dbt, I brought the model, the SQL query,
31:36 over exactly as it was, created a model called 'legacy' plus whatever the name of the model was, then made a copy of it and started working on the copy. The purpose of doing that is that you're then able to compare the previous state with the new refactored state, and it
31:52 allows for a lot easier QA. Then, when you're finished with that, if you have downstream dependencies, like BI tools or other things that are connected to that model, you can just cut over the source, and hopefully that's not too difficult in whatever tool you're using. That is generally, I guess I would
32:10 say, a more conservative or safer strategy that will probably save you some QA time, but it might make it a little bit more difficult to do the cutover at the end, depending on how many things are connected to that model.
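For reference, a minimal sketch of the kind of side-by-side QA this 'alongside' strategy enables (the model names here are illustrative, not the ones from the stream):

    -- Rows in the refactored model that are missing from the legacy one
    select * from order_details
    except
    select * from legacy_order_details;

    -- And the reverse direction, to catch rows the refactor dropped
    select * from legacy_order_details
    except
    select * from order_details;

If both queries return zero rows, the old and new models agree.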
32:24 Yeah, I think that's one of the things: when you're refactoring, you don't want to break things, right? So I think duplicating the model is often the thing I've seen happening so far. Yeah, makes sense. And then the next one, and this is maybe controversial for people who like subqueries, but CTEs, Common Table Expressions, are usually quite a lot more readable, so
32:48 I've worked with a lot of queries that have been nested subqueries, and it's very difficult to conceptually understand what's going on when you start to get more than a few layers into a nested subquery, even one layer, in my opinion, so breaking those out into CTEs can make it
33:05 a lot more readable. And if you name your CTEs very well, it makes it very clear: this piece of the code does this, and this is the name I gave it. I see a lot of times people will still write 'a'
33:19 as the name of the CTE, or 'b' or 'c' or whatever, or make them not very clear, so I think just be very clear: don't use single letters, use names that really explain what you're doing, and make it super readable.
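As a rough illustration of the rewrite being described (simplified, invented table names), a nested subquery like this:

    select c.customer_name, t.total_spent
    from customers as c
    join (
        select customer_id, sum(amount) as total_spent
        from orders
        group by customer_id
    ) as t on t.customer_id = c.customer_id;

becomes, with a well-named CTE:

    with customer_totals as (
        select customer_id, sum(amount) as total_spent
        from orders
        group by customer_id
    )

    select c.customer_name, ct.total_spent
    from customers as c
    join customer_totals as ct on ct.customer_id = c.customer_id;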
33:38 Now, formatting, yeah, that's one. I think the hardest thing about formatting is choosing conventions and then actually sticking with them. Lowercase or uppercase? Yeah, vote for lowercase or uppercase in the chat; let's start the war, leading commas too, let's start the war for a second. So, what's your wish, uppercase or lowercase?
34:02 I'm just quickly checking the
34:07 comments. We just had a comment on YouTube: since more users are coming into the data space, I think it's important to also think about refactoring as creating good abstractions that outsiders can easily grasp. Yeah, definitely,
34:27 that's very true. So,
34:32 also, 'CTEs are a reflection of how organized your mind is'. Yeah, very true, Anthony. I don't know how organized my mind is, Anthony, but I try. And so let's
34:46 just go back; I was quickly catching up. So, the last one:
34:53 modeling layers. Mhm, yeah, this one is definitely for folks who maybe aren't as familiar with tools like dbt; I wasn't as familiar with this before I started using dbt. This is really the concept of separating your transformations into more of a pipeline. So, having
35:15 a staging layer where you land your data, and that's just kind of a copy of your raw data; you might do some light cleanup, renaming of columns, that type of stuff, but you're not really doing heavy transformations in that layer, it's really just meant to be a connection to your data
35:29 source. Then the intermediate layer: if you need to do a pivot, or do some things that might get reused in multiple places, that can get pulled into an intermediate layer. And then the mart layer is really for the business user, like the end business concept of
35:45 what you're trying to achieve. If you're doing dimensional modeling, that would most likely be fact and dimension models, but this would typically be organized by area of business, so in our project at Secoda I put models together that are for the product team, or for the
36:02 marketing team, or sales, that type of thing. That gives you the ability, instead of having this huge query where you're staging the data, doing pivots and transformations, and then also preparing it for the mart layer, to separate those out and then reference them.
36:18 It makes a nicer DAG from left to right, if you're used to the dbt concept, rather than doing it all in one big subquery.
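In dbt terms, a minimal sketch of that layering might look like this (model and column names are invented for illustration):

    -- models/staging/stg_orders.sql: light renaming only
    select id as order_id, user_id as customer_id, total as order_total
    from {{ source('shop', 'orders') }}

    -- models/intermediate/int_customer_orders.sql: reusable aggregation
    select customer_id, count(*) as order_count, sum(order_total) as lifetime_value
    from {{ ref('stg_orders') }}
    group by customer_id

    -- models/marts/customer_revenue.sql: the business-facing concept
    select c.customer_name, o.order_count, o.lifetime_value
    from {{ ref('stg_customers') }} as c
    left join {{ ref('int_customer_orders') }} as o on o.customer_id = c.customer_id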
36:31 Yeah, that makes a lot of sense. Things to watch out for? Yeah, we talked about this on the first slide, but: avoid major logic changes. It's very hard sometimes, I find, when you refactor something and you're like, 'oh, I actually want to change this, and I want to fix that, and this would be better done this way'. It's better
36:47 to save that stuff for later, so as much as possible try to stick to the task at hand and think: I'm refactoring, maybe I'm taking a really long model and breaking it out into layers, or fixing formatting, or changing subqueries to CTEs. If you want to change how something's done, like a major
37:04 logic change, open up a Jira ticket or a Linear ticket or whatever, and come back to it when you're done, because if you start going down that path at the same time you're doing the refactoring, it's going to get pretty overwhelming pretty fast, and I've definitely made that mistake before. Yeah, it's
37:24 true, it's a common mistake with refactoring anything: don't get sidetracked. We talked about that, right? If you need to fix something, don't say, 'oh, I'm hitting this, so I'm going to fix that, and then I'm going to fix the initial thing', because then you're going down
37:41 the rabbit hole. So scope it to one thing, like 'I'm formatting' or 'I'm just replacing these types of joins to make it more effective', I don't know. There is also a meme shared a while back, I don't
38:01 know, it's someone trying to fix a light, and he goes to
38:10 where his tools are to bring a new light bulb, and then finds out that the shelf is broken, so he starts to fix the shelf, then he sees that he doesn't have the right tool, so he goes out to buy the tool to fix the shelf. It's the kind of thing
38:25 that's really common in software engineering and refactoring, to get sidetracked, so please don't. All right, we have around 20 minutes to start coding, so we prepared a project, a
38:41 simple repo. I'm going to
38:45 share the repo right now on both
38:52 our channels. Folks on YouTube, it failed to post the comment, of course, 'SQL refactoring live',
39:01 and here LinkedIn is being LinkedIn and doesn't want me to comment on the event, but you can have a look at it anyway after.
39:10 Let me try again. Yeah, it's gone. All right, so this is the repository. You can use a dev container; if you're not familiar, you can just open it in a Codespace and you have the environment ready. This is also one thing I always do when doing refactoring: you need to have
39:36 a development space, and it should be easily reproducible, right? I
39:42 think it's something that we haven't talked about so far, but the way that you set up your development environment so that anyone can contribute is also important. Do you have a few thoughts on that? Yeah, I mean, at least in my experience, I definitely don't have as much experience with the
40:05 container situation, but setting up a development environment in Snowflake, for example: we have our development warehouse, and anything that we do locally in development is always built into that warehouse. In a perfect world we would move over; I would love to play around with moving to MotherDuck, because obviously, when you have a
40:26 development environment on tools like Snowflake or Redshift or whatever, you're paying while you're testing locally. If anyone has followed my cost containment discussions, I'm pretty cost sensitive as a data team of one at an early-stage startup, so shout out to MotherDuck for that. But essentially we have
40:45 our dev warehouse and we have our production warehouse, so anything that we're doing locally is always built into that development environment, so that we can test it and play around with it. You can even compare, if you wanted to, the output of a legacy query to your refactored query; you can do all of that before pushing to
41:03 production. Yeah, and in the example we're going to use dbt and DuckDB, because again, it's
41:12 great to have a local development environment working, and you're not paying, especially just for the sake of the example here. So basically, here I have this long SQL spaghetti, right? There are a couple of subqueries, and we were saying, yeah, you're not just using AI for refactoring, you need to know what you want, right?
41:41 Well, I asked AI to generate me an ugly query; it was pretty easy, it did that pretty well, so who knows what the data is being trained on anyway. So this is the first query, it's not too long, and the first thing that we probably want to do is formatting, right? Mhm, yeah.
42:05 So, you mentioned a few tools; what do you usually use? So I usually use SQLFluff,
42:14 personally, for getting things to a state that fits my conventions, and I like it. I know there are a lot of SQL IDEs that have a button you just press and it will change the SQL to whatever you want, but in my experience that's not super customizable. With SQLFluff, if anyone's heard
42:30 of it, you can set up your own SQLFluff configuration file, and then you determine: what are my conventions, and how do I set up SQLFluff in a way that will enforce them?
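As an example, a minimal .sqlfluff configuration file might look like this (the rule choices shown are illustrative, not the ones used on the stream):

    [sqlfluff]
    dialect = duckdb
    templater = jinja
    max_line_length = 100

    [sqlfluff:rules:capitalisation.keywords]
    # pick a side in the uppercase/lowercase war and let the linter enforce it
    capitalisation_policy = lower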
42:48 That makes readability really the first thing in refactoring, which is important. I just wanted to share that image, and I'll share also the blog, which is a really great blog to read about refactoring; dbt also has a lot of resources I'll put out after. But this image, because I've seen this in person, people printing SQL queries, so this is a huge SQL query; all right, this is from before AI, not generated, someone wrote this,
43:17 and you need to have visibility to be able to start putting the pieces together and building CTEs; that's what we're going to do. So, SQLFluff is a really
43:32 great tool. You were mentioning IDEs; actually, a lot of IDEs have an extension that supports SQLFluff. VS Code has SQLFluff integration, so you can format there, you just have to configure it properly, and it supports multiple dialects, you just need to set the dialect. So
43:52 I'll use it manually, just to show you. So you just do a pip install, I think I already installed it here, but you just do a pip install sqlfluff, and then you have different commands in SQLFluff, let me just increase the font size: you can format and lint. Linting
44:20 is going to give you error suggestions if you break the conventions. We were saying formatting things is easy, right, but you could be defining the linting rules and saying, yeah, we allow that much line length in the SQL, etc. You can
44:45 actually define those rules, and I think you should spend some time on them; you can use the defaults, but you may not be happy at some point, so it's good to
44:56 define those, as you mentioned. So if you just run sqlfluff,
45:04 and then we do format, with the duckdb dialect, and we point it to our file...
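Roughly, the commands being run here are the following (the file path is illustrative; point it at your own model):

    pip install sqlfluff
    sqlfluff lint models/legacy/sql_spaghetti.sql --dialect duckdb
    sqlfluff format models/legacy/sql_spaghetti.sql --dialect duckdb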
45:16 I believe it's not finding my file... because it's just right there. And now, this is it: that's already much more readable than one line. I pushed things really hard with the one-liner, you know, but sometimes, when you're exploring things, depending on the interface, you might get a query like
45:42 that. Anyway, so now, what do we have?
45:46 I know you like this one: subqueries. I hate subqueries so much. So we
45:55 have subqueries, and we have a couple of joins,
46:02 and so basically, what would be your advice to move on from there, say we start a simple dbt project? Yeah, so the pattern that dbt usually recommends following is to separate your query into what they call import CTEs versus logic CTEs. Okay. So in this case I would
46:25 probably identify all of the source tables that I see, and I would move each of those into their own import CTE. I'd probably start with customers, maybe, or orders, it doesn't really matter in this case, and I'd move those into CTEs at the top first, and then we can take some of those joins
46:43 out and start to move some of that stuff around.
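A minimal sketch of that import/logic CTE pattern, against the TPC-H tables used later in the demo (the exact source and model names are illustrative):

    with

    -- import CTEs: one per upstream table, selection only
    customers as (
        select * from {{ source('tpch', 'customer') }}
    ),

    orders as (
        select * from {{ source('tpch', 'orders') }}
    ),

    -- logic CTEs: the actual transformations
    customer_orders as (
        select o_custkey as customer_id, count(*) as order_count
        from orders
        group by o_custkey
    )

    select c.c_name as customer_name, co.order_count
    from customers as c
    join customer_orders as co on co.customer_id = c.c_custkey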
47:03 So, the good news is that I already did that, but she didn't know; I wanted to know if I was a good student. But one thing, maybe, we were talking about duplicate or replace, right? So, maybe for people not so comfortable with dbt: what I did is create a
47:10 legacy folder, and I just put back the same query within dbt, and the only thing I did is specify my source with a sources YAML file that defines where those tables are. Because, and I'm curious what your feeling is, for some people who have never touched dbt it can be overwhelming with all
47:38 the configuration, don't you think? Mhm, yeah, it's definitely a learning curve. For someone who wrote SQL for many years and then started using dbt, at first I was kind of like, what's going on here? But then you realize it's mostly just SQL, and you have to understand a little bit more about
47:56 variables, and start to understand Python a little bit better; not necessarily, I don't write Python very well at all, so you can definitely get by, but I think having those concepts in your mind, being more component-focused and object-focused, is definitely useful. And so here, basically, for those
48:15 people not comfortable with dbt, quickly: we are just replacing the hard-coded table name with a reference that we define in the sources YAML, so it refers to a specific source that points to a table. That's a Jinja template, if you're familiar with Python, so this basically gets parsed by the dbt engine, which then executes the final query.
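Concretely, the swap being described looks something like this (the source and table names are illustrative, not necessarily the repo's exact ones). The YAML file declares where the raw tables live:

    # models/sources.yml
    version: 2
    sources:
      - name: tpch
        tables:
          - name: customer
          - name: orders

and in the model, the hard-coded table name becomes a Jinja call that dbt resolves at compile time:

    -- before: select * from customer
    select * from {{ source('tpch', 'customer') }}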
48:40 Something I want to touch on: are you familiar with the TPC-H extension in DuckDB? I'm not sure if you're familiar with DuckDB itself. I'm getting familiar with it, but not as much. All right, I'm getting a bit sidetracked here, but I felt it was interesting: I needed a new dataset
49:02 for this example.
49:06 Let me go here, so I'm just at the root, and I'm going to launch DuckDB. By the way, if you open the dev container you have everything ready: dbt, the DuckDB CLI, it's all installed for you with just one click. So you run duckdb and then pass, let's say, example.db, so it's going to persist my
49:30 example database over there; that's just what it's going to do. What's super nice with DuckDB is that they have extensions, tpch and tpcds,
49:42 and those datasets are used for benchmarking. You can generate them pretty easily with a scale factor: oops, you just call dbgen, and the
49:56 scale factor is basically how large you want your dataset. Here, we're going to try it out, but I think it's roughly 100 megabytes or a bit less, and some tables are going to contain about a million rows. So it's done, and now, if I'm
50:17 doing a SHOW TABLES, I have all those tables that have been generated quickly, and if I do, let's do
50:26 a SELECT count(*) FROM lineitem, yeah, we have roughly, yeah, a bit more than one million rows in there, right. And this is taking, yeah, okay, 50, a bit less than I
50:46 expected, around 50 megabytes. Okay, so this is what I did to generate the example: all the queries are based on the TPC-H dataset queries.
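For anyone following along, the DuckDB session looks roughly like this (the scale factor is a guess at what was typed; adjust to taste):

    -- launched with: duckdb example.db
    INSTALL tpch;
    LOAD tpch;
    CALL dbgen(sf = 0.2);            -- generate all TPC-H tables at a small scale factor
    SHOW TABLES;
    SELECT count(*) FROM lineitem;   -- roughly 1.2 million rows at sf = 0.2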
51:01 And this dataset is actually there locally, so there is already a database there; it's a DuckDB database, one file that contains all the tables. That's also super cool, because if you think about refactoring, as we were discussing just before, you can also get a sample of all your data into one single file locally, and then you're
51:24 ready to basically iterate and do all this refactoring locally. Isn't that beautiful? Yeah, awesome. So, yeah,
51:36 coming back now, sidetrack off, to the refactoring. So, as you mentioned, I
51:43 did this one, and I just copied the legacy
51:50 SQL and replaced it. As a reminder, what we did so far is that we linted using SQLFluff,
51:58 and then copy-pasted it into here and replaced the variables. So that's the first thing: you can see, okay, everything is still working, and now I'm using a dbt project and my code is formatted.
52:14 What was next was to split it into
52:19 CTEs, right? Yep, or, I guess in this case it kind of depends, if you're talking about modeling layers as well. In the best-case scenario you would start with creating staging models; you don't really want to connect directly to sources in dbt unless it's in a staging model. So I think that's probably what
52:39 you've got, maybe, on there on the right-hand side? No, I don't, because I was a bad student, you know, I cannot always be a good student, that's the thing. So you need to make your staging models first. Okay, so you would do a staging folder, for example, and
52:58 oops, we want a folder, not a file, right, that's not going to work otherwise. Yeah, of course, of course you're not
53:10 happy, I do that every time. You have,
53:14 yeah, you make a staging folder, and
53:19 you would start with, like, a customer? Yeah, I'd probably start there first: make a staging customer model where you essentially have customer key as customer ID, name as customer name; you would move all of that into your staging model, so that you only touch the source once, and then you're doing
53:38 all that renaming and formatting essentially in that first layer. Okay, I see. It's funny, because what I'm really comfortable with is what's called the medallion architecture, you know, bronze, silver, gold, but when using dbt and SQL it doesn't come to me as naturally. I
54:03 think it's because it's still SQL. Typically, in the bronze area, what you do is you have a landing zone where you don't touch the file, and then you just convert the file format, usually to make it more efficient to read, typically to Parquet and so on. But here, in the staging area, you would
54:22 do, as you say, renaming things, basically, and just select the columns that you need, that's really it? Yeah, typically that, and then there might be some light cleaning if there are things you want to remove, but usually you would save that until the next layer, potentially.
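A minimal staging model along the lines she's describing, using the TPC-H customer table (the column selection is an illustrative guess):

    -- models/staging/stg_customers.sql
    select
        c_custkey as customer_id,
        c_name as customer_name,
        c_mktsegment as market_segment
    from {{ source('tpch', 'customer') }}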
54:41 Okay, we have four minutes left, so I'm not going to go into that, just summarizing what the takeaways would be. And so here we have another CTE; so for example, you would create staging models, basically, one for each source? Yeah, I guess in this case there would be one, two, three, four models that I would create. Four, yeah, four staging models. So
55:07 four staging models, and then the logic. Yeah. And so here, basically,
55:13 I kind of bypassed that, but then there is just a WITH statement and a couple of joins, but each query has a bit fewer joins, so I basically split it. If you look here,
55:29 those subqueries, right, are their own,
55:35 their own models; so we have the order details here. And so
55:43 then we have the final one, which relies on the customer details: the order line details. I'm referencing the other model here, which is the CTE, and that's the final dataset that I want. That's roughly it. Coming back to what we said, of course we can do much more in
56:08 terms of... do you have any advice regarding joins, like how do you tackle those? Yeah, usually I try to do all of those near the end of the query if I can, unless there's a requirement, like it's a performance thing, or there's something that I need to filter down
56:27 before doing some kind of transformation. But typically, I notice a lot of times people will join the same concept multiple times in a query, and if you just wait to do all the joins near the end, you're saving yourself; it should give you some performance gains, and it also just makes it a lot easier
56:40 to read, since you don't have to do things multiple times. Yeah, that's generally what I recommend.
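A sketch of that 'join once, near the end' shape (names are again illustrative): derive each concept in its own CTE, then bring everything together in a single final select.

    with order_totals as (
        select customer_id, sum(order_total) as total_spent
        from {{ ref('stg_orders') }}
        group by customer_id
    ),

    order_counts as (
        select customer_id, count(*) as order_count
        from {{ ref('stg_orders') }}
        group by customer_id
    )

    -- all joins happen here, once, at the end
    select c.customer_id, c.customer_name, t.total_spent, n.order_count
    from {{ ref('stg_customers') }} as c
    left join order_totals as t on t.customer_id = c.customer_id
    left join order_counts as n on n.customer_id = c.customer_id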
56:49 And if you want to run things, by the way, we haven't touched on it, but dbt also now has an amazing feature for testing, which should help you refactor and do
57:03 those small test cases and validation.
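For context, dbt's built-in tests are declared in YAML next to the models; a minimal sketch (model and column names are illustrative) would be:

    # models/marts/schema.yml
    version: 2
    models:
      - name: order_line_details
        columns:
          - name: order_id
            tests:
              - not_null
              - unique

Running dbt test then checks those assertions, which is handy for validating a refactored model against the legacy one.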
57:30 But basically, what you need to do after is just run a dbt run in the right folder; 'SQL ravioli' is my dbt project folder, because ravioli is a bit less messy than spaghetti, right? Yeah, but I do love eating both. Anyway, so if
57:34 you run dbt run --target dev, you see that I'm pointing to the local database, right? Let me quickly open
57:46 this database, which is in the SQL ravioli data folder, tpch, so this is basically what we
57:57 had, and you see, you already have the models, the different models; there is the source also over there; and this is my SQL spaghetti, which is the original one, right. And so, yeah,
58:11 basically, here the source data is there along with the output, because it's just one DuckDB database. And then, when you're going to prod, basically you just specify, here it's MotherDuck, and if you have a MotherDuck token, you can, so sign up for free to MotherDuck, you put your
58:31 token in an environment variable, and then you can switch basically to the cloud.
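The local-versus-cloud switch being described lives in dbt's profiles.yml; with dbt-duckdb it might look roughly like this (the profile name, database names, and paths are illustrative):

    sql_ravioli:
      target: dev
      outputs:
        dev:
          type: duckdb
          path: data/tpch.db   # local DuckDB file
        prod:
          type: duckdb
          path: md:tpch        # MotherDuck; picks up the motherduck_token environment variable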
58:37 So right there, this is clear; basically we're talking about developer experience, where it's actually much easier. Nothing to sell, but it's really hard to simulate a Snowflake locally, or BigQuery; BigQuery has an emulator to run locally, but it's really hard. And so, just to show you, if I
59:00 run this, it will run over the dataset, and of course it's not working, because I have the database open at the moment, so there is a conflict. So let's run it again. It's done. So
59:15 you see that was super fast, right, and we still had some tables with a million rows, if you remember. And
59:24 basically the dataset is not small compared to this. And if I want to inspect again, this is my legacy table, the final table,
59:42 which is not refactored, and then we have the order line details, which
59:53 was this one, right, mhm, which was refactored using the different CTEs. Yeah, and that's roughly the same
60:07 dataset. Yeah, so that was it. There are so many things we could do regarding refactoring; I think we were just scratching the surface, but still. One key thing that you would say, what's the easy thing people should definitely start thinking about when refactoring, that's a no-brainer for
60:33 you? Yeah, I mean, I think for me it probably goes back to separating the models out based on the stage, so having staging, intermediate, and core layers is really going to clean things up, especially if you've got a couple thousand lines in a query. And then the CTEs; there's
60:51 been some discussion in the comments about this, but yeah, just naming CTEs, making it really clear what your code is actually doing so that people can follow along. Sometimes writing the code in a way that's more readable is better than writing comments. Yeah, because someone said 'the code is the documentation',
61:11 yeah, exactly, and there is a lot of debate; people say your comments are going to be out of date more quickly than your code, because often people update code without updating documentation or comments. That's true, but I feel like it's a fine line
61:31 to find. To finish on: what are you most excited about for the coming months, in data for you, or just in general in the technology or the space? That's a really good question. Yeah, I think, as we were talking about earlier, the AI
61:50 piece. For me, I'm interested to see that become more useful. Right now, to me, it's still very conversational: you ask a question, you get a response back. I'm interested to see more of that turn into something that can actually do things for you that
62:04 are a little bit more automated. So I think just seeing it get applied in more useful ways is definitely going to be pretty impactful in the data space. Cool. Thank you so much for joining us. Quack & Code is happening every other week; in two weeks I will have another guest where we quack and code.
62:25 Don't forget to go check out Lindsay's podcast, Women Lead Data; it's an amazing initiative, and I was super happy to have you. I think we never met one-on-one virtually; we first met in person, is
62:45 that right? Yeah, I think it was at Coalesce last year. Yeah, so it's a weird thing, I mean, we've been collaborating async with MDS Fest, right, but there was this weird moment where I actually saw Lindsay first in person and only then online. So this also can happen, you see, it's true, people do meet
63:03 in real life sometimes. Yeah. All right, have a great day or evening wherever you are. Thanks for having me. And I'll see you around, bye everyone!