Lakehouse or Data Warehouse – Ep. 305
In this episode of the Explicit Measures podcast, Mike and Tommy explore whether you should build on a Lakehouse or a Data Warehouse in Microsoft Fabric. They break down practical trade-offs and how to choose the right fit for your organization.
Main Discussion
Mike and Tommy compare Lakehouse vs. Data Warehouse choices in Fabric with a practical lens: the right answer depends on workload type, team skill set, governance needs, and how downstream reporting will consume the data.
Key points from the conversation:
- Workload fit: warehouse-first tends to shine for SQL-heavy, structured reporting workloads; lakehouse-first fits better when you need Spark/engineering patterns, semi-structured data, or data science workflows.
- Modeling + performance expectations: the team talks through how your modeling approach (and how you plan to serve semantic models) should influence the decision.
- Team workflow and repeatability: the “best” choice is often the one your team can deliver consistently and support over time.
- Governance + reuse: where you standardize transformations and business logic determines whether you get reuse—or a pile of slightly different copies.
- Start simple: pick a thin slice, prove the pattern, then expand; don’t over-architect before you’ve validated the consumption path.
Looking Forward
If you want a deeper dive on when to use each approach, send us your scenario and we’ll cover it in a future episode.
Episode Transcript
0:30 Good morning and welcome back to the Explicit Measures podcast with Tommy, Seth, and Mike. Good morning everyone, good morning, and happy Tuesday, gentlemen. It is a happy Tuesday. Jumping back into things here: just letting everyone know, this is a pre-recorded episode, so for those of you who are joining live, FYI about that. Real quick, our main topic for today will be talking about Fabric, focusing on two aspects of Microsoft
1:00 Fabric, the Lakehouse and the Data Warehouse: thinking through what they are good for, and where we should potentially be spending our time in building our analytical solutions nowadays. So that'll be our topic. Before we get to that, let's do a couple of news articles. Tommy, over to you. Yeah, so there's actually a new white paper in the works by Marco Russo and Alberto Ferrari, and it's all about visual calculations. It's still
1:32 very much in draft, but it's basically the article on visual calculations. It's in draft? Did they say it's in draft? Yeah, the first box says so. This is their white paper you're talking about? You're saying this is a white paper? You didn't explain that very well. I said "white paper" at the top. Okay, got it, I'm with you: a white paper that's in draft. Maybe Mike needs to wake
2:03 up a little more. I think I need more coffee. I thought it was a white paper that Marco Russo wrote. There we go, Ferrari. You said white paper, I heard website. Oh boy, this is going to be an interesting episode already. All right, let's start over. Tommy, you got it. Okay, so SQLBI, Marco Russo and Alberto Ferrari, ciao friends, have created a page for a white paper that right now has the structure of what will be a very complete set of
2:36 of what will be a very complete set of education and understanding about how visual calculations work right now the content only has really the agenda and some introductions but if you’ve noticed SQL bi’s website their blogs there’s been a lot about visual calculations I think on what day was it on March 16th Marco Russo visual shape visual C calculation so they’re doing a lot more articles and I think they’re going to put all this into the white paper so this is pretty I don’t know this
3:07 pretty I don’t know this significant if Marco Russell and Alberto Ferrari are spending the time to do the white paper they’re obviously seeing the need probably with their community and who they’re talking to they still don’t exist in the model it is Dax but we’ve had a lot of conversations about this but I think this is a Telltale sign to me the fact that they’re going ahead and putting out a lot of time into visual calculations well they’ve always been very proponent for this yeah that’s what I was going to
3:37 for this yeah that’s what I was going to say this is not a surprise to me they’ve been super advocates for visual calculations for quite a while now I calculations for quite a while now anything what I would say is what’s mean anything what I would say is what’s not a surprise is everything relating to a model or how something operates with Dax they are all over all the time so I’m I’m super excited that they’re getting engaged here because it is it is different right how how visual cels work within that framework and having
4:08 within that framework and having explanations especially in the detail that they provide is going to be extremely helpful for the community for sure now the interesting interesting thing here as well with this article is you don’t get this white paper you can’t go get it you have to be a subscriber of their SQL bi Plus Membership in order to get the white paper so they’re going to talk you through the ideas here but they’re not releasing the white paper out to non SQL plus sqlbi plus members which is interesting as
4:39 well. Even just walking through the items they have listed in the outline, these are the modules that are going to be handled inside the white paper: what is a visual calculation, what are window functions, understanding the visual shape (which is interesting), creating new context columns, and so on. I think this will be very useful. And if you think about what visual calculations can do: it's an after-effect of once you've
5:09 already produced the aggregations across the entire model. It can simplify or reduce the load that you need to compute, because it's now only computing inside the context of a visual. What do you guys think: is visual calculations all hype, or a real thing? Go ahead, Tommy. At first I think I was a little conservative in terms of thinking about what visual calculations are
5:40 going to do, because, again, any calculation you create only really lives in that visual; it's not actually shown in the model. So we're talking about the lack of metrics, but, I don't know. Think about report-level measures a while ago, and all the needs to create design-based report measures: to show trending, an arrow. There are a lot more use cases where I don't want to blow up my model,
6:12 especially if it's a simple "just show me previous over previous", or growth, or a comparison in that visual, very much like Excel. Maybe that is never going to be part of a certified semantic model, but it's just a visual cue. Honestly, now that I'm looking more and more at this, and I've seen a lot of people writing about it, I'm still wondering about all the different use cases, or when it makes
6:43 sense to do it. Does it go on your priority list? Just like right now we go: summary page, drill-through, we always need a drill-through page. Will visual calculations have that same precedence in my checklist? I don't know. There are two things I think about when I think about visual calculations. One is the governance and control of it, just from the standpoint that people can write whatever measure they need. On the other hand, I think of it from a scalability and speed standpoint. If you have fact tables that
7:15 have billions of rows in them and you're summarizing things across those rows, it doesn't make sense for you to have to build complicated measures to get this-period-versus-last-period type calculations, or one row minus the second row, like sorting and ordering things inside the visual. So I think from a performance standpoint this makes a lot of sense. My hesitation is: how discoverable will these be, and are they going to potentially change the models that the central BI team
7:45 produces and shift them away from something that is centrally modeled to whatever the user needs? Again, I feel like the use case is strong enough to say yes, this is valuable. Where I'm still hesitant right now is: what's the best way to monitor and see what's going on in these visuals, and what is the performance impact on rendering the report for end users? If an end user comes to me and says, hey, my visual is slow, okay, great, I can run the query and see what it does; and if the engine is
8:17 doing it fast, what did they do in the visual calculation that made it slow? Something else here that I would like to explore a bit more: if you're in the context of a single visual, there may also be something here around whether I can do SVG-type things inside those visual calculations as well, adding graphics or images or line charts or little bars that I need inside the context of that cell in the visual. So that'd be another one to
8:48 explore too, but again, you're now really stylizing one particular visual.
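To make the "previous over previous" point concrete, here is a minimal Python sketch (made-up numbers, not actual DAX) of what a PREVIOUS-style visual calculation conceptually does: it only ever touches the handful of rows the visual has already aggregated, never the billion-row fact table behind them.

```python
# A rough sketch (not DAX) of a "previous over previous" visual calc:
# it runs over only the rows already aggregated into the visual.
# The data here is invented for illustration.
visual_rows = [  # what the visual already shows: (month, sales)
    ("Jan", 100), ("Feb", 120), ("Mar", 90), ("Apr", 135),
]

# PREVIOUS-style calculation: each row looks one row back in the visual.
with_delta = []
for i, (month, sales) in enumerate(visual_rows):
    prev = visual_rows[i - 1][1] if i > 0 else None
    delta = sales - prev if prev is not None else None
    with_delta.append((month, sales, delta))

for row in with_delta:
    print(row)
```

Whatever the engine had to do to produce those four aggregated rows, the month-over-month delta itself is computed over just four values, which is the performance argument made above.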
8:52 And I think the extent here is that you're not going to reuse that calculation across multiple visuals on the page. Yeah. So I think a lot of our conversations are always bent towards what a business user is building, because we're going to need to make sure that we can scale it up, and it's not a bad thing; it's constantly going to be the central BI team's responsibility for performance, etc. Sure,
9:22 but one of the things with visual calcs that I think is crazy amazing is that it unblocks some of those really big DAX cliffs for new users. It's a running joke, right: right after your normal sum or average, somebody goes, "that's great, I'd love to see that over time", or "how can I compare that?", and then you're instantly into time calculations. So
9:53 yeah, the fact that they're supporting those functions in a different way in visual calcs unblocks that, and instantly brings, I think, a whole lot of power for end users to get a lot of value out of the insights of month over month, or year over year, or whatever that is, in comparing data sets. That has otherwise been extremely difficult, and many don't power through the DAX
10:23 that you're automatically thrown into to be able to do that. I think the value alone sits there, and while I agree it certainly remains to be seen how we figure out how to digest, or just understand, what's going on on a report page as a whole, there are a bunch of community projects where we're dialing in all the components of a Power BI report. Hey, it keeps
10:53 us interested, right? I feel like the only miss that I see right now for visual calcs is the ability to add a library of visual calcs that you can pick from, supplied by your central BI team. To me, that's the major missing part, because there are going to be a lot of really creative things people are going to build with visual calcs, and people are going to want to extend it. It's not going to be just a simple "column one minus column two". People want examples of things that are there, and
11:23 the fact is that right now you have to literally go find the formula, copy-paste it in, and then use it. I think it should be more open-ended to the community. For example, Marco Russo is going to put out, I don't know, 30 or 40 examples of how to use these things all over the place, so I should be able to pick from his examples and easily load them into my visual. That's what I want, because,
11:53 again, this is what I find a lot with Microsoft's new products and new features: I don't understand how to use it, so I have to spend time learning it. I need to accelerate my learning path to get to use the feature, and then I can understand, okay, this makes sense. Yeah. Awesome, with that, let's jump into our main topic for today. The main topic today is from Sam; I'm maybe not going to do the last name the best: Debruyn. Yeah, there you go, that's how it goes. Okay, from Sam, and
12:24 we're talking from Sam's blog. It was published in 2023: "Fabric: Lakehouse or Data Warehouse?". I think it was October 19th last year when the article came out. So let's talk through what the difference is here. Tommy, maybe give us a brief summary of the article and an overview of where you want us to start. Sam's been doing an awesome job; his blog has had great Fabric content, especially when it comes to data engineering and
12:55 understanding the Lakehouse. He's been on a roll here. This article is all about the difference between the Data Warehouse and the Fabric Lakehouse: which one should you choose, should you choose between one or the other, and when do they make sense together or apart? There were some really intriguing things he was talking about regarding what a lake
13:26 house is: I think the conception is maybe it's just a blob, but it's the back end of it, the engine that runs a Fabric lake house, and what we can do with it. Personally, I'm looking at this and, man, this is still very open-ended on business teams and the self-service side, but we'll get into that. Honestly, just to give the summary: which one do you choose,
13:56 the use cases for using both, and really what the difference is from the engine point of view. Oh, go ahead, Seth. Well, I think before we dive into what the difference is in Fabric, let's talk about (that's where I was going) the individual tools out there pre-Fabric, where the vast majority of the major differences revolve around the backend systems. You have the lake house, which
14:31 is predominantly driven, I think, by Spark engines, where you have distributed compute on top of Delta Lake tables, which is what I'm familiar with. Yeah, in Databricks we've been doing this for four or five years: distributed compute over unstructured Parquet files that have structure because of the Delta Parquet format. Your interface with a lot of this
15:03 data: typically you're in this route because you're consuming a lot of unstructured data that you need to transform in that platform, and it's easier to do that with the distributed notebook system that is presented, writing into Delta tables and reading from them in various languages. This is a really high-level way to describe these things, but the SQL DW is
15:35 flattened data. It's completely different storage, from the standpoint that predominantly almost all of the interactions that you or I have are Azure SQL based. So you're in memory; you do not have the same type of distributed system the way Spark clusters work, and it doesn't handle unstructured processing of data in the same way. So it's fantastic, and always has been, for very well-formed, very
16:06 structured data: moving from one structured place to another structured place for consumption. So I think it's all about the differentiation between those systems and when they are best used. There are always ways you can stretch the bounds of SQL DWs; they can go extremely large, but they still can't go as large as what you can do in the big,
16:39 unstructured data ecosystems of Spark. So I want to pick on your comment there, not in a bad way, Seth, but I just want to understand a bit more where you're going with some of these things. You're talking about traditional versions of SQL, correct? Mhm. Okay. The reason I'm making that statement is because I believe Microsoft went through a period of time where they tried to change how people view SQL, and they went through this evolution of the MPP architecture for SQL. So SQL turned
17:09 into SQL Data Warehouse, which was, what was it, massively parallel processing? Yep, MPP. So now you get all these different SQL engines; basically it's like a chain of SQL servers that act more like Spark, where I have a lot of executors that can go run queries. And I guess my mindset originally with SQL was that SQL had a contained compute and storage engine, all combined together,
17:40 the thing that was all bolted together. And there are still different flavors of Azure SQL, right?
17:46 Yes. So I think this is where the distinction is different now, because when Microsoft made Synapse, they were still using the MPP architecture in SQL, but inside Synapse they started saying: okay, SQL Data Warehouse is too expensive to run, so what we'll do is figure out a different way of charging you, billing you for how many files you touch. This is actually a very Google-type approach. In Google, when you use Google Cloud to write SQL against things, you can write a SQL statement, it hits files, and it counts how many files am I going
18:17 to hit and what's the size of those files, and then it charges you on the read, the number of gigabytes per query. So every query costs you something, but they basically own the architecture, the infrastructure, and say: look, I'm going to own this, and I'll return you results for a certain amount of dollars. So you get, and I don't know exactly what it was, something like a terabyte for free, and then anything above that you get charged a buck per terabyte, or something silly like that, I don't know what it is. But the idea is, Microsoft had the same thing; I think it was something like $5 per terabyte read
18:50 of information. It would read the files, track how much data was being read based on your queries, and then charge you for how many terabytes of read data that was. To me that was a good stepping stone, because that way it's always on; it's this more SQL-serverless experience: I don't want to pay for the server. To me, the big mindset shift here was a payment-strategy change in these new SQL serverless-type engines.
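The pay-per-data-read model being described can be sketched in a few lines. The rates below (1 TB free, then a flat price per TB scanned) are the approximate figures mentioned on air, not current list prices for any product:

```python
# Sketch of serverless, scan-based billing: you pay for bytes read by
# each query, not for a running server. Rates here are illustrative.
FREE_TB = 1.0          # assumed free allowance
PRICE_PER_TB = 5.00    # dollars per terabyte scanned (as quoted on air)

def monthly_cost(tb_scanned_per_query):
    """Bill a month's queries on total terabytes scanned."""
    total_tb = sum(tb_scanned_per_query)
    billable_tb = max(0.0, total_tb - FREE_TB)
    return round(billable_tb * PRICE_PER_TB, 2)

# Ten queries that each scan 0.3 TB: 3 TB total, 2 TB billable.
print(monthly_cost([0.3] * 10))
```

The design point is that every query has a cost proportional to the data it touches, which is why compressed, column-oriented storage (discussed next) matters so much under this model.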
19:20 And then under the hood, Delta, or Spark itself, isn't the technology that's super slick here; I think it's the Delta table, honestly, it's Parquet. We're talking about storing and compressing lots of data. The challenge, at least what I've observed, is that in order to get fast performance you want to be able to store a lot of data in compressed formats and query the data while it's compressed. So if I need to read 300 gigabytes of data versus 50
19:50 gigabytes of data, anytime you're reading the smaller amount of data it'll be faster. So to me the big gap here, the big technology leap, is this: SQL is traditionally thought of as a row-level store, you're storing every row of data, and in Databricks or Spark, or, let's be very specific here, the Delta format that Spark reads, data is in Parquet format, which is a column store. So the big aha moment for me was: oh, it's not just SQL versus Spark, it's column store versus row store.
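A tiny Python sketch of why that matters: in a row store, a query drags every field of every row along, while a column store can read just the columns the query names. The table here is invented, and real engines add compression and I/O details on top:

```python
# Row store: each record is stored whole; reading one column still
# touches every field. Column store: each column is stored
# contiguously, so a query reads only what it asks for.
rows = [("widget", "2024-01-05", 9.99),
        ("gadget", "2024-01-06", 24.50),
        ("widget", "2024-01-07", 9.99)]

# Column-store layout of the same table.
columns = {
    "product": [r[0] for r in rows],
    "date":    [r[1] for r in rows],
    "price":   [r[2] for r in rows],
}

# "SELECT SUM(price)": the column store touches 3 values,
# the row store touches all 9 fields.
total = sum(columns["price"])
fields_scanned_row_store = sum(len(r) for r in rows)
fields_scanned_col_store = len(columns["price"])
print(total, fields_scanned_row_store, fields_scanned_col_store)
```

Scale those 3-versus-9 field counts up to billions of rows and the 300 GB versus 50 GB scan difference described above falls out directly.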
20:21 And that was the big difference for me. I think there are a lot of patterns that evolved out of this row store. Now, one note in this article that I take slight issue with. We're talking about what the SQL data warehouse is, right; the SQL data warehouse is now more of this SQL serverless thing, and it can read the lake house. And in the article, I think towards the bottom,
20:52 on the Python side, it says: with the lake house and Python, there is no statement like insert, delete, or create-table-as when you're talking about the lake houses. I'm not sure I agree with his assumption there around the Lakehouse; I'm pretty sure that with the Spark engine or the SQL engine you can read and write tables inside the lake house. So that's actually what he goes
21:24 into right after, and Mike, I was going to bring that up too. He makes the argument that a warehouse in Fabric is actually a data lake house on the back end, because of the T-SQL abilities, all the normal tools; and for the Lakehouse he makes the argument that it's read-only, you can't use statements like SQL insert or delete, and that's where the Apache Spark engine and pandas come in. But to use SQL to write, I think
21:58 that's what he's referring to: at least in the lake house there are no T-SQL statements that let you insert or update tables. I'm pretty sure it's available; you just can't do it in T-SQL. Yeah, you can insert with Spark SQL, correct? Yeah. Not Transact-SQL, but that's a different engine; it's splitting hairs. Yeah, to me the comment feels misleading, is what I would say. If we're only talking
22:28 Transact-SQL, yes, it doesn't have the exact same language as Transact-SQL, but it has Spark SQL, which does the things he's asking about: insert, delete, or update. So to me it's a moot point; I don't even understand why that statement lives there, because you can still do those things inside Spark SQL. So I don't agree with that comment. Anyway, beside that point: other than that, they're basically the same engines, able to do the same things.
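For reference, this is the kind of Spark SQL DML the hosts are describing, which a Fabric notebook could run against a Lakehouse Delta table. The table name is invented, and the statements are collected as plain strings here so the sketch runs anywhere; in a notebook each one would be handed to `spark.sql(...)`:

```python
# Delta tables support full DML through Spark SQL, even though the
# dialect is Spark SQL rather than T-SQL. Plain strings here; in
# Fabric you'd execute each with spark.sql(stmt).
statements = [
    "INSERT INTO sales VALUES (1, 'widget', 9.99)",
    "UPDATE sales SET price = 10.49 WHERE id = 1",
    "DELETE FROM sales WHERE id = 1",
]

verbs = [stmt.split()[0] for stmt in statements]
print(verbs)  # the DML verbs a T-SQL developer would recognize
```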
22:58 they’re basically able to do the same things right they’re both able to query the data read it out and they’re able to insert update or write data back to the Lakehouse using that same Delta format so here’s something interesting if you want to go into the weeds there’s a PDF that’s actually the engine like is Fabric lakehouses and the back end just synapse and it’s really not apparently Microsoft’s been working Polaris they pulled all of fabric is
23:30 of fabric is so i’ I’ve said this one before and I’m pretty sure this is how I look at it I think of fabric has now the best parts of synapse reworked retooled a little bit made a bit easier because they they’ve redone the UI on some things but it when you when you saw original initial things of it it was like powerbi synapse data engineering it was synapse something so they change they even changed their labeling of things when they started out doing things with fabric so now that they’ve gotten rid of the synapse portion or the labeling of it they’ve just dropped all
24:01 labeling of it they’ve just dropped all the all the language of synapse pieces and so now this is just synapse merging into powerbi and now called it Fabric in in addition to a lot of other new things that they’ve added too right so it’s not just the synaps engion they’re pulling forward so I would agree with you Tommy like I think yes this is just synapse moved into into powerbi now so as we dive into them Sam goes into his recommendations The Medallion approach utilizing lake houses warehouses one thing that I’m really
24:33 thinking about, for, call them the normal user, is the medallion architecture, because it would in a sense be so easy to do in Fabric. Is it still for those advanced users, or would you consider utilizing the medallion architecture, with the Lakehouse and the data warehouse, for self-service, just like we're doing managed self-service now? Do you see any potential there? Because, as
25:03 we see all of this, yeah, it's all in Fabric, this is great; but the Delta Lake, all the things that you talked about, for a lot of people, Mike, that probably goes over their heads if they've just been in Power BI or they're coming from the Power Platform. Obviously, data engineers, we welcome you, but for a lot of people in the Power BI world, is that going to be something that meets that expectation? I think what you ask is a good question, so let me give you some context on where
25:34 I think this is. I think it's a relevant question, and there are some nuances to what you're asking about. One thing I have to be mindful of here: if I'm coming at Fabric through the lens of a SQL developer, I think about staging tables and then final tables. I'm going to bring the data in, stage it, prepare what I want to prepare, get it right, and then put it into my final table. So I think a
26:04 SQL developer would think that way, and Seth, you can confirm. Right, as the SQL engineer I would show up, make those staging tables, and then produce the final tables that I would do my data in. Then, if you come at this from a business user's standpoint, they have no context of staging and final tables, and they have no concept of bronze, silver, and gold. Now, if you bring in a data engineer who comes from Spark, they would have the context of
26:34 bronze, silver, and gold, because that is more of a Spark, big-data type concept. And to me, the reason why you do SQL staging and final tables is because you want to blow away the data in the staging tables. You're not looking to keep all the individual records, because in SQL Server, when I had storage and compute inside the SQL engine, it was very expensive for me to keep all the data every time I'm loading it into my main table. That's because the storage was expensive relative to the compute
27:06 that I bought on-prem. I didn't want to buy a 100-terabyte SQL Server, because that meant I had to buy really big machines to be able to store all the data. It wasn't an efficient store of information; I couldn't store the data very easily, and because of that, volume of data became a problem. That's why we tried to get rid of everything we wanted to lose, and I think that's also where change data capture came from. CDC came from that SQL mentality of: okay,
27:36 instead of storing every single record every time I load it, let's just compare my existing records to the new data that's showing up every day and only take the changes. That's where the whole change-data-capture design, I think, flowed from; it's an efficient way of storing every changed moment when a record is adjusted. It isn't, though; it's creating efficiencies between your staging and production tables. If I don't have to process an entire day and only have to process the changes, that's going to reduce the time of updating any facts or dimensions downstream.
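The compare-and-keep-only-changes idea is easy to sketch. A minimal Python sketch of CDC-style diffing, where the records and key names are made up for illustration:

```python
# Change data capture, conceptually: instead of re-storing the full
# snapshot every load, diff today's snapshot against what we already
# have, keyed by id, and keep only the inserts and updates.
existing = {1: "alice@old.com", 2: "bob@x.com"}          # current state
snapshot = {1: "alice@new.com", 2: "bob@x.com", 3: "carol@x.com"}

changes = {}
for key, value in snapshot.items():
    if key not in existing:
        changes[key] = ("insert", value)
    elif existing[key] != value:
        changes[key] = ("update", value)

print(changes)  # only the changed rows flow downstream
```

Only two of the three snapshot rows survive the diff, which is exactly the storage-saving trade being debated here: you spend less space, but you are betting you never need the discarded full snapshots.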
28:07 It's not necessarily that I don't have that data in staging tables. How are you building your ETL? How do you repair something if it breaks, if you don't have multiple different stages? If you're talking about your raw and silver, right, your first two stages, one of them is your historical record of everything from a raw ingestion standpoint. You're not breaking everything down, especially when we're
28:37 talking unstructured data. You're talking data engineering from a Spark developer's standpoint, is what you're speaking to right now; I'm talking from a SQL developer's standpoint initially. So, in the SQL world, they wouldn't store every record on every day for historical purposes, would they? Where else would you have your historical record? Wouldn't they just kick that out somewhere else and store it in a separate database? God, it's been so long since I've
29:09 done that. But to your point, though, the architecture is changing, right? You went to the right place, I think. So, all of this to say, let me bring my point back here and wrap it up. I agree with you: at some point in time you would not be able to store all of your data, correct, all of those things, correct; you have to be cognizant of how much space you have. So my point is, you're building processes to say: look, here's the production server, I'm going to capture what records were added and which ones were deleted, and then I'm
29:39 going to look for that information and make a reporting system that is only tracking the changes of those records, which is an efficient store of records when you have a limited amount of space to do these things in. And a final point: where they struggled, and still do, is unstructured or semi-structured data. Yes: I have a complex JSON object inside a column of a cell; how do I get that out? There's no easy way to handle that inside SQL. And what's interesting is, if you
30:10 SQL and what’s interesting is if you look at the evolution of development and storage like developers were all over unstructured data because they found that they could build Json objects systems that were extremely fast without the need of SQL servers tabl right so while you obviously can’t get away from that in certain scenarios where pure hardcore like you need accurate data all the way down to like there are no
30:41 all the way down to like there are no mistakes like SQL those systems are are better than anything else out there right even even in terms of there’s there’s too much that can be done in an unstructured realm that for High what what’s the word high security High you what’s the word high security High accounting level things tables and know accounting level things tables and sequels still where you want to be I sequels still where you want to be that’s probably why you see most of mean that’s probably why you see most of the banking industry suck jumped on tables and have all their like that’s financials you can’t be losing money
31:11 anywhere but those are production systems right we're in the realm of hey we can go but to your point though Seth I think you've hit a note here that really works well when the whole world ran on backend systems backend servers SQL engines and structured tables right that made sense but all of a sudden you get this thing called an API showing up and everything is API based and we want the websites to perform faster and so the API says look I don't need a table anymore actually what I want is a
31:42 JSON object and so that's where we start getting a really heavy load like where I see most of the unstructured data showing up it's coming out of APIs it's API based data unstructured semi-structured information the ability to index them and quickly find the file that you need and make an update yep and this is where I think technologies like MongoDB and Cosmos DB come in these are tools that are using a store of files and I can query across them and return results from those files so now the whole databasing architecture can go
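A toy sketch of the document-store idea Mike attributes to MongoDB and Cosmos DB: keep each API response as a JSON document and maintain a small secondary index so a lookup never scans every file. All names and records here are invented for illustration:

```python
# Documents as a real store would hold them: one JSON object per id
documents = {
    "order-1001": {"status": "shipped", "customer": "acme"},
    "order-1002": {"status": "open", "customer": "acme"},
    "order-1003": {"status": "open", "customer": "globex"},
}

# Secondary index: field value -> set of document ids
index_by_status = {}
for doc_id, doc in documents.items():
    index_by_status.setdefault(doc["status"], set()).add(doc_id)

# Query goes through the index instead of scanning every document
open_orders = [documents[i] for i in sorted(index_by_status["open"])]
print(len(open_orders))  # 2

# An update touches one document and its index entries, not a whole table
documents["order-1002"]["status"] = "shipped"
index_by_status["open"].discard("order-1002")
index_by_status.setdefault("shipped", set()).add("order-1002")
```

The design choice is the one the episode calls out: higher volumes of files, cheaper to get a single record out, at the cost of the rigid guarantees a relational table gives you.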
32:14 to a higher volume of files but they can do it at a much cheaper rate it takes less compute to get the data out and I think that shift has changed a lot of how we build things and also the concept of APIs so let me finish my point around the whole SQL lakehouse thing Tommy hold on to your point write it down so you don't forget it because I'll let you jump in here all right my whole premise here around SQL or lakehouse was we need to educate new users to this
32:44 concept of what is a medallion architecture now people will have a lot of opinions on what it should and should not be ultimately what I think it is is your raw or your bronze layer is where you land the initial files and you just try and get something you can read and look at the silver layer however many layers of silver that you need people think it's just one table no it could be many tables that stack together to build anything in silver but silver to me is the cleaning the transforming some people call it
33:14 transforming some people call it transformed right some people will say raw transformed and then curated right those could be the same levels of data what you’re trying to do but what you’re doing is you’re not copying the data instead you’re enriching the data you’re you’re providing a pattern to that data maybe you’re doing change data capture maybe you’re doing remove duplicates maybe you’re cleaning some records in the but that’s silver and the reason I think that’s so important is because the silver area is if we’re talking that that is the cleaning area when I have problems with bad data or I find that some bad data got through to my powerbi
33:45 reports where should I be looking to solve that problem I should be looking at the quote unquote silver tables where I'm doing the transform so one thing about this whole medallion architecture is it requires you to have designed areas to do certain activities that way when something breaks you know where to look so to me that's really important and then finally the gold layer is okay now I'm joining multiple tables together I'm doing a lot more of
34:16 facts and dimensions and I'm basically building the data in a way that would say I'm happy with how this data is transformed I'm ready to share this externally with other users and we can then show a path of here's how I brought in the raw data here's how I transformed it in silver and here's the final output of data to gold sorry that's all I wanted to bring out but I wanted to point out that this is a new thing that data engineers from Spark are trying to do but the business user has no clue of any of this and so we just have to teach
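Seth's bronze/silver/gold walkthrough can be sketched in a few lines of Python. The layer names follow the episode, but every record, cleaning rule, and function below is a made-up illustration, not Fabric code:

```python
# Bronze: landed raw, duplicates and junk included, just readable
bronze = [
    {"id": 1, "amount": "10.0", "region": "east"},
    {"id": 1, "amount": "10.0", "region": "east"},   # duplicate row
    {"id": 2, "amount": "5.5", "region": "EAST"},    # inconsistent casing
    {"id": 3, "amount": None, "region": "west"},     # bad record
]

def to_silver(rows):
    """Clean + dedupe: the designated place to look when bad data slips through."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is None:
            continue                      # drop unusable records
        if r["id"] in seen:
            continue                      # remove duplicates
        seen.add(r["id"])
        out.append({"id": r["id"], "amount": float(r["amount"]),
                    "region": r["region"].lower()})
    return out

def to_gold(rows):
    """Join/aggregate into shareable, model-ready shapes."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'east': 15.5}
```

The point the episode makes survives the toy scale: each layer is a designed area for one kind of activity, so when a Power BI report shows bad numbers you know to look at the silver step, not everywhere at once.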
34:47 them this is the better way to do it and this is how it should work in this new Fabric ecosystem so I want to play the other side as I think about this and my caveat here I am not arguing against the medallion architecture that has been tried and tested for large data ingestion we know this however take the last eight years of Power BI development Power Query maybe dataflows sure maybe some tables to clean
35:19 and digest and obviously then just focus on the semantic model so we'll say
35:25 these people that we're talking about that kind of 80% of that audience that is that type of user great in Power BI great in DAX really understands model development the Kimball method all the things they're great so now we're just adding to their normal workflow that they have already been doing rapid development in and managing their gold models more tools and more steps and more languages and the
35:57 organization is probably just not going to start hiring people who are well experienced in pipelines or with a Synapse background they're going to go great we want Fabric cool and then I think the biggest question I really do have is not so much the user struggling but how much do you add to a process that I don't want to say wasn't broken but there are use cases where I can see I'm like man if I had this back then this would
36:27 have been great right but I also have to have that assumption that I know the APIs if there's no one else there to do that I have to know either notebooks or I could do Power Query in Fabric but there's a big cost where I wasn't spending that before so I think with the medallion architecture I would make more of an argument that for a good portion of users or scenarios that I think we're all assuming is
36:59 lower right now really the data warehouse itself creating that and just being able to push and store that data could be a little more useful right because we're not introducing pipelines and files where maybe there were files but the focus is just getting into a table and making sure that data doesn't change and we know we can do the appending these are a lot of concepts again so then we at least just take one step back in our source of
37:29 truth so to speak from the table and what they can control if we're adding all of this not again not for every report we're not going to migrate every semantic model to a medallion architecture but I'm fearing and I think the argument I'm making on the other side is everything you said is the right way to go how data works in general full stop however for so many things that Power BI analysts are doing
38:00 centralized BI teams or more importantly managed self-service what are you expecting them to do well I don't understand why we're talking about that audience so let me jump in here let me see if I can clarify what you're saying here Tommy let me hear what you're saying and then you can tell me how off base I am on this one you're asking about I have workloads that exist in Power BI already today I'm already loading data in using nothing but import I'm figuring it out I've got
38:30 things coming in what is the path to migrate those users into a medallion type architecture and do they need to always be using Fabric and things that are Fabric-esque such as bronze silver gold and loading data that way so is that what you're asking yes that's a thousand percent it okay so let me give you my answer and then I'll let Seth chew on his answer a little bit here too so what is the migration path for existing Power BI users just because Fabric showed up doesn't mean
39:01 you need to stop everything you're doing and move over from what you were doing previously in Power BI you can still leave those the way they are where I think this becomes more impactful is if you're an organization who already has designed cloud-based data system practices you're coming out of Snowflake you've already got Databricks you already have things in the cloud you already understand the cost of those things and you've already justified the value for those costs there's no need for you to change you don't have to change anything you're
39:31 doing you leave it alone right you just keep doing what you're doing and you just use powerbi.com for semantic models importing data and using reports I don't think you need to change that however if you're a company who's adding a new data source if you're a company who's now trying to integrate with cloud-based things for the first time you're now purchasing Salesforce or NetSuite or something else that's in the cloud as a service and you don't have a cloud presence today
40:01 currently then I think you're more apt to do Fabric and I talked to a lot of other MVPs around this and they said most companies who are starting to use Fabric are companies who never had cloud before and are now starting to dabble in it for the first time by obsolescing some of their homegrown internal systems in favor of I'm now building or buying software that is cloud-based and I need to get the data out so to me that is the decision point for when should I really seriously consider using
40:31 Fabric and there may be some existing companies that do have cloud data and they want to migrate off of their old systems but it only makes sense when it costs me less than what I'm doing right now and so I'm not sure there's a compelling story yet for if I own Databricks and I'm already doing all my data engineering somewhere else I don't think it makes sense yet from a pricing standpoint to fully move over all of my existing workloads 100% into Fabric and migrate them I think there's
41:02 a certain level of this where you have to evaluate what makes sense and what doesn't and again to be very clear stuff you've been doing for the last seven eight years in Power BI may just need to live on right that's okay you can just let those things live on not an issue and then when you need a major upgrade that might be the time to say okay do we need to take a relook at this is there a faster better cheaper way to do this and one point that's not made in this article that Sam
41:32 wrote was you also have to evaluate how many compute units it takes you to do the same workload now I don't know if that's different or not but I would argue potentially there's going to be a difference so if you think about how these different engines are talking to the lakehouse the SQL engine versus the Python engine they're using different languages one might be using Java one might be using C there are engines with different languages that access the lake and inherently each of those
42:02 languages will have a different speed performance so if you do any kind of speed test there are certain languages of code that run faster than others to do certain jobs it's the same job but they run differently you have to go in and when experts go in and have to do okay large data sets accessing lots of data verify that those workloads cost you a certain amount what is the number of CUs for those activities and I think that will help you decide which one of these is better I would like to
42:34 argue and again I don't have any data to back this so this is just my pure speculation right I would like to argue right now if you had a very large New York taxi level data set and you start ripping around on top of the SQL data warehouse and you compare the same queries getting the same data out using the Spark engine I would think the Spark engine will cost you less CUs than the SQL engine because I believe there's less overhead in Spark to go get the data versus the SQL endpoint now I don't
43:05 know if that's true or not so anyone Mim if you're listening I'd love you to go do something like that show us maybe you've done it already I haven't seen it but send me the link in the chat or put it on the video if you hear this at all but I'd like to see some comparisons of here's the same workload done two different ways and the SQL engine cost you a thousand CUs the Spark engine cost you 950 right whatever that is to me there's something there that would push you one way or the other
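The experiment Mike is asking for is, in shape, an ordinary like-for-like benchmark: same workload, two implementations, compare the cost. Capacity units can't be measured locally, so this hedged sketch substitutes wall-clock time for CUs; both functions and the dataset are stand-ins, not Fabric engines:

```python
import timeit

# Same workload implemented two ways, playing the role of "engine A"
# and "engine B" in the comparison the hosts describe.
data = list(range(1_000_000))

def loop_sum():          # stand-in for engine A: more per-row overhead
    total = 0
    for x in data:
        total += x
    return total

def builtin_sum():       # stand-in for engine B: less overhead
    return sum(data)

# A fair comparison requires the same answer from both paths
assert loop_sum() == builtin_sum()

t_loop = timeit.timeit(loop_sum, number=3)
t_builtin = timeit.timeit(builtin_sum, number=3)
print(f"engine A: {t_loop:.3f}s  engine B: {t_builtin:.3f}s")
```

The design point carries over: only identical results make the cost numbers comparable, and the gap between engines typically matters only once the data volume is large enough for overhead to dominate.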
43:36 and as that data volume size grows maybe there's a widening gap between what SQL can do and what Spark can do and I think that would be maybe your decision point later on but that's super tech that's for a data engineer right in most cases I think business users are not dealing with hyper volumes of data to the tune of billions of rows of data business users are dealing with thousands of rows hundreds of thousands of rows millions of rows of data right that's a space where I think they would be very comfortable and at that point
44:07 the deviation between the two different engines may not be enough to even make a difference it's meaningless right it's not even worth the time just use what works best for
44:15 you right so sorry you're asking a very simple question Tommy and I'm really diving into the thought that's how my brain works when I'm thinking through what engine do I care about okay so I'm just going to recap into this one sorry I said a lot of things well Tommy you're bringing up the business user the analyst who's already doing the stuff and I guess the summary fits in here before
44:47 I answer when comparing lakehouse versus data warehouse and the ways we've described the two systems and then how they're going to work in Fabric ultimately the lakehouse offers much cheaper storage without capacity limits adds a lot of the data warehouse type features but also incorporates a lot of the flexibility that was not there in warehouses and obviously in areas of trying to drive for cost
45:17 efficiency I think Fabric is bringing all of the Microsoft services together the key linchpin that we've talked about is all of those services now sit over Delta tables right so you're introducing that cheap layer of storage or the structure of those tables on cheap Azure Data Lake Storage so that you can work with unstructured data you have access to easy ML you have
45:48 all the engineering things you have the cheap storage you have flexibility of languages all of that for the hardcore data engineering systems all in the same place to create the gold standard right all of those things are the enterprise systems that you would need to do that but for business users you still get Delta tables regardless of medallion architecture because that's not part of the conversation even though they're
46:19 doing it they're still doing ETL right they're still connecting to sources they're just building everything in that report layer and I think that falls into the conversation of hey when that gets to a point where it's being widely shared throughout an organization or has a ton of value business intelligence teams take a lot of interest in what's going on there right especially if it's not performant as
46:50 well so there is no change for them they don't have to learn all of these other things in this way while I would agree with you in many cases that we're really throwing a lot of information at these users I don't think this is one of those because it's just OneLake and Delta tables for all of the stuff so you still get the scale which they didn't have before you still have cheap storage and they already have their gold
47:20 because that's what they're building ultimately their little microcosm of connect transform and now I have my model is the same thing that's being done albeit slower and with more data in these other bronze or silver layers as part of the other aspects of Fabric and other users that are going to be using those engines that's my perspective I think it's a good one I
47:50 like that one a lot I think it does a good job summarizing if you don't agree you ask the question no I do agree I'm talking to Tommy oh go ahead sorry he's making faces honestly it's just so nice talking to people who are also nerds I don't know who else is talking about this in the morning no one because everyone else is sleeping at this point so any reactions Tommy before you want to give some final
48:20 thoughts on this one honestly I'm really looking forward to where this is going I love the articles make sure to check out Sam's article and also there's a link to his previous post where he really dives into the Polaris engine deep dives into really the back end which is pretty interesting so there's so much to learn we'll talk about governance as we continue on in the future but I will so
48:51 let me give you my final thoughts here so governance is another whole ball of wax that we need to talk about too because I think the engine choice also helps out with governance pieces as well at the end of the day right it has to be a progression I'm not sure the data engineer and the data scientist are running to Fabric saying this solves all my problems or does everything I needed to do they already have tools today that they've been doing data engineering in that they're probably comfortable with currently however for the crazy amount
49:22 of business users that have been using Power BI this gives a huge amount of opportunity on tools that are enterprise grade so I really like this story this article when we talk about this is the grow-up story this is what we were asking Power BI to do a while ago grow up do something like you're going to use Power Query to get data into something let me store it let me hit it with anything you want and again I'll keep leaning on this one the magic the secret sauce here that we're talking about is it doesn't matter if it's
49:53 Lakehouse it doesn't matter if it's Spark SQL Kusto I don't care what the compute engine is all of them talk Delta and use OneLake yeah to me that's really the value add here again it's the value add of it doesn't matter what engine you're using use whatever you're comfortable with and Microsoft will optimize it in a way that all of them can talk to each other including the Power BI side one thing I do take issue with a little bit in this article here at the very end is inside this example at
50:25 the very bottom he said when to use what at the endpoint and there's files tables and a SQL data warehouse so bronze is files silver is table based formatting and gold is a SQL data warehouse at the end I think this is just one of many patterns you can use so I think it did a little bit of a disservice by saying this is how I see it running I think it's worthy to note that this is one way to think about how to do your data because this doesn't account for Direct
50:55 Lake which I think is an incredible game changer so Direct Lake allows the VertiPaq engine to directly read the lakehouse to me that's a huge improvement and if I don't have to spend time every day computing a load on a table I can shift the compute work that I would have done for loading for hours at a time over to data engineering and shape my data upstream to get it ready for loading in a more efficient way so I
51:26 think this is an early look at what the potential architectures will be and we need more examples and more time to really build out what we want so good article good thoughts use what engine makes sense best for you and when you get to really big scale data that's when you really need to evaluate which is the right engine to do the job because when you start talking large amounts of data you have to seriously think about the engine okay with that you've burned
51:56 through a perfectly good hour of time thinking through and talking through all the things here we appreciate your time thank you so much for listening to the podcast our only ask is that you share it with somebody else with that where else can you find the podcast you can find us on Apple Spotify wherever you get your podcasts make sure to subscribe and leave a rating it helps us out a ton if you have a question idea or topic that you want us to talk about in a future episode head over to the powerbi.tips podcast page leave your name and a great question finally join us live every Tuesday and
52:28 Thursday 7:30 a.m. Central and join the conversation on all the PowerBI.tips social media channels all right thank you all and we'll see you next time
Thank You
Want to catch us live? Join every Tuesday and Thursday at 7:30 AM Central on YouTube and LinkedIn.
Got a question? Head to powerbi.tips/empodcast and submit your topic ideas.
Listen on Spotify, Apple Podcasts, or wherever you get your podcasts.
