PowerBI.tips

Data Ingestion Part 2 – Ep. 223

Data ingestion is the start of every analytics project—and the place where ‘quick wins’ can turn into long-term technical debt. The real challenge isn’t pulling bytes from a source system; it’s choosing an ingestion pattern you can run repeatedly, monitor confidently, and hand off cleanly.

In Ep. 223, Mike, Tommy, and Seth continue the ingestion conversation with a Fabric mindset: when a pipeline is the right fit, where dataflows shine, and when Spark/notebooks are worth the added flexibility. The goal is simple—pick the right tool intentionally, then standardize the pattern so teams stop reinventing ingestion on every project.

News & Announcements

Main Discussion

Topic: Data ingestion patterns (pipelines, dataflows, Spark)

Ingestion decisions often get made under deadline pressure (‘we just need the data’), and the consequences show up later: brittle refreshes, unclear ownership, and inconsistent patterns across teams. This episode is about defining decision criteria up front—so you choose the right ingestion tool on purpose, and then reuse the pattern across projects.

Key takeaways:

  • Use a decision guide to force the right questions early (volume, latency, transformation depth, source constraints, and downstream consumers); a minimal sketch of such a guide follows this list.
  • Pick the simplest ingestion tool that meets the need—unnecessary complexity shows up later as failed refreshes and support burden.
  • Separate landing from curation: ingest and validate first, then shape for consumption in a curated layer.
  • Standardize the pattern (naming, ownership, scheduling, and promotion) so every project doesn’t invent a bespoke pipeline.
  • Make operations non-optional: monitoring, alerting, retries, and clear on-call/ownership are part of the design.
  • Keep business logic out of the ingestion layer where possible—use ingestion for movement + light shaping, and curate downstream.
  • Use Spark/notebooks when you truly need code-level flexibility, but treat them like production assets (governance, tests, and repeatability).
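
A minimal sketch of such a decision guide, in Python purely for illustration; the rules and thresholds below are hypothetical placeholders, not prescriptions from the episode:

    def choose_ingestion_tool(volume_gb: float, needs_code: bool,
                              heavy_transforms: bool) -> str:
        """Toy decision guide for picking a Fabric ingestion tool.

        The thresholds are illustrative placeholders; tune them to your
        environment. The point is to force the questions, not the numbers.
        """
        if needs_code:
            return "Spark notebook"        # code-level flexibility required
        if heavy_transforms and volume_gb < 100:
            return "Dataflow Gen2"         # low/medium volume, rich transforms
        return "Pipeline copy activity"    # simple, high-scale movement

    # Example: 500 GB, no custom code, light shaping -> pipeline copy activity
    print(choose_ingestion_tool(500, needs_code=False, heavy_transforms=False))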

Looking Forward

Pick one ingestion pattern your team can repeat, document the rules, and apply it to the next dataset so ingestion stops being a constant architecture debate.

Episode Transcript

0:31 Good morning everyone, welcome back to the Explicit Measures podcast with Tommy, Seth, and Mike. Good morning, gentlemen, happy Tuesday to you both. I feel like in general, right after Microsoft Build, there are a lot of great announcements, and this has been a pattern for the last two years. They've been doing a lot of announcements around Power BI at Build, and leading up to Build there are not

1:01 a lot of features being developed, because they're holding them all back until Build occurs. Then Build occurs and we have all these new things to discover. The next couple of months after Build are always exciting, because some of those features start getting rolled out into Desktop or the other applications, which is fun. Do you guys feel the same? Do you observe the same excitement after Build, with additional features and things coming out?

1:33 What they talked about with Fabric was a big announcement, and this is going to be a long rolling discovery. That in general has really changed the landscape around what Power BI is doing, and I'm getting a lot of questions from people: what is this going to change, what's going to occur now that we have all these extra data engineering tools inside Power BI? But in general, it feels like the Microsoft blogs about features that are not as Fabric related are

2:04 only coming out now. As a reminder, a couple of things from the May feature updates: Power BI Embedded with Microsoft Fabric, and the retirement of streaming dataflows, which I never really used very much, so I'm not sure how that was doing. I did a POC right when streaming dataflows came out, and it was really cool. We had a guy

2:35 who was really into custom-building electronics and sending signals, and he proved out this car locator thing, then did a whole experiment where he drove around and we tracked all of it in real time. It was amazing. But I don't know, the real use case scenario just didn't pick up or take off there. And I agree, it makes sense, because Build being the

3:06 major developer conference, ironically enough, this is very business centric, which is not really developer centric, so it gets a bit lost there. But it's twofold: there are the big exciting features that start getting rolled out, as well as the new additions. One that I wanted to talk about today, from a blog by Chris Webb, was

3:37 all of the ones that were released that nobody noticed up until the lead-up, because they almost had to deploy certain parts of this into the ecosystem without setting off the "hey, something brand new and amazing is coming down the road." The reason I'm saying this is I just discovered, via a tweet from Armand, that there's a brand new visual in the May version of Microsoft Power BI Desktop. So Desktop is now out,

4:08 you can go download it today. Sorry, it is now the June version of Power BI Desktop, version 2.118. It's out, you can go look at it. Inside there we have a brand new card visual. A lot of reports have just a couple of KPIs across the top, one KPI, two KPIs across the top, and this automatic card will let you drop in multiple measures so you can build multiple KPIs all in one

4:38 visual. It takes multiple cards and merges them into one large card, which is cool. It's very interesting; we caught it before the blog even came out. That's the advantage of me starting at 7:30 in the morning, I guess, and the advantage of joining us live while we discover all these things. There you go, exactly, we don't even know what we're doing yet. Well, you are ahead of the game. So anyways, that's

5:08 a pretty cool feature. I like the card, I think it's going to be really useful. I do use a lot of cards on reports and I feel like this will be very helpful. So, one of these same ones, and I don't know if you guys mind, I'll paste the link to Chris Webb's blog in the chat window. We were just talking about how we re-save or save Power Query code a couple of minutes ago. Chris outlines for us that there

5:39 was a new feature that was already introduced quietly: Power Query templates. What's really interesting to me about this one is I think I'm going to make a hard rule that every business user has to use Power Query from now on, for anything you're building. So what this is, for listeners: it's essentially a template. You can download a PQT

6:09 file that takes all of the queries you generated within the Power Query ecosystem, extracts them, and puts them into a new Dataflow Gen2 in Fabric. I didn't hear that it was recognized by Power BI, though. It is; not Desktop, but it is in the service, in dataflows. What's crazy about this: why isn't this

6:39 in Power BI? Why isn't this in Desktop? Microsoft, get it in Desktop, please. It would have been a very easy, short conversation: add this feature to Desktop, it's in there, and we don't have to worry about it. Alex, if you're listening... yes, exactly. He comments: PQT, binary? Hmm, unzip it. Oh, okay.

7:09 So I do like these things. I know it's been around in Excel for quite a long time, because when you export things it's there, but it has not been exposed to anything in the Power BI realm until recently, and that was in Fabric: Dataflow Gen2 is where I first noticed this PQ template. I actually found it in Alex Powers's introduction to Fabric. In the Fabric introduction classes, Alex uses this to get you started inside Power Query:

7:39 hey, go download this file from the GitHub, upload this file here, and there, it works. And it's just all code. From what I found, you unzip it and it's just straight code inside there, like a JSON object. So there you go, Tommy, there's another nail in the Desktop coffin. As Alex says, don't waste a single minute on Power Query in Desktop; bring Power Query Online to Desktop, even though this is enabled in Excel, where you can export it. Well, not for you,
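
For the curious, a minimal sketch of inspecting a PQT file the way the hosts describe, written in Python; it assumes the file is a standard zip archive, and the file name is a placeholder:

    import zipfile

    # A .pqt (Power Query template) is a zip archive, per the discussion above.
    # List its entries and print each one to see the raw mashup code and JSON
    # metadata inside. "MyQueries.pqt" is a hypothetical file name.
    with zipfile.ZipFile("MyQueries.pqt") as pqt:
        for name in pqt.namelist():
            print(f"--- {name} ---")
            print(pqt.read(name).decode("utf-8", errors="replace"))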

8:10 Tommy, not for your tool; you'd need Excel, or online-only Excel. Either way, cool thing: read the blog, check it out. Templates in Power Query: get them, use them. I like them as well. We have a couple of other quick announcements here. Tommy is now officially a Microsoft trainer, so good job, Tommy. You're an MCT, is what it's

8:41 technically called. What does it stand for? Microsoft Certified Trainer. There are a lot of hoops you've got to jump through: you've got to record yourself doing a training, and do it a certain way, in order to get the certification from the organization. So congratulations on that. Thanks, man. It's really exciting, because with all the training I've been doing for clients and for conferences, this just makes sense. Now, with the Microsoft Certified

9:12 Trainer title, maybe there's a little more trust when I go into unknown territory, like "oh, maybe he knows what he's talking about," in case the MVP title doesn't exactly allude to that. But honestly, now I have access to all the actual certifications, the exams, the lab modules, and the sources, so that's really cool. There are three Power Platform courses, or cert exams, right now;

9:42 it's amazing the content that Microsoft has, obviously everything with Azure. Nothing for Fabric yet, but I'm sure it's coming. I agree, give it time. That's exciting. When I want my next certification, can I come to you? Yeah, I can help you out. Train me up, Tommy. I don't have anything else from the Microsoft blog that came out, I don't have any other incredible announcements, unless you guys

10:13 have any other things you want to pick up on. There's this one from Fourmoo, Gilbert out of Australia, talking briefly about creating a hierarchy for field parameters for easy navigation, a neat little tip and trick to help yourself out. Have you used this before, Tommy? Obviously you've used field parameters, but the way Gilbert did this is actually really neat: use field parameters to create a hierarchy for choosing your measures, which I

10:46 really like. It's just sales and orders: you pick which things you want to see in a measure. It's one of those not-quite-hacks that people find out: you can combine field parameters and dynamic measures into something really neat, purely for user experience. It's another neat tip or trick, more of a user functionality thing, something you may not have known

11:17 you could build, but in reality it totally makes sense: if you're going to have a list of measures, you may want to order them. Here are all my sales measures, and here are all my orders measures. So a really good article as well, maybe another one that you want to check out. I don't think I have any other ones that were really major at this point. It's always hard to keep up with things, that's one of the big challenges. This is why I like using jam.powerbi.tips, because I'll frequently go there, and as I find

11:48 articles I just stick them there and forget that they're there. Then I think, oh shoot, there's a whole bunch of articles here, I need to go back, check them out, and see what I saved. So your five minutes is: how fast can I rip through the internet and find what's interesting? Exactly right. The first parts of articles that catch my interest, I'll add them in there, and then when I have 15 minutes, which only happens when you're looking at it on the podcast, you rehash what

12:20 you put in there. There's so much good stuff. Tommy, you had a ton of stuff in there, like Reza Rad coming out with data lake versus warehouse versus datamart, people talking about data governance in Microsoft Fabric, what are roles and domains. There's this new feature that came out with Fabric, a domain, which is like a collection of workspaces. It's good stuff, really great articles. David Eldersveld is doing 40 days of Fabric, and he's starting off with day one: what is Fabric? There's a lot of really interesting content coming out.

12:51 Don't forget to check out jam.powerbi.tips; we have a new Fabric nested collection. Good job, Tommy, check it out, love it. I think we already have 36 articles for just Fabric. Let's go, let's get some stuff going. I'm sure that will be filling up fairly quickly as people get around Fabric, understand it, and start really building on top of it. I didn't include any of the actual Microsoft learning content so

13:24 far in that. But just trying to navigate Microsoft's documentation for Fabric, I feel like their navigation leaves a little to be desired, because you go through the end-to-end tutorials but then you're like, wait, am I in Azure Synapse now, or am I still in Fabric? Did they just copy-replace a name? Where does this go? What dataset am I using? Like we talked about: when Fabric got launched, they came out of

13:55 the gates running with their documentation and their guides. So this is a good lead-in to our topic for today. Last week on Thursday we talked a little bit about how you decide: in Fabric you have the ability to make a pipeline copy activity, you can do a Dataflow Gen2, or you can use the Spark engine. Each of these areas targets a different audience, a different persona. A pipeline comes

14:27 from Azure Data Factory or Synapse. Dataflows come directly from Desktop, something we're familiar with. And then the Spark engine is the data engineering realm; it came from Synapse, because that was really the only major way you could interact with a Spark notebook, so that's another area of investment. Looking at this, we're trying to extend the conversation from before, because we talked a lot about this one article and we never really got to the different user personas. I'll put the link to the article

14:57 right here in the chat window so everyone can enjoy it and read it again, or go check it out. At the bottom of the article there's scenario one, scenario two, and scenario three. So Seth, maybe you want to take us through a little read-through on this one. I'd like to talk about the different personas that are here, and maybe some gaps that we see in the descriptions of these personas; maybe pick them apart a bit.

15:27 "Scenario one: Leo, a data engineer, needs to ingest a large volume of data from external systems, both on-premises and cloud. These external systems include databases, file systems, and APIs. Leo doesn't want to write and maintain code for each connector or data movement operation. He wants to follow the medallion layers best practices with bronze, silver, and gold. Leo doesn't have any experience with Spark, so he prefers the drag-and-drop UI as much as possible, with minimal coding, and he also wants to

15:58 process the data on a schedule." Are we good right there? Do you think we need to go to the next paragraph? Well, let's talk about this one, because I just got really confused. It started with "data engineer," which puts me in the realm of a coder, and then it says he prefers the drag-and-drop UI. Does he really? That hits the other comment I was going to make, which was that my voice just immediately went into audiobook mode.

16:30 Yes, that's what you get. Seth, I want you to narrate the directions on my GPS now. "Turn left. Turn left." Exactly. Unfortunately, these scenarios read very similar to those Chevy commercials, where they show "are you a modern teenager about to be 25?" It's like, "yeah, I'm 25 and I'm thinking of having a kid." "Well, there's this perfect part of the car for you." "Well, I don't want to be constrained by gas, so we've added all

17:01 these new features." It sounds like the personas were built from the features, rather than the other way around. It's backwards. Those Chevy commercials, I can't stand them. You must be a Ford guy, then. Everything is "he wants to process data on a schedule, he wants feature, feature, feature." And I agree about the data engineer, because that also goes back to the personas: everybody's a data engineer. Okay, so let's

17:32 read between the lines here and not get hung up. Maybe the confusing part is that we got hung up in our last conversation on the name, the job title. Essentially, this is a person who's doing some ETL; apparently they don't do it a lot, or they only do it in UI-type environments. In this scenario I would drop that descriptor. Exactly, I love how you said that, Seth.

18:02 I would for sure drop the APIs. This user does not understand, use, or want to use APIs if they're doing a drag-and-drop experience. If you're grabbing databases and files, that feels like the persona: I'm a user who's trying to absorb data from a database, and there are flat files coming from somewhere that I've got to ingest. So I am going to read the second part, because it just adds to my confusion.

18:32 Okay, so Leo is going beyond; here's more about it. Now we know Leo apparently understands databases, file systems, and APIs, but still wants bronze, silver, gold, so he knows those concepts, but he only likes drag-and-drop UI in terms of implementation. That's already a struggle for me. "The first step is to get the raw data into the bronze layer lakehouse from Azure data resources and various third-party

19:03 sources like Snowflake, web, REST, AWS S3, GCS, etc." Because Leo can have access to all that, exactly. "He wants a consolidated lakehouse," because he knows what that is, "so that all the data from various LOB (line of business), on-premises, and cloud sources resides in a single place. Leo reviews the options and selects pipeline copy activity as the appropriate choice for his raw binary copy. This pattern applies to both historical and incremental data refresh," both concepts he knows. "With copy

19:33 activity, Leo can load gold data to a data warehouse with no code if the need arises, and pipelines provide high-scale data ingestion that can move petabyte-scale data. Copy activity is the best low-code and no-code choice to move petabytes of data to lakehouses and warehouses from a variety of sources, either ad hoc or via a schedule." If you are ad hoc moving petabyte-scale data, boy, I'm going to have a conversation with you. Okay, so where are we

20:05 at with ad hoc and scheduling? I could understand doing an initial load. When I see ad hoc loading, I feel like it comes from "hey, we're going to turn off this server, we need to go get all the data from it, so build a process to load everything from it once." That's your ad hoc, I feel, and there may be other use cases around that. But to your point, Seth, we feel like we're mixing a bit of

20:36 business user and a bit of data engineer. I'd come very close to your argument: if Leo understands all these multiple concepts, and he understands the concept of putting all the data together, we're still solidly talking about a data engineer who knows how to write code. I would expect the training for that persona to be more SQL and database, writing stored procedures, being able to do those things. And I agree, to me it's not

21:07 necessarily that you have to be a coder, but a lot of these are really large concepts. Okay, maybe let's just read down through the table: where would you use a pipeline copy activity? Maybe this is just describing that in this use case we don't have to do heavy transformations or anything.

21:30 A pipeline copy activity is what you would be shooting for, because you don't have to do any of that lifting. And honestly, the copy activity itself is extremely easy to implement. If all you have to do is move data from one location to another, that's your bag. You don't have to create some custom Jupyter notebook or transformation or data movement process; it's literally just drop an activity, connect to a source, put in a destination,

22:01 and that's it. But the pipeline, I believe, has all the activities that Azure Data Factory has in Synapse, when it comes to variables and parameters. So yes, for a basic copy, fine. But the people who have been accustomed to the pipeline UI in the past, who are they?

22:31 They live in Azure, they live in Data Factory or Synapse. So I guess my confusion here is: this is the entry point, right? This is the lowest level of user knowledge base to use something in Fabric for copying data, dataflows, etc. So this is your lowest common denominator related to

23:03 utilizing some of these tools. I would argue that you shouldn't start with pipelines. Of the experiences that I enjoyed most: when I learned things I started with Power Query, so I basically started with dataflows. Of the user interfaces that make the most sense for building basic loading of data, I'd honestly argue dataflows is actually a better first step than going all in on a pipeline. Pipelines, yes, they do have, Tommy,

23:34 what you said: parameters, and they have variables. But to be honest, it took me a little bit of time to get my head around these. I would flip-flop these. I would also argue that scenario one, again, I'd like to see a persona here that's more of a business-user-centric persona. Is there one in here? There's a business analyst under dataflows,

24:04 or did we miss that in our previous conversation? No, I'm talking about the scenarios, though. Each scenario is Leo, Mary, and Adam, and they're all data engineers. I don't think they slipped it into the bottom portion; it's in the table. Yeah, it is in the table; they have a business analyst in there. And that's where, honestly, I would start. To me the most simplified experience is dataflows: I have data, I need to pull it together. And when I talk about pipeline activities, what was

24:35 before pipelines? It was stored procedures. To me the closest data engineering exercise is SSIS; it's SSIS, it's stored procedures, that system of automation for loading data in. We can now just replace that with pipelines. So for organizations, if you have those SSIS procedures, a pipeline is a good replacement for those things, and it can orchestrate multiple things together. So let's lean into this

25:06 one a little bit, because I agree. Scenario two outlines what you're talking about, Mike, and where my misconception went first, which is: if I'm presenting this to an audience that has absolutely no idea what you're trying to tell them... Yes, call it Fabric, but lay it out from the business user's easiest level, to intermediate, to professional enterprise. Everything should be in the vein of "hey,

25:36 you're a curious business user, here's step one, this is you; this is the next person; this is..." If all the documentation aligned to easy, middle ground, hard, then I think the adoption, the business folks who had the drive to learn more and go deeper, this would resonate more with them. And this is the second time we're talking about

26:06 this, just realizing, oh yeah, that's not really aligned in that regard. And I like how you're approaching this, Seth, because Fabric is coming to Power BI. You're already starting with Power BI, which is a business-focused type of tool. And this is the story we've been complaining about on the podcast: we need more of the grow-up story of Power BI, and this is the support for that

26:36 grow-up story. I like this, but we have to think about who our current user bases of Power BI are. It starts with the business user, it starts with Power Query, it starts with "I'm doing everything in a single file." I'm not thinking about medallion architectures; I'm starting with "I load data, I just drop it into a workspace, and done, we move on." This fits very well with our moniker of "act like the business but think like IT": you start building things today, add value right now, but then you have to think about

27:06 what sustainability looks like, how I refresh the data, what happens over time, those things. I keep cutting you off; go ahead. No, I think the biggest thing for me, where I'm getting frustrated with the pipelines here, is they do this very simple action: copy from a folder in an Azure blob. And I would disagree that we can just do this in dataflows. Trying to get multiple files from multiple folders is not a great

27:38 solution in a dataflow, especially if they're in different formats. It's really just meant to, in a sense, move it over, like a parquet file or CSV to a Delta table. That's not what even Gen2 Power BI dataflows is going to be good at, iterating over multiple files in a folder. It can do it, but when we're dealing with, let's say, Azure blob storage, which is where the general template is here,

28:08 and you have seven folders, I think, with the Wide World Importers sample, each with their own yearly files, that's a lot more work to do in a dataflow, and I wouldn't really recommend it just to move data over from the Azure blob to the lakehouse. Well, I think I agree with you, but I would also bring things up one layer higher than what you're talking about. To the end user, we don't really care what's under the hood.

28:39 The end user doesn't care if it's OneLake or blob storage, or tables, or Delta; all of this should be transparent to them. One of my complaints with what Power Query does today, and I think Dataflow Gen2 potentially starts to solve this problem, is: if I'm going to bring a bunch of information down to my Power BI environment, I have "what is the value of a record right now," and I have slowly

29:09 changing dimensions. Those are my main two concerns as I work with data systems. So make it very easy for me to go get the current version of a record, the key and the most recent value of that record, and give me the other view of that record where I can see "here's the value of that record, here's how it changed, and what day it changed on." I don't care about everything else technology-wise; I just want a table to show up so I can use it. That's literally all I want as a business user.
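
A minimal sketch of the two views Mike is asking for, written in PySpark; the table and column names are hypothetical, assuming a type 2 slowly changing dimension stored as one row per record version:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical SCD type 2 table: one row per version of each record,
    # with effective_date marking when that version took effect.
    history = spark.table("dim_customer_history")

    # View 1: the current version of each record (latest effective_date per key).
    latest = history.groupBy("customer_key").agg(
        F.max("effective_date").alias("effective_date")
    )
    current = history.join(latest, ["customer_key", "effective_date"])

    # View 2: the full change history, every version of every record,
    # ordered so you can see how each one changed over time.
    changes = history.orderBy("customer_key", "effective_date")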

29:39 So I think a lot of these personas start talking about the underlying decisions and architectures, things that make more sense for the enterprise, but I'd like to start more at that high level: I just need a table, that's all I want. Sure, so let's talk about scenario two. I'll go through it. "Mary is a data engineer with deep knowledge of the multiple line-of-business analytical reporting requirements. An upstream team has successfully implemented a solution to

30:09 migrate multiple LOBs' historical and incremental data into a common lakehouse. Mary has been tasked with cleaning the data, applying business logic, and loading it into multiple destinations, such as Azure SQL DB, ADX, and a lakehouse, in preparation for their respective reporting teams. Mary is an experienced Power Query user, and the data volume is in the low to medium range to achieve desired performance. Dataflows provide no-code or low-code

30:39 interfaces for ingesting data from hundreds of data sources. With dataflows you can transform data using 300-plus data transformation options and write the results into multiple destinations with an easy-to-use, highly visual user interface. Mary reviews the options and decides that it makes sense to use Dataflow Gen2 as her preferred transformation option." I feel like part of that was marketing. The whole thing felt like marketing, especially the way you read it.

31:11 By the way, this is your scenario, though, where dataflows come into play. But what's the difference between one and two? Obviously it's data volume. They put a real emphasis on dataflows being for your small to medium amounts, and that makes sense. I would also argue that scenario two isn't very relevant when it talks about "I need to load the data to multiple locations like Azure SQL, ADX, and lakehouse." Again, this is one

31:42 thing: I don't care where the data goes. Mary should be focusing on loading tables into workspaces that people can touch and use. Mary should be focusing on making a handful of data models for the organization. But Mary's working on the enterprise data, so yes, she's got deep knowledge of the line-of-business logic. Mary should be thinking about "what things do I need to produce for the business? I'm building reports, I'm building data models, or I'm building tables of data that I'm going to let another team go consume and use." That

32:13 should be her role. So talking about Azure SQL: not relevant. Talking about lakehouse: not relevant. To me this is all about generating those tables and figuring out how to secure table access for other teams. What's interesting to me, though, is that this challenges some of the "who's doing what" parts of this. Like, if I'm bringing data into an

32:43 ecosystem... I don't know, I'm confused. There's a whole team. To me, Leo, the team upstream, is the data engineer; Leo is feeding data to Mary. Well, that's the way it reads. There's another team, like the central BI team, and they're doing things at maybe a higher volume; maybe they're getting millions or hundreds of millions of transactions a month, and they're just trying to distill this down for Mary. But Mary

33:15 should be focusing on accessing the enterprise BI, and on what data sources the business needs that are not included in enterprise BI or enterprise reporting. To me that makes a lot more sense: "hey look, I'm going to grab all these sources from enterprise BI solutions," whether lakehouse, SQL, whatever her source becomes, these internal systems, Snowflake, I don't care. All those systems are doing some level of data engineering before Mary touches the data, and Mary goes, "oh, by the way, I have this Salesforce

33:45 table that I need, I'm going to pull that in." So Mary brings in some extra data sources to enrich the data above and beyond what the enterprise team gives her, so the business can do their job. That makes more sense for Mary's role. Would you guys agree, or am I off base on that one? I guess what I'm challenged with, and now we're talking two scenarios ahead as well, is: okay, I've now been provided a layer of information from a different team that I'm going to

34:16 apply business logic on top of, but to what end? Do I now have a table for Mary, and it's the same table that scenario three is going to pick up, but we're just going to do the same transformations at an enterprise level where we need to? Those are two different things. Now I have the business in the same environment, creating artifacts or extracting data for reporting purposes over in Power BI, but then I'm

34:46 also going to have enterprise. In the enterprise space prior to Fabric, I had a very clear understanding of what schemas, what tables, what things were doing what, for the purposes of enterprise-level reporting, and the challenge was taking artifacts the business was creating and, when needed, elevating them. But now I'm going to have both of those in the same ecosystem.

35:17 I'm still wrapping my head around how that works. And all of these roles are treated as mutually separate; I don't think there's ever going to be a situation of one person with one project going only pipelines, only dataflows, or only notebooks. Mike, you had a perfect point about the dataflow person, especially with how rich

36:20 the feature set is with dataflows, and now with Gen2. We've talked about that grow-up story with dataflows, where I can apply more of the business logic and connect to different sources. The end goal for any dataflow in that scenario is usually to structure whatever the previous step produced into the final table, the final data. It's not usually pulling in the raw files, especially now that we have everything with Fabric; it's setting the data types, doing some additional logic with the columns, merging different sources, like you said: okay, Salesforce, we need the Salesforce data to finally come in. And now it's not just in Power BI; it can get pushed to a lakehouse, and for us, obviously, to reporting in Power BI. But the copy data, or at least the pipelines, those are mutually

36:50 exclusive people, I think, because the person who has been doing dataflows exclusively has probably not been dealing too much with the Synapse pipelines; I'm assuming the overlap is small. I would agree, and the same the other way. And I think this is our point here: Power BI does its thing, and what we're doing is absorbing two additional roles into Power BI. We're adding this ability: hey, we built

37:21 in apps, it didn't go so well; we're now bringing the Synapse features directly into Power BI. Hey, we have this data science thing; well, the data scientists need all the data that comes out of Power BI, so we might as well bring that persona, that workflow, into what Power BI is doing. So these two other things are being added into what Power BI is doing. And I could even potentially see here, when you start loading up SQL Servers and bringing SQL Servers into Fabric, the ability to have not only your entire

37:51 data analytical system inside Fabric; you could potentially even add your transactional system, or an app that's built to be transactional, inside Fabric, because now there's a SQL Server. The reason you want SQL Server around is that SQL Server is great for a database that supports an app. That's another element you could bring right into Fabric. So now you potentially have another persona here that's not being talked about, which could be a bit like an app developer, also inside Fabric. So: app developers,

38:24 data engineers, and then the business users. So the story I would like to see happening here between Mary and Leo is the grow-up story of the enterprise: hey look, we are going to give wider access to the business; they're going to connect to a lot of data sources; we're going to define what those sources are; and we're going to give those sources back over to Leo and the engineering team. Mary's out in front of the IT organization, conducting the data sources and making value from data. Leo shows up and says,

38:55 "hey, we're ready to hand off this data transition; show me what you built, Mary." Mary shows Leo all the work that was done inside dataflows, and Leo says, "hey, I can do the same things in pipelines, and then I can own it in a tool that I'm familiar with, and we can now absorb more data." Then Leo jumps in and starts taking over part of those data loading processes. So the story here is: the same table that Mary developed using Power Query, that Power Query transformation can get removed, and those steps can be folded into

39:27 pipelines. The table name stays the same, the data model stays the same, but now we've got an IT, grown-up process that runs on pipelines. To me, that's where this stuff makes sense. I guess we should move on. Sorry, I'll pause there, I said a lot of things. Seth, were you going to say something? No, that's fine; we can move on to scenario three if you want. I'm going to do Adam, our data engineer.

39:57 "Adam is a data engineer working for a large retail company that uses a lakehouse to store and analyze its customer data. As part of his job, Adam is responsible for building and maintaining the data pipelines that extract, transform, and load data into the lakehouse. One of the company's business requirements is to perform customer review analytics to gain insights into their customers' experiences and improve their services. Adam decides the best option is to use Spark to build the extract and transformation logic. Spark provides a distributed computing platform that can process large amounts

40:28 of data in parallel. He writes a Spark application using Python or Scala, which reads structured, semi-structured, and unstructured data from OneLake for customer reviews and feedback. The application cleanses, transforms, and writes data to Delta tables in the lakehouse. The data is then ready to be used for downstream analytics." Why does that specifically talk about Delta tables in the lakehouse, whereas scenario two doesn't, even though the artifact is Delta tables either way? I think the artifact, the Delta tables, is almost irrelevant at this point. If anything, Adam's the one who's actually studied Delta tables.
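
A minimal sketch of the kind of Spark application this scenario describes, in PySpark; the source path, column names, and table name are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source: raw customer review files landed in the lakehouse.
    raw = spark.read.json("Files/raw/customer_reviews/")

    # Cleanse and transform: drop incomplete rows, normalize text, add a date.
    reviews = (
        raw.dropna(subset=["customer_id", "review_text"])
           .withColumn("review_text", F.trim(F.lower("review_text")))
           .withColumn("review_date", F.to_date("submitted_at"))
    )

    # Write to a Delta table in the lakehouse for downstream analytics.
    reviews.write.format("delta").mode("overwrite").saveAsTable("customer_reviews")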

40:59 He understands what's in a Delta table, because if he's using Spark, you have to understand fundamentally what a Delta table is doing. You're spending some effort at that level, going, "okay, I'm going to go in deep on understanding what a Delta table is," and that comes with dealing with scenario three. This, I feel, doesn't even fit on the page, to be honest. Oh really? Yeah, and I'm

41:32 most comfortable with it on its own page. It should be scenario one, scenario two, and then "please click here for something a little bit more advanced." But think about it: how many users who are doing pipelines, advanced pipelines, not just a single action to copy data, obviously probably have experience with Spark and notebooks? But for the dataflow user, how often do you think they're also

42:04 doing that engineering, those transformations, with Spark or Jupyter notebooks, or going to Databricks? Possibly, but they have very different purposes, and each really performs much better in certain situations. Jupyter notebooks taking files and turning them into a table: I've tested this in Fabric against dataflows. If you're trying to take a lot of parquet files or

42:34 a lot of CSV files from a blob or a lakehouse, it's very slow in a dataflow to convert those and do the same transformations you normally would. That's not what you would even think of putting into production just to take the files and merge them into one table.
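
For contrast, a minimal notebook sketch of the bulk pattern Tommy describes, reading a whole folder of files into one table with Spark; the paths and table name are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Spark reads every CSV under the folder (including yearly subfolders) in
    # parallel and treats them as one dataset, which is exactly the case that
    # is slow to iterate file by file in a dataflow.
    sales = spark.read.option("header", True).csv("Files/landing/sales/*/")

    # Land the combined result as a single Delta table in the lakehouse.
    sales.write.format("delta").mode("append").saveAsTable("bronze_sales")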

42:58 So let me give you a pattern that I've been using, and I think you use this pattern very well too, and how I think about this world. Let's start with the pipeline. The pipeline, I feel, is best suited for calling APIs; it makes it fairly easy to call an API. Pipelines also make it a lot easier to call for a token: I have an API that needs a token, and I need to pass that token to a secondary API call that then gets the data. So I really like the

43:28 pipeline for connecting to sources and loading them down to the lake, just getting the raw form of the data. That makes a lot of sense to me.
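
A minimal sketch of the token-then-data pattern Mike describes, written in Python for illustration; the URLs and field names are hypothetical, and in a Fabric pipeline this would typically be two chained activities rather than code:

    import requests

    # Step 1: call the auth endpoint to get a token. The URL and payload are
    # placeholders for whatever the source system actually requires.
    auth = requests.post(
        "https://api.example.com/oauth/token",
        data={"client_id": "...", "client_secret": "...",
              "grant_type": "client_credentials"},
    )
    token = auth.json()["access_token"]

    # Step 2: pass the token to the secondary API call that returns the data.
    resp = requests.get(
        "https://api.example.com/v1/orders",
        headers={"Authorization": f"Bearer {token}"},
    )
    raw_orders = resp.json()  # the raw form of the data, ready to land in the lake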

43:59 What's not being talked about here is that between loading data and reporting on data there's this middle area of the system; it's definitely like a black box. Data comes in, transformations occur to that data, and out come star schema, or star-schema-like, tables that you can then use inside a dataset in Power BI. To me this black-box middle area is: what do I do to transform the data, how am I cleaning it, how am I building my dimensions, how am I building my fact tables? That is best served either through Dataflow Gen2, looking at more of that business-user-centric area, or, if I have more of an IT or central team working on it, or, to your point Tommy, a lot of small files, Spark makes more sense: go grab these other semi-structured or unstructured data

44:31 sets and use Spark, because Spark can do images, you can do other things with it; there's a whole bunch of other things that come along with the Spark engine that add a bunch of value there. And I would also argue: yes, you can connect Spark to basically anything, but I have found massive issues when trying to connect Spark to SQL Servers or other servers like Oracle. There are all these special drivers that you need to include on the Spark system to be able to communicate with those other data source systems. So Spark's not

45:01 the greatest at being able to regularly connect to these other systems. And if you do use these custom libraries, and that's my big beef with what they describe for Spark in the table above, it says there are hundreds of Spark libraries for sources and hundreds of Spark libraries for destinations. Yes, there are, but that really limits how much you can upgrade your Spark engine if you're trying to stay current on Spark, because every library that you bring to Spark has to stay in tune,

45:31 has to stay updated, with whatever version of Spark is being released. If your library is not being updated, your version of Spark gets stuck at a certain level, and that creates all kinds of problems, because the newer versions of Spark are getting faster, more efficient, and can query data with better speed. So you always want to stay close to the newest version of Spark, I feel. And eventually you'll reach a breaking point where your code doesn't work on the next version.
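
A small illustration of that versioning concern, as a hypothetical notebook cell: pin the connector library so upgrades are deliberate and tested against the runtime's Spark version:

    # "example-spark-connector" is a hypothetical package name. Pinning the
    # version makes the dependency explicit; an unpinned install can silently
    # break when the Spark runtime (and its supported library versions) moves.
    %pip install example-spark-connector==1.4.2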

46:01 next version yes so now you’re on versioning and code redeployment correct so if you’re not the one writing the library that you’re going to use to connect the stuff it’s not worth it so anyways that’s my my perspective on these things I I yeah here here’s where where I guess I I’m I’m reading through these different ways to transform data right the obviously different levels of user and their understanding to fit into I guess these three different methods but

46:33 guess these three different methods but where where I’m struggling is does that does that mean we’re creating or have a a base level set of tables that we’re working from right like has somebody because scenario two doesn’t work unless the tables already exist for her to start modifying and then sending data other locations which is still confusing to me yes but okay that person’s gonna start pushing data into different systems I guess the first question I have is is that because I I suppose you’d have a third party or

47:03 I I suppose you’d have a third party or like you’d have a different system where you would need that data as opposed to it just being stored in one Lake and visible for all your reporting needs from a power bi perspective where this where I don’t I’m not putting together the Dots here is if if if the documentation is meant to assist people in making a decision of like how you go about implementing things the use case for which to do so has to be part of it right and it it isn’t

47:34 be part of it right and it it isn’t right and what by that is like if I’m looking at these in the context of a workspace and the permissions allotted to people within a workspace across these ecosystems like and I have a table as a common starting point it’s saying that I could have a user come in called Mary and start doing data engineering activities against those tables for the purposes of the reporting for this workspace workspace for power bi reporting but then moving

48:05 for power bi reporting but then moving data to other locations and what is the difference between that and somebody else who doesn’t have access to the same workspace rebuilding the same thing like because there’s not there’s not inherent visibility across these these workspaces which are delineating the visibility for people to work within the silos and then you’re also assuming that I guess I have the ETL Engineers or the data Engineers that I have an Enterprise focus on them yes create

48:37 an Enterprise focus on them yes create the base tables for these people to work with with and they’re just supposed to see this ecosystem blow up in all the different workspaces where people are just building their own things as opposed to uniformly creating an object that business can inherently understand that is like all of your business logic is wrapped up into this thing and if you need to modify it go ahead but you shouldn’t need to like it almost reads like if if bronze silver gold is the standard in scenario one are they like are we

49:09 scenario one are they like are we creating silver and expecting all these other users in all the workspaces create their own gold layers that’s what I don’t like how are how you butt that up against maybe the mental block that we have which is I would have loved to have seen Microsoft adopt the Matthew Roach architecture Paradigm of how do you level up because every organization he’s a cat team member so he’s engaging with the largest organizations in the world and he doesn’t just come up with this

49:40 and he doesn’t just come up with this framework on his own right he obviously sees the need that organizations and we talk about and it resonates deeply how do you how does it make sense for us to use analytics in our ecosystem I like that I think this is a missed opportunity because to align to the adoption road map absolutely yeah I like I think that’s a great opportunity so okay you’ll have a single platform you have all these tools that you’re bringing together you have this in these environments and like why would you not align that to something that makes

50:12 align that to something that makes business sense to help us sell this to help sell it this is great I love this part this point is really strong here okay so let’s let’s take this out tease us out a bit all right let’s think about Matthew roach’s pyramid right so in the Matthew roach pyramid around different layers of data there’s personal reporting at the bottom of the pyramid there’s team reporting it’s like a group of people a smaller group of people there’s departmental and there’s Enterprise there’s basically four layers here so I would see this article should be written inside the four layers of

50:43 be written inside the four layers of what Matthew’s talking about right there should be four personas one for each of those groups and so there’s going to be Persona number one I’m an individual contributor I’m looking at my own stuff I’m focusing on collecting data that other people have made whether it’s workspace related or something else what do I use I think we would agree that person should focus solely on data flows I’m an individual contributor I’m getting a bunch of stuff I’m using desktop and maybe power bi. com the service maybe I

51:13 maybe power bi. com the service maybe I have a workspace that I’m participating in I’m doing all my work there to figure out what isn’t valuable in the data Next Level Up Team level reporting Team level reporting should be still focusing on I’ve been given data sources from the data engineering team the bi team I’m looking for other data sources that are non-existent in my bi generated tables there’s some extra data that’s coming from Salesforce or other Cloud providers I’m trying to get that data in I would also argue that that second layer a majority

51:45 argue that that second layer a majority that should be around focusing on data flows would you guys agree no I absolutely I think you’re getting to something where there’s absolutely here the doc the confusion is getting a lot clearer these roles in these situations are not exclusive they’re much they’re very inclusive and it’s usually going to be more than one person there should be a team involved with this who what part of the project and what part of the journey

52:16 project and what part of the journey is going to be done in in notebooks and and Spark what part of that process is going to be done in pipeline okay yeah let’s keep going up the tree like you’re saying Tommy like I like that so like the next term on this hierarchy that Matthew roach provided is okay now we’re talking departmental reporting what does departmental reporting look like okay departmental reporting takes things from teams and brings them to a department level someone’s part so we’re now getting to a point where we’re integrating more with a central bi team we’re now talking a bit more with the

52:47 we’re now talking a bit more with the center of excellence we’re looking at data sources that lived in the team Team level data flows that were supporting a small amount of data or not doing incremental refresh and now we’re having questions asked at the department level where we need more capability around incremental refresh or tables or data so the departmental team now starts incorporating pipeline activities so now let’s start let’s start loading lots of data into pipelines let’s start thinking about a medallion type architecture at

53:18 about a medallion type architecture at the department level maybe that makes sense there and then potentially some teams at the department level are even thinking about using Spark and notebooks they're doing their own data engineering

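To make that departmental pattern concrete, here is a minimal, hypothetical sketch of an incremental landing load in a Fabric Spark notebook. The path, table, and modified_at column are invented for illustration; the point is the watermark-driven append into bronze, with curation deferred to downstream layers.

```python
from pyspark.sql import functions as F

# Hypothetical incremental landing load for a departmental bronze table.
# Assumes a Fabric Spark notebook (where `spark` is predefined) with a
# lakehouse attached; the path, table, and column names are placeholders.
BRONZE_TABLE = "bronze_sales"
SOURCE_PATH = "Files/landing/sales/"   # raw files dropped by a pipeline

# High-water mark from the last load; None means this is the first run.
watermark = None
if spark.catalog.tableExists(BRONZE_TABLE):
    watermark = spark.table(BRONZE_TABLE).agg(F.max("modified_at")).first()[0]

incoming = spark.read.format("parquet").load(SOURCE_PATH)
if watermark is not None:
    # Keep only rows newer than what was already landed (incremental refresh).
    incoming = incoming.filter(F.col("modified_at") > F.lit(watermark))

# Append to bronze; shaping for silver/gold happens downstream.
incoming.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)
```
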
53:40 so at the department level the department is trying to grab data from many sources they have the need to load data from the Enterprise systems or other large cloud providers and consolidate it for that department I'm thinking like a sales department or marketing department right the department is buying data or web scraping data from a website they've got to bring that in so where does that go how do you land that in your department right because that department needs the competitive pricing of whatever your product is and then I think you go full Enterprise where at the Enterprise level we look at the

54:11 department level reporting and say which of the things that these different departments are building are now required to be reported at the whole company level right we're now talking about financial reporting regular sales numbers common dimensions and facts for customers and products right these are things that should be agreed upon across the Enterprise and calculated the same way so everyone's speaking the same language thousand percent does this make sense so to me

54:42 this is the grow-up story start with dataflows grow up into pipelines and eventually land at the top of the pyramid with Enterprise BI using a combination of pipelines notebooks and dataflows to load your data and the further up that pyramid you go the less you're doing in dataflows and the more you're lifting over to pipeline and Spark engines

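That tier-to-tool mapping is exactly the kind of thing a team can write down once and reuse. Purely as an illustration of a decision guide, the sketch below restates the hosts' rule of thumb as a simple lookup; the tier names follow the pyramid and the recommendations are illustrative, not official guidance.

```python
# Tier-to-tool lookup following the pyramid discussed above.
# Illustrative only; adapt to your organization's own decision guide.
INGESTION_GUIDE = {
    "personal":     ["Dataflow Gen2"],
    "team":         ["Dataflow Gen2"],
    "departmental": ["Pipeline", "Spark notebook (teams doing their own engineering)"],
    "enterprise":   ["Pipeline", "Spark notebook", "Dataflow Gen2 (light shaping)"],
}

def recommend_tools(tier: str) -> list[str]:
    """Return the recommended ingestion tools for a pyramid tier."""
    try:
        return INGESTION_GUIDE[tier.lower()]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None

if __name__ == "__main__":
    for tier in INGESTION_GUIDE:
        print(f"{tier:>12}: {', '.join(recommend_tools(tier))}")
```
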
55:12 I would even argue that with Dataflows Gen2 and I think we're going to see a lot more come from it you may have fewer actions done in Spark and at least pipelines it's really meant for data ingestion pulling together some structured tables while still allowing the business user where it's really not a data engineer scenario for that persona I think we need to spend an entire episode defining the actual developer personas that exist because of the ones listed here data engineer data integrator

55:43 business analyst I think there's too much overlap and a lack of clarity in terms of what that person is at an organization are they part of the BI team what skills and tools do they know dataflows to me are now upgraded beyond what we talked about for just reporting solutions and it can be that business user or citizen developer who knows the line of business and the logic and can actually push that to the lakehouse that can be

56:14 used in APIs it can be used in the Power Platform to connect they are that liaison between IT and the business yeah I think more so than Spark or the data engineer so I don't disagree with you Mike from the standpoint that teams or departments might be using different methods for data transformation and that's probably part of the story of how we grow things up but to me

56:47 the how part matters less than the actual object meaning the table of information and what it means and who uses it because if I'm reporting at a team level is that team level using the same source of information that we're using at the organizational level the answer is probably no so how do I get a report that is at a team level through those stages in this ecosystem to the point where there are possibilities to

57:18 use the same levels of objects that are created in OneLake and that's the important part having the organization understand that those are artifacts they should be using as opposed to just creating an environment where everybody can build their own thing so I guess my closing thoughts are I think Matthew Roche's grow-up story should be part of this to some degree and if it's on us to figure out once again how to do it the framework's there in much the same way that all of this is consolidated into one ecosystem

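To make the reuse point concrete, here is a hypothetical notebook snippet in which a team builds its slice on top of a curated table that already lives in OneLake instead of re-ingesting the source; the lakehouse, table, and column names are all placeholders.

```python
# Illustrative only: reuse a curated dimension that already exists in
# OneLake rather than re-ingesting the source system. Assumes a Fabric
# Spark notebook with the enterprise lakehouse attached; all names
# (enterprise_lakehouse, dim_customer, region, ...) are placeholders.
curated_customers = spark.read.table("enterprise_lakehouse.dim_customer")

team_view = (
    curated_customers
    .filter("region = 'EMEA'")                       # the team's slice
    .select("customer_id", "customer_name", "segment")
)

# Persist the team-level view as its own Delta table for team reports.
team_view.write.format("delta").mode("overwrite").saveAsTable("team_customer_emea")
```
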
57:48 I think the levels of difficulty should be part of documentation right like yes we're going to level up a business user make it easy for them when they interact with the technical documentation hey you're brand new this is a great place for you to start then you can talk intermediate and Enterprise yep and then overall I think the challenge for us is how do we figure out how

58:18 this unified platform is going to work with the structures that we have related to workspaces and these objects that we're creating and utilizing and want to reuse within organizations how does that all fit into place so that everybody's on the same page just because we create this environment that allows a whole bunch more users to do things doesn't mean that it's beneficial yet right

58:48 we're still going to have the same problems in analytics and data which is finding the sources of information that are the most reliable and most valuable to the organization and does that mean everything has to be Enterprise no it absolutely does not but it does mean that we need to approach this in a way that the organization understands how to utilize the systems that we're just automatically opening the doors on mm-hmm mm-hmm

59:20 awesome so as we wrap here I'll close out with ChatGPT and I think this answer from ChatGPT is like 100% spot on especially for what we were talking about today so I asked two questions to ChatGPT the first one was what is a data engineer and what skills should they have to be effective at a large organization so let's just define data engineer verbatim a data engineer is a type of software engineer so my understanding would be this is a person who's studied software engineering or software development or

59:50 computer science or something along that realm they work with sets of data to create data pipelines and processing software and they are responsible for preparing data for analytical and operational uses ensuring quality reliability and security I agree with that 100% a data engineer is thinking about regularly getting data and making sure that it loads regularly here's the list of skills proficiency in programming languages such as Python Java Scala or SQL knowledge of data structures algorithms and design patterns

60:20 experience with big data Frameworks such as Hadoop Spark Kafka and Airflow you wouldn't need all of those but Spark would be the one I'd want to make sure they had skills on right familiarity with cloud platforms AWS Azure or GCP which if we go back to our argument from earlier Barry our citizen developer I would argue does not need to know all the different clouds because that's not a data engineer and I think that's part of the thing don't start confusing people who work with data as data

60:50 engineers they're not you're going to lose people correct and then the last couple points for the data engineer understanding data quality understanding governance and security principles and best practices you're now talking about the scope of delivery of the data in addition to just producing tables of data okay the next one was a question around what is a citizen developer and what do they need to be effective in a large organization this one I really like a citizen developer is a tech-savvy end user

61:22 with the ability to create new software features or application programs from an approved corporate or cloud-based code system so basically someone else has built something and I'm building on top of it they are empowered business users that create new or change existing business applications without the need to involve an IT department they are using low-code or no-code solutions and platforms that use visual tools and pre-built components to

61:52 create and update applications and I would argue not applications create or update data solutions right that's how I would change that phrase they have business acumen and domain knowledge they are creative and innovative in designing user-friendly data solutions they have a basic understanding of software development best principles and practices they are familiar with I don't know what LCNC platforms are I've never heard that one before I probably should know what that is

62:22 I just never heard of it but they're familiar with okay the abbreviation is LCNC low-code no-code so apparently that's an abbreviation I should know so they are familiar with low-code no-code solutions that are sanctioned and governed by IT another key point here right it's like here's things you can use given to you by IT the ability to test debug and maintain applications and good communication skills to work with other IT

62:52 professionals and data engineers to develop what you need for reporting I thought that was really spot on so I think the article was okay I would have incorporated more about the grow-up story and we probably would have rebuilt these scenarios so that they formed a grow-up story from a personal reporting solution all the way to Enterprise reporting solutions that to me is the story that we're missing for this one

63:22 excellent with that we've burned through a perfectly good hour of your time we appreciate your time thank you all so much I had a lot of fun talking about this one and a great time beating up this process I really liked the thinking here but with that all we really ask is we really love our audience you guys are so great and wonderful in the chat love hearing all the conversation there we really appreciate you our only ask is please share with somebody else if you thought this was good if you felt there was some insight here that you liked please share it on social media

63:52 or with someone in your organization or talk to someone else about the podcast Tommy where else can you find the podcast you can find the podcast anywhere it's available YouTube Apple Spotify make sure to subscribe and listen to all of our previous episodes we're doing this little series on that Decision Guide listen to our previous one listen to the next one and don't forget if you want to have us talk about a certain topic in Fabric go to PowerBI.tips slash podcast and we have a mailbag submit what you want us to talk about

64:24 awesome thank you all so much and we’ll see you next time

Thank You

Thanks for listening. If you enjoyed this episode, subscribe to the podcast and check out PowerBI.tips for more templates, themes, and Fabric guidance.
