Layered Data Architectures with Veronika Durgin
Shane Gibson and Veronika Durgin delve into the world of data, discussing layered data architecture, data management, and the challenges that come with them. Veronika, a seasoned data professional, shares her journey in the field, her transition from SQL Server to Snowflake, and her current interest in Data Vault.
They emphasise the importance of understanding your application domain and finding a small set of technology patterns to extract data from various systems. They discuss the challenges of integrating software engineering with data analytics, arguing that software engineers should focus on developing capabilities for applications and devices, leaving the analytics to data professionals.
They delve into the concept of ELT (Extract, Load, Transform) and its advantages over traditional ETL (Extract, Transform, Load) methods. Veronika supports the ELT approach, advocating for loading data first and then determining what needs to be transformed.
The conversation moves to the topic of data lakes, where they prefer to keep the data raw and versioned, matching the structure of the source system. They discuss the effort required to model data and the value of doing so over time. They also touch on the idea of a “data lakehouse,” merging file storage with cloud compute.
Veronika highlights the significance of data provenance, stating that when people know where data comes from, they tend to trust it more. Shane agrees, adding that stakeholders often trust individuals over processes. They also discuss the structure of data systems, from the raw data layer to the modeled layer, and the importance of defining these layers and their policies.
The podcast concludes with a discussion of the value of automation in data handling and the importance of balancing speed and value in data delivery. They also touch on automating as much as possible in data modeling and the need for clear data governance.
- Layered data architecture can help streamline data handling and analytics.
- Software engineers should focus on developing capabilities for applications and devices, not analytics.
- The ELT approach to data handling is preferred over traditional ETL methods.
- Automation is crucial in handling large volumes of data.
- Balancing speed and value in data delivery is essential.
- Data contracts and basic rules can help streamline data handling and prevent issues.
- The use of commercially available or open-source ingestion engines can be more efficient than custom-building one.
- The diversity of systems necessitates a mix of extraction technologies.
- Clear rules and policies are essential for data protection and accessibility.
- Unfettered data sharing can lead to offline data marts, which need to be monitored.
- Understanding data provenance can increase trust in data.
- Stakeholders often trust individuals over processes.
- Defining data layers and their policies is crucial for effective data management.
Read along you will
Shane: Welcome to the Agile Data Podcast. I’m Shane Gibson.
Veronika: Hey Shane. I’m Veronika Durgan.
Shane: Hey Veronika. We’ve got a good one today. We’re gonna deep dive into this idea of layered data architecture. The reason I asked you to come on the show was you did a great post in the data quality Slack group around the architecture that you’ve been running.
I wanted to go through and talk about what it is, what the layers are, what they do, maybe some alternatives you thought about, ones that tend to work and some that don’t. But before we rip into that, why don’t you give us a bit of background about yourself and your journey in this world of data.
Veronika: Yeah, certainly. So I’ve actually spent my entire career working with data. I started in 1999, and I think why it’s important, maybe slightly funny, is because I started as everybody was freaking out about Y2K, and I had no idea what it was or why it was important. But it was interesting to observe people stressing over it, and then nothing happened and the world moved on and it was great.
So I fell into data. My undergraduate degree is actually in biology. I was pre-med, but never really had enough passion to go into medicine, which is okay. So I landed a junior DBA role. Loved it. Absolutely loved it. Loved data, loved SQL Server. I ended up going to get a master’s degree in software engineering, because back then there was no formal education in data.
I’ve spent my entire career in data in various different roles. I was a DBA for quite a long time, focusing on performance tuning and optimization. I have a real deep understanding and appreciation of how well systems can run when their databases are designed well and given that TLC, tender loving care.
And lately I’ve transitioned out of SQL Server into Snowflake and more into analytics. Still love it. Data is my work and my hobby. Love Snowflake. I’ve been really enjoying Data Vault lately as well, a lot of passion there.
Shane: I was around for Y2K as well, and I was gutted that I didn’t realize I should have become an SAP consultant and traveled to Europe and made an absolute fortune doing Y2K projects at horrendous rates in those days.
In fact, there’s still good rates today. That was back in the days when we had Oracle and we had SQL Server 7. We had the on-prem databases, we had those massive constraints where we couldn’t put all the data in. We couldn’t run any query we wanted because those things just didn’t have enough horsepower.
And so typically the DBAs were grumpy, because they wanted to protect their baby. They couldn’t let their database go down. So were you a grumpy DBA, or were you not?
Veronika: I wasn’t a grumpy one. I would actually call myself a lazy DBA. I always wanted to just automate things, or do it just once and never do it again. I said my ultimate goal in life is to automate myself out of my job, and then I can sit back. But to your point, I spent quite a few years basically doing code reviews.
And the software engineers were a little bit scared when I walked by their area. I would see them just kind of duck under their desks. They didn’t want to get yelled at. No, I think I was nice, but also, as part of my job, I used to teach and help others get better at writing SQL.
So I want to say I was nice, but I don’t know.
Shane: I remember when I was at Oracle. So I worked for the company itself and we had a great consulting team, and there was a lot of database usage at the time. There was a consulting DBA, Craig, and he was grumpy, but he was brilliant. He was grumpy because typically he got parachuted into a customer that was having performance problems with the database.
And I remember he had this toolkit, I think about it as a disc in his top pocket, although it wasn’t at the time. First thing when they parachuted him in, he’d run all these automated scripts over the database. That would give him insight in terms of what was running, and then what he tended to say was, all the code’s crap.
Rewrite your code. There’s nothing wrong with the database. Sometimes he could re-index it and repartition it, there were some things he could do, but yeah, he always blamed the code. But that level of automation, that’s what, 20, 25 years ago, people were automating that drossy work, and I saw us in the data world start to do that.
And then the big data Hadoop bollocks came out, and then the modern data stack, and we moved away from automation back to handcrafting. What’s your view at the moment? Do you reckon we are moving back to the world of automation again? Or are we still in the decentralization, self-service, craft-your-own-code mess, and let’s see how it goes?
Veronika: oh, there’s so many dimensions to it. I think back then we had to be very careful and also creative because like you said, we were limited by hardware. We were heavily limited. We invested in servers and that’s all we got for whatever, three to five years. So we had to be very careful. We had to do more with what we had, and then cloud happened, and then everybody stopped caring.
It was like, oh, just, throw more power at it. Throw a bigger box. It’s fine. I think we’re back to, oh, but wait, it’s expensive. And now , how much more power can you possibly throw at this vast amount of data that we wanna crank through? I think it’s part of it, we are forced to be creative when we’re desperate.
And I think we’re getting back to that point where we can’t afford to do things manually. We have to be very conscious of how much money we spend. And it’s not that we wanna necessarily spend less, we just don’t wanna spend more. I think it’s a full circle. We came back to, okay, now we actually again, need to care how we use it and what exactly we use to make sure that we’re running optimal things, that we’re actually getting value.
Cuz at some point it’s a net zero game, right? If you’re gonna invest all of this and you’re getting nothing back, then why are you doing it?
Shane: I definitely agree with that theory of constraints. Once we were constrained by hardware; we’re now constrained by dollars for people, so we don’t have the teams that we were allowed to build out and be effectively lazy with. Those constraints force us to optimize.
And there’s various ways we can do that. One of those ways is to actually have an architecture, a data architecture, to have some constraints about what data can go where, what we can use it for, what we can’t. Which we lost with the big data Hadoop era, where we said, just dump it all, let anybody write any query and the machine will take care of it.
Which, funnily enough, proved not to be true. When you think about a layered data architecture, and I wanna differentiate for now between the detailed data modeling and the idea of these layers, I think of it as a cake or a bow tie or a series of steps. In your head, the data architecture that you described on that Slack.
Do you just wanna run through that and talk to me about the layers, what they are and what they do?
Veronika: Yeah, certainly. And maybe throw in some other things, potentially controversial. I don’t know if it’s cool anymore or not, but I am a supporter and believer in ELT. I’m a little bit of a data hoarder. I don’t like throwing out data just because I don’t think I need it now. So I truly believe you load it first, and then you figure out what you need to transform, and how, and when.
Also, being a DBA, I always loved to have a plan B. So I’d like to have a copy of pre-transform data, because if I make a boo-boo, I can always roll back. So with that in mind. And then I guess it’s partially personality and maybe partially the experience of living through the past two-plus decades.
The way I look at data is, again, following the ELT approach. Data is generated in source systems, applications, devices, et cetera. It is generated there to support the capabilities of those applications and devices. It is not meant for analytics. I know there’s this whole thing, but software engineers store data to support applications and capabilities and devices.
It is not the same as analytics. So as this data lands in whatever data platform you use, pick your own, to me this data is good to have there just as it comes. Unchanged. And honestly, let’s talk about ODS, right? An operational data store, you’re essentially offloading some operational reporting.
Again, applications and databases that are designed to support applications are not the same as supporting analytics. So to me, there’s tremendous value in moving this data just as is into some platform that can support that. I consider that to be silver data. Again, it doesn’t matter, it’s semantics, but to me what it means is that we know where it came from.
We know which application it came from, but we don’t necessarily know all the details about this data. Nor do we need to, we don’t have to model everything. Not all data is necessary for downstream analytics, but again, there’s tremendous value to it. Then some of this data gets modeled for your data warehouse.
And again, pick the pattern that you like, enjoy, whatever. Again, I’m a huge fan of Data Vault right now, and we can dive into that later, but that’s what I consider gold data, because we actually know exactly everything about it. We modeled it to reflect the business. We cleansed it, we caught all the data quality issues that we had to, there’s lineage, there’s a data dictionary.
To me, the data warehouse again becomes that kind of gold data. There’s an analytics team that supports it and guarantees it, and relationships are built between the analytics teams and the business people that generate application data. And then of course there’s data that I call bronze, because we’re going into this decentralized analytics space. I can’t remember who said it, and I wish I remembered:
centralized data, decentralized analytics. So these analytics teams need the ability to experiment, move fast, not necessarily productionize, but do whatever they need to do to discover, to experiment, test. I consider that data to be bronze. They mix and match, maybe they load some files, who knows?
So that data’s still there. They know what it is, data teams don’t. So that’s how I think about these layers. Again, it’s more semantics, cuz to me logically it makes sense. It’s also the quality of data and who supports it, where would you go to ask questions about it?
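As a rough sketch of that silver layer, landing a record exactly as it arrives and adding only provenance columns, something like this (the function and column names are illustrative, not from the episode):

```python
import json
from datetime import datetime, timezone

def land_raw(record: dict, source_system: str) -> dict:
    """Land a source record unchanged, wrapped with audit metadata.

    The payload stays exactly as the application produced it; we only
    record where it came from and when we loaded it.
    """
    return {
        "payload": json.dumps(record, sort_keys=True),  # the record, untouched
        "source_system": source_system,                 # provenance: which app
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }

row = land_raw({"customer_id": 42, "name": "Acme"}, source_system="crm")
```

Because the payload is never transformed on the way in, it can always be reconciled back to the source and replayed if a downstream transform makes, as Veronika puts it, a boo-boo.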
Shane: I really like those two lenses: the quality of the data, how much we trust it, and who supports it. Your use of bronze, silver, and gold is very different to the standard in the market, so for the audience, I just wanna clarify that a little bit.
We’ve seen Databricks adopt the bronze, silver, and gold terminology, again for patterns. It’s typical that the raw data, the data that’s been landed from systems of capture that we haven’t done much to, is called bronze. The middle layer, where we’ve modeled it, we’ve cleaned it, we’ve done some stuff to make it fit for purpose, that’s typically called silver.
And then the presentation layer, the little cherry on top, is typically called gold. And then your definition of discovery, which you call bronze, where we’ve got a bunch of analysts throwing data in and prototyping and discovering stuff that we haven’t automated yet, that typically just gets called a sandpit.
It’s interesting that you are using slightly different terms for different things. But the key thing I always say to people is it doesn’t matter which terms you use, just write them down to be very clear. So when I’m in your organization and I hear silver, I know that I’m in the raw area,
I’m in the history or the persistent staging or whatever term we’re gonna use for it. That’s silver.
Shane: So let’s just go back, and we’ll go step by step through those layers. So we’ve got the systems of capture. We’ve got software engineers doing their hard job.
Then we’ve got ELT to bring the data into a lake, and then we model it and consume it, and then we’ve got some kind of discovery area where analysts can go in and do the bit that they need to do. And we’re talking about a team design, typically, where we have a centralized data team and decentralized analysts, which is very common.
So let’s look at that software engineer first. There’s this whole move to data mesh, this whole move of let’s take this domain of data and the things that we do, and let’s push it back to those engineers. And it’s very unclear the patterns that need to be adopted. Sometimes I read, we’re gonna get the engineers to do the job.
Sometimes I read, we’re gonna parachute a data engineer into the software engineering team and they’re gonna be a team of one. Sometimes I read, we’re gonna have a data engineering team and they’re sitting right next to the software engineering team, but we’re still handing off. And if we look at that first one, pushing the work back to the software engineer, that’s really what we want.
Because in the nirvana world, when they create their applications, if they create data that’s fit for analytical purposes, we’re done. Actually, we don’t have a job. But that’s okay, cuz the consumer, the stakeholder, can now use the application and query the data for an analytical purpose.
But the problem is, if you are a product owner, a product manager for the application, and you’ve asked them to build a couple of new widgets on a screen for a new flow to onboard some users and the software engineer comes back to you and goes we’ve done all the screens, we’ve done the flows, we’ve all tested, we’re ready to release out to the audience, but we just need a couple more weeks to do the data analytics work.
Nine times outta 10, that product manager’s gonna prioritize, push the product to the customer, get the value, do the data analytics work later, because they don’t really treat that data analytics work as a first class citizen. And if your corporate culture is not to have data as a first class citizen, if you’re not gonna make that trade off decision where you’re gonna wait for the value to be delivered until the analytics value’s being delivered, then it’s never gonna work.
What’s your view about pushing our data work into the software engineering teams?
Veronika: I actually don’t think software engineers should be doing analytics work at all. I lived for a very long time side by side with software engineers. They’re good at developing capabilities. Unless your company is in the business of data, like Netflix, your software engineers should not be building analytical solutions.
That’s not their core capability. We’re trying to jam something into them that they’re not good at. They’re good at writing software, a GUI, a website, whatever that may be. That is not data for analytics. So I’m completely against it. It’s like somebody asking me to write software all the time.
The software that I would write would be absolutely horrendous. You don’t want me writing software, I’m not good at it. I’m good at data. And the other side of it is, even if you drop a data team side by side with a software engineering team, where does data integration happen? Analytics needs to happen across all of your data.
Say I have 20 applications that have data about manufacturing cars. If I drop 20 data teams into each application, how do I integrate it? I need to tell a story about the whole car, not just the rubber on the tire. We’re probably gonna jump all over the place, but I am not exactly
sold on data mesh as the right thing to do for every company. I think it is something that a very large company with many data teams has to do, but for most of us, I think it’s just an excuse to not do data.
Shane: I agree. I think there are patterns that fit you as an organization in your context, and patterns that don’t. And as we know, as soon as somebody comes up with a semi-new pattern, the whole industry and all the consulting companies jump on it and flog it. I would say, though, that I believe we could co-locate our data engineering squads, teams, pods, whatever you wanna call them, next to our software engineering brethren.
And then we know that working closely together as a team, we get value. And yes, I agree that the consolidated reporting, the ability to get a single view of customer or single view of product, that cross-domain reporting, which is where we spend most of our time, has never been answered in the data mesh paradigm,
it’s like a data product on top of a data product, on top of a data product. And we’re seeing how that goes. But I’m also a fan of Data Vault, like you. So for me, it’s still the best modeling technique for that middle area, for that gold modeled area, that I’ve found. I’m hoping there’s a better one coming at some stage that solves some of the problems.
But of all the modeling techniques out there, it’s the one I think does the best job. And I keep thinking, if we push the work back into a data team that sits with the software engineering team, Data Vault gives us a bunch of patterns. So for example, we have the idea of a core concept: a concept of a customer, a concept of a product, a concept of an order, a concept of a payment. In Data Vault we treat that as a hub.
We go and say, look, there’s a key, I have a customer ID and that’s unique, and the first time I’ve seen it, I’ve seen it. And then after that, it’s just a bunch of detail about it. And we have a bunch of patterns for how we integrate those. So we have the ability to lightly integrate, where we get told all the keys in the organization are the same across all the engineers, which would be lovely, but we know is never true.
And then we have other ways of upping the patterns that we use when we get that uncertainty. But in theory, we could get that data team sitting with the software engineering team to use the same patterns to define hubs, which then allows us to integrate across the mesh teams with minimal effort.
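A minimal sketch of that hub pattern, assuming an in-memory dict stands in for the hub table (real Data Vault implementations hash the business key and store hubs in the warehouse):

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """Deterministic key for a hub row, derived from the normalized business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def load_hub(hub: dict, business_key: str, record_source: str, load_ts: str) -> None:
    """Insert-if-new: the first time we see a key it goes in; after that, no-op."""
    hk = hub_hash_key(business_key)
    if hk not in hub:
        hub[hk] = {
            "business_key": business_key.strip().upper(),
            "record_source": record_source,  # which system first told us about it
            "load_ts": load_ts,
        }

customers = {}
load_hub(customers, "CUST-001", "crm", "2024-01-01")
load_hub(customers, " cust-001", "billing", "2024-02-01")  # same key, other system
```

Two systems reporting the same customer ID collapse to one hub row, which is the light integration Shane describes; when keys genuinely differ across systems, heavier patterns such as mapping tables take over.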
Veronika: I would make it much simpler. As a software engineering team, you don’t have to create hubs. Just give me all the keys that are required that make it easier for me to tie all the other data together. I think that’s it. We solve two basic things. One is to add all the dependencies that I need to be able to easily tie data across the company.
And the other is, if you wanna do product analytics, certain things, measures, KPIs, whatever that may be. That’s honestly all I would ask of software engineers there. You’ve lived through NoSQL, right? NoSQL is gonna take over, relational is gonna die. That just shows you that software engineers are good at software, not at data.
So yes, we ask them to do basic things, and whether data engineering teams are co-located or centralized doesn’t matter. If we’re talking specifically about software engineers: here’s the three things that I need you to do to make downstream life easier.
Shane: It’s funny you’re saying NoSQL. We’ve got a customer on our platform and they’re a startup, building out their own application. And they’re using a NoSQL database and it’s all good. We’ve got the data coming in, and then one day we had some of our anomaly alerts go off.
And we look at it, and it looks like the product codes that used to be a GUID have all of a sudden become text, just natural text strings. And you sit there going, that can’t be true. So you go through and you look and you go, okay, obviously there’s a schema mutation we’ve missed and we’re picking up a different field.
And we went and looked, and no, the schema is exactly the same. And what we found out was the engineer had actually re-keyed their entire database and changed the key structure without telling us. So there are a bunch of immutable patterns that, as data people, we want our software engineering brethren not to break.
But if you talk to an organization about data governance, it always seems to be committees and data quality and catalogs and all this big data governance stuff. One of the first principles or patterns we should put in is, here’s how we define a customer. Define it that way, or tell us you’ve got an exception.
Here’s how you’re gonna store certain attributes, cuz we need them. If you’re not, let us know, cuz we’ve got some more work to do. There should just be a small set of immutable principles or policies or rules that you can’t break. Do you find that organizations don’t set those basics in place, and they tend to try and boil the ocean?
Veronika: That used to be the case. So when I was a DBA, there were literally five rules about not breaking schema. You can’t rename a column. If you want to, it has to be a backwards-compatible change, so whenever you deploy, it has to be backwards compatible, which means you can’t rename, you can’t change a data type.
It has to be a multi-release change. There’s a handful of these, and those are hard, laid-down rules. And I think what happened, and I’m sure you’ve heard it too, is, oh, you guys are a bottleneck, you’re slowing us down. That’s how NoSQL happened. But then you probably also heard this: oh, we can only pull data about 10 customers out of our database before it times out.
I’m like, I bet you’re running NoSQL. So I think we’ve relaxed. We’ve been seen as bottlenecks, data people were gonna be out of jobs, because it was so easy to dump data without worrying. I think we’ve come full circle, so the data contracts conversations are starting. But to your point, if I have to read a five-page document, I’m just not going to do it.
Give me five hard rules, that’s all. It’s really as simple as that.
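Those five hard rules lend themselves to a simple automated check. A sketch, assuming schemas are represented as column-to-type mappings; the rule set here is illustrative (no dropped or renamed columns, no type changes, additive changes allowed):

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List contract violations between two schema versions ({column: type})."""
    problems = []
    for col, typ in old.items():
        if col not in new:
            problems.append(f"column dropped or renamed: {col}")
        elif new[col] != typ:
            problems.append(f"type changed: {col} {typ} -> {new[col]}")
    return problems  # empty list means the change is backwards compatible

old = {"product_code": "uuid", "name": "text"}
new = {"product_code": "text", "name": "text", "sku": "text"}  # re-keyed to text
violations = breaking_changes(old, new)
```

This is exactly the GUID-to-text re-keying from Shane’s anecdote: the schema shape looks the same, but the check still flags the type change, while adding the new `sku` column raises nothing.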
Shane: If we look at the way software engineers work, they automate a lot. They automate their testing, they automate the deployments of their services. They use libraries to make themselves faster when they’re developing. So I’m always intrigued that we’ve not come up with a set of patterns where we say, here’s a library.
Deploy that. Call it whenever you’re changing the application, and it’ll tell you if you’re breaking a rule we care about. And if it does, don’t, because we’re gonna get the same alert. That way, they know what they can do and what they can’t do. And they’ll typically do the right thing, cause people typically do the right thing if they know what it is.
But at the moment, we’re after the fact. We don’t engage with them. We don’t tell ’em what’s important. They don’t know that the break that they’re doing is costing us two weeks of redevelopment.
Veronika: And maybe that’s where you were saying where data teams sit close to software engineering teams. It’s again software side. It’s not tools, it’s just people and processes. If you sit together as one team, you collectively want to succeed as a team.
So maybe observing the pain that things like this cause will actually encourage you to do the right thing.
Shane: And then the challenge, of course, is if you decentralize your data teams to sit next to the software engineering teams, what you lose is the rigor across all your data, because now your data teams are effectively isolated. So you’ve gotta figure out how to run a matrix model or guilds and all that kind of stuff.
Team design is key. Whenever I’m working with an organization, the first thing I say to them is team design first. Figure out where your teams are gonna sit, how the comms lines are gonna work, and then after that we can talk about platform architecture and data architecture. Because Conway’s Law basically says we’ll revert to the way the organization flows.
So let’s talk about the next step then. We’ve got our software engineering friends all sorted. They finally give us good quality data that meets our needs, cuz we gave them a set of rules. And then we’re gonna bring the data into the data lake, and we’ve got that whole ELT versus ETL versus ETLT debate. I break a rule of agility when I talk about this, because if we’re being truly agile, what we say is we only touch data that has value.
We only touch data when we need it. And the reason is because as soon as we touch data and we are moving it, there’s an incremental cost to the organization in maintaining that. So if there’s a field and we don’t need it, don’t bring it. Because if the software engineer changes that field, we don’t care, because we’re not using it.
I don’t actually run that way. And I’m like you, I effectively will bring all the data in and land it into the lake. And we call it history. And I’ll land it in raw, I’ll touch it as little as possible. And I do that for a couple of reasons. One is it makes me more agile further down the value stream.
What we know is, when a stakeholder says to us, oh, I’d like this bit of information, and we haven’t collected it before, that cost of collection is always expensive. It takes us time, it takes us effort. If we’re already collecting it, then it’s in our domain, it’s in our platform. We can now do all the work forward and we are much faster. So I think by bringing it into the lake and having it sitting there available, it makes us much faster and more agile in delivering to our stakeholders. The second thing is it forces us to automate. If I’m bringing 10 fields in, or 10 tables, I’m probably gonna do it semi-manually.
If I’m bringing in 700 tables, I’m gonna automate the snot out of that work, cuz I ain’t doing it 700 times. And the other thing is we can roll back. We have this trusted immutable lake of information that doesn’t change.
And that’s the key, right? We can’t change it. It’s the history. It’s what we saw. Again, one of the principles of Data Vault is bring it in and don’t touch it. Always be reconcilable back to the system of capture you got it from, before you do bad things to it in your model, in your gold layer.
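Automating 700 tables rather than hand-crafting 700 jobs usually means one metadata-driven loop. A hedged sketch, with `extract` standing in for whatever pulls rows from the source and `sink` standing in for the immutable history layer (all names are illustrative):

```python
def ingest_all(tables: list, extract, sink: dict) -> dict:
    """Apply one generic load pattern to every table; record failures for alerting."""
    results = {"loaded": [], "failed": []}
    for table in tables:
        try:
            rows = extract(table)
            # append-only history: new rows are added, existing rows never change
            sink.setdefault(table, []).extend(rows)
            results["loaded"].append(table)
        except Exception as exc:
            results["failed"].append((table, str(exc)))  # feeds the alerting
    return results

def extract(table: str) -> list:
    """Stand-in extractor for the sketch."""
    if table == "orders_broken":
        raise RuntimeError("source schema changed")
    return [{"table": table, "row": 1}]

sink = {}
summary = ingest_all(["customers", "orders_broken", "products"], extract, sink)
```

One table or a hundred, the loop is the same; what matters is that a failure raises an alert instead of silently dropping a feed.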
Veronika: I agree with you. But it’s an interesting thing. There’s the agility of only working on what’s been asked, which is I think how traditional dimensional data warehousing was: only deliver what’s been asked. But then the whole idea of a data product is actually expanding it,
so it covers more than just that specific use case. And then we’ve fallen now into the Data Vault pattern, where it’s, you’re already looking at this data, pull everything related to it. So I usually say it depends. I hate being that person and saying you have to find the balance, because you also don’t wanna spend hours and days looking at the entire source system and pulling data that you don’t necessarily need right now.
And I think that’s where experience comes in. We have to keep practicing to find that balance. Cuz I get pulled, even by my teams: we have to move fast, we have to be agile. But I’m like, that’s not an excuse. You can spend four days on something instead of three and deliver a lot more value than that specific tiny use case that you’re being asked to do.
Shane: And it is a balance, something you’ve gotta balance, and there’s a bunch of patterns there, as always. If you’re hitting that table and it’s got unstructured blob objects sitting somewhere in there, and you don’t need them, and you haven’t built a pattern for that yet, you’re probably not gonna bring them through.
But if you know that system of capture is not tracking history, it’s not actually keeping the state of what things were, then you’re probably gonna bring it through. Because that’s the question you always get asked: as soon as you’ve been asked how many, why is the next question?
And that involves the change state. And the key thing is to automate it where you can. Then you don’t care whether it’s one table or a hundred tables, you don’t care whether it breaks or not, because you’re gonna get alerted. So automate all that drossy work, and then volume doesn’t matter to you as much.
Veronika: Hundred percent. Hundred percent. And the other thing is, back to your question about build it yourself versus buy. I talk about this a ton. For the company you’re working for, if you’re custom building, say, an ingestion engine, is there value for your company in that?
Because there are a ton of commercially available, open-source, whatever ingestion engines built already. So you’re reinventing the wheel, but what is the value? So just bring something in. Again, it’s like build versus buy. And as engineers we’ll always want to build, but at some point, build something that’s actually valuable to your current company, to your current business, whether it’s your own or not.
Don’t reinvent the wheel.
Shane: Yeah, it’s weird that in the data domain, data engineers love to build the same thing from scratch every time. Somehow it’s in their mentality.
Veronika: it’s like definition of insanity. If you haven’t succeeded the first time, you’re probably not gonna do it better the second time. Just bring it in, build something else. There’s plenty of space to innovate.
Shane: Yeah, there's lots of gaps where we can't buy off the shelf. And it's the same as replacing components. Oh, I'm running Airbyte now, I'm gonna change to Dagster. Why? Airbyte's not really working for me. Why not? What's it missing? What's it not doing? Because we tend to just like to re-engineer things, because we think the next one's gonna be magical.
So that ELT, one of the challenges we have is the diversity of systems we hit. We still have on-prem systems around, and we have systems where, you know, the only way to get the data is to go into the database. There are systems where there's value in hitting
the redo logs to bring data in via change data capture technology. There's ones that have APIs, there's ones that don't. I'm still not seeing a lot of systems that have APIs that can handle the type of extracts that we want. You've got the big ones like Salesforce, where they have the mass of hardware and engineering that means we can treat the database as if it was an API, pretty much.
But that's fairly unique. Typically, most of the other systems, if they have an API, it's throttled, and it only gives me the current state. Is that what you are seeing? That we have this nirvana of pull everything via an API, but the reality is we're always gonna end up with a mash of technologies and extraction technologies.
Veronika: Yeah, that's what I'm seeing. Exactly. All over the place. You still have your own on-prem databases, so generally that's where you wanna do some sort of log-based CDC thing, to not overload your systems. Then there's various SaaS applications. Some of them are more mature, where you actually have a stable API you can call.
But again, they're throttling, so it's all over the place. Some say, well, download stuff into an Excel CSV and upload it yourself. What I'm also seeing is that a lot of companies are saying, if we're gonna use your tool and we're gonna exchange data, this is still our data. You have to make it easy for us to get it back.
And I think now it's actually being pushed into contracts. So if you're not making it easy for us to get our data back, I might reconsider purchasing your tool. So I think collectively it's getting better, or at least I would be optimistic. But you're right, it's all over the place.
Shane: That data sharing is gonna be an interesting market, especially because we've got that zero-ETL bullshit coming up now. And I'm gonna call bullshit on the term, because zero ETL is not a thing unless you are not transforming it and you're not loading it. I see absolute value in the cloud analytics database having access to a system of capture without us having to write any code.
That's massively valuable, that's taking a whole lot of work off us, which would be great. But it's still extracting. It's still loading.
Veronika: Oh, absolutely. I can't, I just can't, reverse ETL. ELT, LTE, okay, let's come up with your own letters. We don't have to use the same three letters. Let's be creative here.
Shane: Yep. So my recommendation always is: figure out your application domain. What kind of applications do you have? Try and do a segmentation or a cohort analysis where you can put them into buckets. Try to find a small set of technology patterns you can use to extract data, and then apply them to many of those systems.
That makes sense. And then as soon as you hit a new system that you don't have a pattern for, be aware that's gonna hurt: you've got new technology, new patterns, new learnings. It should plumb into all your observability, all the same patterns that you've already got, so you know the first time you hit that new one that you've got a problem.
Veronika: When you're negotiating with a new vendor, ask them; drive that conversation. Don't just take it, cuz they might actually be just as willing. Cuz it's hard for them too if they have to write some sort of custom extract for you. So have that conversation with the vendor; get your engineering team and theirs together.
I’ve also observed in general that a lot of vendors are willing to work with you to just standardize as well. They’re having the same pain as we are on the receiving side of it.
Shane: Our stakeholders go out and spend money on a new capability that only delivers half of what we need, and then tell us it's the data team's problem to deliver the other half, where we had no input. So that's a good point, right? Try and work with those stakeholders to maybe say, if both applications are suitable, the second one has better integration and will save us a lot of time and money.
So we get that data and we bring it into the lake, which is our first repository of data. And this is where it gets interesting, because I'm a fan of the data in that silver, that raw, that lake area matching the structure of our source system. I tend to land it, I tend to version it, so I get change tracking of changed records, cuz it has some value.
I'll do a little bit of metadata work on it, I'll augment it slightly because that has value for us, but I won't touch the data. It's raw. It's an immutable version of what that source system looked like. Whereas in Data Vault world, we're meant to vault it at that stage. And I don't, and the reason I don't is it takes a lot of effort.
I have to go and do some design, and I don't want to front load all that design work; I wanna get the data in and then do design over time, still using the vault methodology. What's your view? What does your data lake layer look like? Is it source system specific, or is it modeled?
Veronika: So the raw layer, again, to me, and I don't know whatever terminology we should come up with, a data lake like that is an ODS. When you say you version it, it's insert only, right? It's an ODS because it's matching your source systems.
I'm a hundred percent on board with that. You can load all of your data like that. Not all data needs to be modeled. So again, I love Data Vault, but we don't need to model everything into Data Vault. You're right, it's a lot of effort upfront, especially if you're just starting; you can't possibly model everything. But there is tremendous value in doing, whether it's ad hoc analytics or operational analytics, on that raw data lake. You're basically offloading the source system onto your data lake, which can handle various analytics.
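The insert-only, versioned raw layer they're describing can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation; the function name, the `_loaded_at` column, and the in-memory list of dicts standing in for a table are all hypothetical:

```python
from datetime import datetime, timezone

def load_raw(existing: list[dict], extract: list[dict], key: str) -> list[dict]:
    """Insert-only load: never update or delete. Append a new version of a
    source row only when its payload differs from the latest version stored,
    so the raw layer keeps an immutable history of what the source looked like."""
    latest: dict = {}
    for row in existing:  # most recent version wins per business key
        latest[row[key]] = row
    loaded_at = datetime.now(timezone.utc).isoformat()
    out = list(existing)
    for row in extract:
        current = latest.get(key and row[key])
        payload = dict(row)
        # compare only the source columns, ignoring our added metadata
        if current is None or {k: current[k] for k in payload} != payload:
            out.append({**payload, "_loaded_at": loaded_at})
    return out
```

Re-loading an unchanged extract adds nothing; a changed row lands as a second version rather than overwriting the first, which is what gives you the "what did the source look like then" answer later.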
Shane: There's things we can do as we start to figure out what we want to build. We can profile it, we can apply a contract to it that tells us early if it's breaking, there's a whole lot of things. So I'm with you. And if we look at what's happened now with the data lakehouse, where we're merging file storage with cloud compute to compete against the cloud analytics databases, perhaps we should actually rename this architecture: a vault house, a vault lakehouse.
Veronika: Let's just combine ELT in some other new way, TLE or something, I don't know. What I'm struggling a little bit with is: we are generating a lot of data across the world. And the data lake has all of it, versioned, and then we move it to Data Vault, which is, again, all of it, versioned.
Now we have two copies of the same data. So it's a lot of data, one, and there's also compliance: now we have two copies of the same data. So this is where, and I don't have a solution to this, but in my mind, this is where I'm struggling a little bit. Perhaps at some point we can virtualize Data Vault. I still really like Data Vault, because you organize data by business concept.
I think to me that is just massive and huge, and you're source system agnostic at that point. So as a business, you can continue to function without worrying that, oh, this data came from SAP, or Salesforce, or whatever that may be.
Shane: I think it's the theory of constraints again. In the past we used to have a DBA telling us when our queries needed to be tuned or an index applied. And then that disappeared with the cloud analytics databases, where they're big enough and ugly enough that we can just throw good volumes of data at them and they just run.
And we don't need to care about tuning as much as we used to. We still need to care, no matter what the vendors say. I think we'll get to a stage where we can virtualize, but it's not here yet. And we do that in our platform: we virtualize as much as we can. We've got an example where we've got a consumable table that has 400 million rows.
We can't virtualize that easily without spending a fortune on consume costs every time a query's run. So we physicalize that table. But if it's small, we virtualize it, because why not? From a pattern point of view, you need a switch, where either a human or the system says it's cheaper and easier to virtualize it, because we're not moving the data as often,
or no, actually there's value, due to the constraints, in physicalizing that data at this point in time. And I think we'll see that come through. We've seen it in data virtualization tools like Denodo, where they allow us to write a query and they'll pass it back to three different databases and optimize the query as much as they can.
I think we'll start seeing that in the data space, in our domain, where we can start virtualizing our logic. I'm just thinking that through now. I'm not sure that's actually gonna make it easier, because it's hard enough now, as we move the data through our spaghetti lineage graphs, because we are changing it, to figure out where something came from. If the system's dynamically virtualizing or physicalizing it for us, and we've got a performance problem, we've got another lens, don't we?
Where we go: is it the code? Is it the logic? Is it the data? Is it the engine?
Veronika: And then try to figure out the data quality of what fell through the cracks. For all of you engineers out there, when you think of potentially writing another CSV ingestion, focus your energy on actually thinking through how we can virtualize, and how we can dynamically figure out what the right materialization is. So, a challenge out there to the listeners.
Shane: But that's a big problem. Look at how long was spent in the database world to get query optimizers, materialized views, and that query plan logic about where to run, where it's fast. That's a big technical challenge. We'll see how that goes. So we're in the model layer now,
so we've gone from the lake, we've moved it into the model layer. We both like Data Vault as a modeling technique for that, for a whole bunch of reasons. But what we've seen with the dbt wave is, first of all, they abuse the term model, by taking it from something where we actually do some design to basically calling it a blob of code, which I'm still incredibly angry about.
Now we have a bunch of code, and we just knew what was gonna happen when you end up with 5,000 blobs of code. And I don't know why the market's surprised now that 5,000 blobs of code have some problems. So we're meant to model, but most people don't.
Or is that not true? Am I not seeing the market? From what you are seeing, are organizations and data teams actually modeling data, or are they just doing blobs of spaghetti code that do create-table?
Veronika: Talking about dbt, I have a frenemy relationship with it. And dbt is just a transformation tool, just like Informatica. It's no more than: just write code and transform. So I'm with you on that. And I remember correcting somebody once, I got angry. Somebody wrote some blog, and there it was: oh, dbt models.
I'm like, that's just transformation. Anyway, I think there are two sides to it. I think older, mature companies have formal data models, your logical models and your physical models. And then there's everybody who's new. I was recently on a podcast where we were trying to figure out the new generation, probably from the past five to seven years.
They didn't need to model, because the compute right now can handle spaghetti, messy, overly complicated, 15-nested-CTEs code. They weren't forced. Again, older companies had to do it, and they still have that muscle. They continue doing it. They see value in it, though a lot of them have rebuilt their data warehouses maybe 20 times by now.
But newer ones haven't gotten into a constraint where they have to. And as a matter of fact, I ran into somebody I used to work with recently, in his late twenties, and he said, I still don't see the value of Data Vault. I don't understand why you would do it. We're just fine doing dbt. I'm like, that's okay.
I'll talk to you in five years. We'll have a completely different conversation when you actually run into all the problems that you're generating right now. And then we'll have a conversation again about how Data Vault is actually useful. So I think it depends on the maturity of the organization, I would say.
Shane: With every pattern there comes a pro and a con; there are reasons it's useful and things it's not so good at. And one of the things about Data Vault is it's incredibly complex when you look at the Data Vault table structures, because we get a lot of them.
So we don't typically want our consumers to have to query Data Vault. We wanna put a layer on above it, to take away that extra effort from them and make it easy. But one of the things I'm always intrigued by is when people start to use Data Vault and they're still hand coding the vault instead of automating it. Then you're crazy.
Veronika: Please don't. Do you hate yourself? I remember when I first started with Data Vault and there were no tools. I was just experimenting, and at least 50% of my time was copy pasting. Just all this hashing and concatenating all the columns.
I'm like, do you hate yourself? Fine, you don't have budget; there are open source tools that can make life easier. So I'm a hundred percent with you on that. But on the other side, I've seen enough failed dimensional models. Data Vault isn't any harder, and you can mess it up just like you can mess up everything else.
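The hashing-and-concatenating boilerplate Veronika is talking about is exactly the kind of thing to generate rather than copy-paste. A common Data Vault pattern, sketched minimally (the exact normalization rules, delimiter, and hash algorithm vary between implementations, so treat these choices as illustrative):

```python
import hashlib

def hash_key(*business_keys) -> str:
    """Build a deterministic Data Vault surrogate key: trim and upper-case
    each business key, join with a delimiter so multi-part keys can't
    collide, then hash (MD5 here) to a fixed-width string."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Because the normalization is inside the function, ` abc ` and `ABC` produce the same hub key, and writing this once means hundreds of hub and link loads can share it instead of each carrying its own hand-typed `CONCAT`/`TRIM`/`UPPER` chain.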
Shane: A failure is not purely around the modeling technique. It's around your team design, your ways of working, the skills you have, the roles you have. There's a whole lot of things that cause us to fail or not fail. One of the tricks, though, with Data Vault is, even though in my head it's a really simple conceptual model, because there's only three objects, hubs, links, and satellites, plus a couple of outliers,
somehow we seem to make it complex. Do you have any theory on that? It seems like dimensional modeling is easier to understand at the higher grain, at the conceptual level, and Data Vault seems to be harder, and I've never figured out why.
Veronika: I don't know why either. I honestly don't know. To me the hardest part, setting aside the type of modeling, is actually understanding the business. What I've personally experienced is engineers reproduce source systems. It's easy for us to write a bunch of SQL queries, look at the data, and then map it. Maybe that's how they do a dimensional model, where it just reflects the source systems. In Data Vault, we have to float above all of that, disconnect ourselves from the data, and actually understand how the business works. And I don't know if you've experienced that: some businesses, for me personally at least, I have an easier time relating to, and others I have a very hard time relating to.
And modeling those ones is just difficult. It's really easier to just write a bunch of queries and copy what I already see in front of me. And I think that's actually where Data Vault fails. As you talk to the business, you identify so many gaps that your 20 applications and source systems will never tell you about.
Shane: Or we're just under pressure to do it quickly, and so we give up and we go, oh, I'm just gonna write some queries against that data to answer your question. And I know I should be modeling it, cause I know you're coming back. Cuz that's one of the things we know: the first question is only the first question.
They're just scoping out what the next question is, and they don't know what it is. So when they say, how many orders have we got, the next question is gonna be where, how much, how many.
Veronika: We have this popular children's book in the United States, I don't know if you've heard of it, If You Give a Mouse a Cookie. So you give a mouse a cookie, and the next thing is it asks for milk, and then it basically moves into your house. That first question is just that first cookie; there's definitely gonna be a glass of milk asked for right after it.
Going back again to data products. You need to have that general idea of what else is coming so that you build a little bit more so you don’t have to rebuild things all the time.
Shane: And also automate as much as possible, if you are using Data Vault or any other modeling technique that is repeatable.
Veronika: If you're using anything, at this point, any modeling, anything, you have to automate.
Shane: So what’s your view on source specific data vaults? And what I mean by that is so I’ve got say Shopify data coming in and I’ve got HubSpot data coming in and they both hold customer. But the question I’m getting asked is, how many customers have ordered a product and that’s only in Shopify.
So our natural reaction is to do a source specific data vault. I'm gonna go create a hub which is Shopify customer. I'm gonna get that data out. It's lightly modeled, so it still has some value, but I haven't gone and figured out the core concepts across the whole organization, the integration keys. And then for HubSpot we'll get a question, we may do a source specific HubSpot model, and then we'll try and integrate the two and deal with that horrible thing.
What's your view on that? Cuz Data Vault 2.0 says don't, but they also don't say use a data lake. What do you think?
Veronika: If you're mimicking your source systems, why bother with Data Vault? Just stay with your data lake. But for me, I've been around long enough to know that the first thing I would ask is, give me the definition of a customer. And customer is always the worst thing to ask about; as you go across the company, across departments, across teams, you'll get 50 different definitions.
And that's okay. But I won't touch anything until I get a plain word definition of what something is. And it's actually quite fascinating. I'm not falling into that trap. To your point again, why build Data Vault? If you're building something that mimics source, just use your source data at that point. And the other thing you lead into is data mesh; that's also what boggles me. If your team works on HubSpot data and another team works on Shopify data, and they build their own analytics, and the CEO asks, how many customers do we have? Then what happens?
Shane: And are we gonna say to them, oh, don't adopt Shopify, because we don't have a single view of customer definition yet? That's what should happen. We should actually say, before you implement any system of capture, any operational system, you need to define what a customer is.
But I'm not sure I've ever seen that.
Veronika: Probably not. And I don't think it'll ever happen. Being a good architect takes a while, there's a bit of an art to it, and there aren't many very good architects. So I think we'll continuously build and then fix,
the continuous loop of our world.
Shane: If you look at the building and construction industry, it's like, we'll just go pour some various foundations wherever you want, and that's okay, we'll be able to build a house on top of it. Don't worry.
Veronika: It's like street paving, right? They repave it every couple of years, and you'd think by now they'd know how.
Shane: Okay. So we've gone system of capture and the teams around that. We've gone into ELT and into a data lake or persistent data store, which maps, like an ODS does, to the source system tables, gives us the changes and that immutable history. We're modeling it using whichever technique we prefer, but we both prefer Data Vault, to give us that value of modeled and designed data.
I would typically then put a consumer layer on top, a presentation layer. So I'd typically put either a modeling technique or a technology that makes it easier for our consumers, their last mile tools, to use that data. So I always have a third tier. What about you? Do you tend to have one?
Veronika: Same. Yeah, we definitely do. Again, I really like a presentation layer. Data Vault is great, but it is complex; it's not for everybody. But that presentation layer to me has a few characteristics. And again, kinda curious: do you use dimensional? I've seen a lean towards one big table, but you can virtualize it, so it doesn't really matter.
Before I heard the term one big table, I used to call it flatten-the-hell-out-of-it, data buffet, whatever that is. But to me, you have to define the grain. If you can't define a grain, that object cannot be presented to users. And it has to have a data dictionary: every field in that object has to be defined in regular plain words.
If whoever’s building it can’t explain it, if the person requesting it can’t explain it, then it shouldn’t be going out. So those are just like this handful of rules. And then once that’s out, then anybody can understand how to use it, what’s in it, how to join it to other things. It also has obviously like keys, naming conventions
if there's a key with the same name somewhere else, they'd better be joinable, right? Not actually mean different things.
Shane: Yeah, data buffet, that's a great term. Again, when we have our vault lakehouse, we should data buffet it. But it's good because it's that whole idea that if we're gonna eat something, we typically want a list of ingredients. We wanna know what's in that meal. So when we get this data served to us, let's find out what it is.
I'm a great fan of one big table. Everybody keeps telling me analysts prefer dimensional modeling, and I'm gonna call bullshit on that. I think they've been taught to use dimensional models because that was what they've been given. So they understand how to link dimensions to facts. But I think if you took somebody from that five-to-seven-year generation and asked them what an SCD type two was, and how they made sure they got the current value or the value at a point in time, they're gonna look at you blankly, cuz they haven't been taught that.
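For listeners in that camp: the SCD type two question Shane mentions, "the current value or the value at a point in time", comes down to each row carrying a validity interval. A minimal sketch (column names `valid_from`/`valid_to` and the open-ended `9999-12-31` sentinel are conventional but not universal):

```python
from datetime import date

def value_at(history: list[dict], key, as_of: date):
    """SCD type 2 lookup: each row has valid_from (inclusive) and valid_to
    (exclusive); return the attribute version in effect for `key` on `as_of`."""
    for row in history:
        if row["key"] == key and row["valid_from"] <= as_of < row["valid_to"]:
            return row["value"]
    return None
```

Asking for today's date gives the current value; asking for a past date gives what was true then, which is the whole point of keeping the history instead of overwriting.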
So I'm not sure that the dimensional model is easier to understand; I just think it's been taught well. One of the problems with one big table, and it was called out on social media just lately, is that while we're using columnar databases now, which means if there's 2,000 columns and we're only using five, we don't care, because we're only paying for the compute for five,
there still is a problem: when you give an analyst a 2,000 column table, it's just too big for them to find the columns they want. So that's the downside. If the grain is order line and they want to count customers, they have to understand to do a distinct count. But that's no different from understanding how to join a fact and a dim.
And typically we serve a table of customers and a table of events separately, so you can always go and see the customer table, cuz why not? And that's one of the things again: if you're using a modeling technique in the middle, you should automate the creation of your consumption, of your presentation layer, cuz you understand the patterns.
With Vault we know that if you take a hub and a sat and put them back together, you've effectively got the equivalent of a dim. If you take a link and break it out to one big table, you've effectively got the same as a fact with the dims already embedded in it. And we can automate that;
we know how the relationships of that model work, so we can write code that automates that consume layer. And the same with the technology we use: some can be virtualized, some can be physicalized, some can be summarized, some can be fresh, point in time. They're all just patterns that we can use to craft the right solution for our consumers, and that's what we should be doing.
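Because the hub + sat = dim relationship is mechanical, the consume-layer view can be generated from table names alone. A toy generator, with hypothetical naming conventions (`hub_`/`sat_`/`dim_` prefixes, a shared hash key column, an `is_current` flag on the satellite):

```python
def dim_view_sql(hub: str, sat: str, hash_key_col: str) -> str:
    """Generate a 'hub plus current satellite' dimensional view over a
    Data Vault, following the pattern that a hub joined to its sat is
    effectively a dim."""
    return (
        f"CREATE VIEW dim_{hub} AS\n"
        f"SELECT h.*, s.*\n"
        f"FROM hub_{hub} AS h\n"
        f"JOIN sat_{sat} AS s USING ({hash_key_col})\n"
        f"WHERE s.is_current = TRUE"
    )
```

Run it once per hub/sat pair from a metadata table, and the whole presentation layer comes out of a loop instead of being hand-written per dim.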
Veronika: A hundred percent. And the other thing about the dimensional model: I think a lot of tools require it. That's what I've observed at least; the popularity in the United States is the dimensional model. It's not that it's taught, but, to your point, that SCD type two, like, what are you talking about?
You're a hundred percent correct, but I think facts and dimensions are what a lot of visualization tools require, and that's what a lot of analysts are used to seeing.
Shane: I think Europe has a stronger vault population and America has a stronger dimensional one, which is really interesting given that vault came out of America.
Kimball did a great job of training and educating, a brilliant job of helping people use that technique; vault, not so much.
And the second thing is there's these urban legends. The one I always get is: Power BI prefers a dimensional model over one big table. And I'm like, really? But if it's doing direct query and it's passing it down, then the database is doing the work. So are you telling me that the DAX engine prefers dimensional?
Like, where's the bottleneck? You tell me where the optimization is, and then I'll agree with you. But most people just go, oh no, it runs faster. And I'm like, compared to what?
Veronika: No, yeah, you're right. But there's dimensions and measures in tools. I've seen Looker as an example: there are dimensions and measures, so it maps to dimensions and facts more easily, but the underlying model doesn't matter.
The other downside of one big table is, if you're materializing a table, then maintaining it is hard when you have 200 columns and some need to be updated or merged. That's where, at least from the operational side, the overhead comes in. But if they're virtual, then it's fine.
I think with data you do need to know what you’re doing or do it and learn and then improve.
Shane: And so there's trade-off decisions with every pattern, pros and cons, things that work and give you value, and things that are gonna bite you in the bum. So let's move on to that last one, which is discovery, or what you called bronze: this idea of sandpits, of people playing, experimenting, researching.
And typically that will either be your data scientists, who need full-on analytical feature factories and ML models and ChatGPT, or your analysts, who are probably more draggy-droppy, drop some data in, or some Jupyter notebooks, those kinds of things. And typically in the past, we always had that as a separate environment.
It was always off to the side. It was called a sandpit. We built a whole lot of policies around what they could and couldn't store in there. We put a massive constraint on it, so it was never as fast as the rest of the platform, cause we couldn't afford it. What are you seeing now in
your bronze discovery area, that zone?
Veronika: It's very similar. Only, my experience in the past five, six years has been with Snowflake, so you don't have separate environments; you do zero-copy clones. But it's still very much a sandbox, isolated to teams. There's like a team sandbox or a personal sandbox that you can't just share, because of all that compliant data.
Just because you have access to it doesn't mean that somebody else has access to it. And as they're done, and they're like, okay, this is what I want to go to production, then it has to go through the process. It has to go and land into that gold data, right? Ultimately, because otherwise data governance is all gone, bye-bye.
I don't wanna be a bottleneck. I appreciate the fact that data science teams, advanced analytics teams, actually do need to experiment, and a lot of what they experiment with might not see the light of day. But the parts that do need to be productionized and support the business have to go through the process, and be documented, and have lineage, and have the data dictionary, just like everything else.
But that hasn’t changed. They just live on the same platform now.
Shane: I did a project many years ago with MapR, back in the Hadoop world. And I was trying to get my head around this idea of a data lake and how it was different from a persistent staging area, apart from the technology. And one of the patterns that I loved was this idea of bounding policies and patterns by a community of people.
And what I mean by that is, say here's a group of people, and here's the collaboration they want to do, and here's the boundary, and here's the gates, the policies that actually have to be met before we can go outside that. It's that same scenario you've got, where there's a pattern of: this is my space, I can do whatever I want, but I can't share it.
This is my team or my community space and I can do some work and share it with my peers for review, so they can use and get value out of it. Then I can go to a bigger bubble, which is a community, that might be a business unit or domain across the organization.
And so again, that's a wider audience. And then from there, maybe I go to everybody, and then from there to external, outside my organization. If you think about each of those bubbles, when we move data outside a bubble to the next one, the blast impact of that data is bigger, cuz more people have access to it.
Therefore we should invoke more policies and processes around it, because when bad things happen, they escalate with the number of people impacted. So I think that's what you were saying: treat it as those bubbles of people and community, and how far out the data goes.
But again, when we look at data governance, people don't deal with it that way. They go back to the data governance committees and their documents, and they don't even do a simple model that says, if it's gonna go to more than five people, it should be peer-reviewed;
if it's gonna go to more than 20 people, it should be automated. Just simple policies around those bubbles.
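Those bubble policies are simple enough to be code rather than committee documents. A sketch, with the thresholds and control names taken straight from the examples above (they're illustrative, not a standard):

```python
def required_controls(audience_size: int) -> list[str]:
    """Map the size of the audience a data asset reaches (its 'bubble')
    to the governance controls it must pass: bigger blast radius,
    more controls."""
    controls = []
    if audience_size > 5:
        controls.append("peer-review")
    if audience_size > 20:
        controls.append("automated-deployment")
    return controls
```

A personal sandbox needs nothing; something shared with a domain needs review; something org-wide needs review and automation. Encoding it this way makes the policy checkable up front instead of reviewed after the fact.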
Veronika: Lowering the barrier for DataOps, the principles: CI/CD, PR reviews. And that's where dbt, your modern data stack, the uncool modern data stack, comes in handy. Deploy actions, merging PRs, reviews, rotation; we're lowering the barrier for that.
And then other teams, even within their own bubbles, can follow this pattern. And you're already kinda on your way to productionizing if you wanted to. So I think that's where you're right about data governance committees: some become just a ton of bureaucracy.
Shane: They're after the fact. They're reviewing it after the fact, rather than setting the policies up front that are valuable. And that point about reuse, DataOps and reuse, is really interesting. On another podcast I do, we had a couple of people who came out of Spotify, and, they hate the fact that it's called the Spotify model, but the Spotify way of working back at that time is well known.
And one of the things that was interesting about that way of working was each team was completely decentralized and autonomous. But if there was a capability in the organization that more and more teams used, then that kind of became the de facto standard. Not that people were lazy, but they just said, if we're gonna use a ticket management system and everybody's using that piece of crap called Jira, they just called it Jira.
They go, why wouldn't we use Jira? Everybody's using it, everybody knows how to use it, the support's there. We can move people in and out of teams and they'll understand the tool we're using. So they'll naturally coalesce to using the same tool. So if we're sitting out in that big bubble of production capability, we should be picking up those tools and techniques we have and trying to make them available to people in other bubbles, and make it easier for them, because they'll just use them because it makes their life easier.
Let's not hold it in our little bubble. And it's the same with access to the lake. You see organizations that say analysts and scientists can't have access to that raw layer, that silver layer in your terms. They can only go to modeled, but all the behavioral data is in that raw, in that history layer,
so they have to have access to it. So again, what do you see?
Veronika: It’s funny, I just came back from Snowflake Summit and I was talking about security. Do you go for data democratization or data dictatorship, the pattern of least privilege? The data is safer, forget about anything else. But data democratization is where you want everybody, globally, to have access to all your data, because that’s when creativity happens.
That’s when new things for your company happen. But the other side of it, which I’ve seen as well, is people have access to all the data and they bypass the process. It’s a backdoor. They create these ad hoc, who-knows-what, random things that are ungoverned.
So there again, we’re back to trade-offs. But I truly believe that everybody should have access to data. You want it anonymized and compliant, but in general, especially in the world we’re living in now, everybody needs to get better at data.
Otherwise you’ll find unsupported, unofficial things popping up, where people just don’t want to follow the process because it’s a bottleneck, and they move off and do something else.
Shane: Some people don’t want to fill out an expense form to get their money back. But they don’t just pay themselves; they don’t just go into the corporate bank account and take the money. There are a couple of rules to stop them doing that. So we do need some rules and policies.
The one that I always crack up about is when we restrict access to the raw data. What we’ll see is spreadsheets, Google Sheets, turning up with that data, because somebody had access and then shared it with somebody else. And now we get these offline data marts in Excel, because people are just going to get the job done.
So we need to protect the data, but we need to be very clear about what we actually have to protect. Then let’s protect that properly and make everything else accessible. And let’s make sure it’s monitored. If the data does turn up in another database somewhere and starts becoming a semi-production database, that should be observable.
We should see it over there and ask: what’s the problem? What’s the bottleneck? How can we solve that for you? But that’s hard. Putting a big gate on it and saying nobody can have access is so much easier; we feel safe. That’s a common pattern.
Veronika: I think the other thing, and I don’t know if you’ve seen it, is whether there’s a clear way to show people where data came from. You’re looking at a report and there’s a stamp on it that says gold data. You’re like, I’m comfortable. I know exactly what it is, I know where it came from.
I know the team that owns it. Versus somebody gave me a Google Sheet, and you’re like, huh. So I think collectively it takes a little bit of time, but as a company, as long as you can clearly define where data came from, people start to lean towards trustworthiness. The other thing, yeah, maybe it’s quicker, but I don’t know if it’s correct.
So it just takes a little bit of time.
Shane: I think stakeholders trust people; they don’t trust process. They just look at it and go, oh, I know Jane, she’s the analyst that always gives me my numbers, and the numbers are always right. So, Jane, where are my numbers? Excellent.
Hey, look, let’s wrap that up. What we talked about was the systems of capture, the software applications and the engineering teams, and some of the challenges and things we can do around there. Then we talked about ELT, not ETL, into a first-layer lake: raw, history, silver, whatever you want to call it.
That’s an immutable, versioned copy of the data from our source systems. Then the middle, modeled layer: gold, designed, trusted, whatever you want to call it. That’s where we’re modeling it, cleaning it, inferring values that don’t exist, inferring values in terms of KPIs. Then some form of presentation or consumable layer.
Make it easy for the people that need to use the data to get it in the way that suits them. And then some form of discovery layer, or bronze, where the analysts and the scientists can go and do that discovery, sandpit type of work. And the key thing is to define those layers: add a couple if you want, take one out if you don’t think you need it.
But just draw a picture, put some boxes on it, give them names, and then focus on the policies. What can go in there and what can’t? What will happen in there and what won’t? Who can access it and who can’t? Those are really the core things you’ve got to define for each of those layers. And if you do that, you’ll be in a much better place than chaos.
Don’t overbake it, because you’ll just put in constraints that people will bypass when they have to. So, yeah, is that kind of how you think about it?
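The “draw some boxes, name them, write down the policies” step from the wrap-up can be made concrete as a small sketch. The layer names echo the ones discussed in the episode, but the specific rules and roles below are hypothetical examples; the point is to write the policies down somewhere explicit.

```python
# Hedged sketch: layers as named boxes with explicit policies.
# Rule values and role names are illustrative assumptions only.

LAYERS = {
    "raw":          {"mutable": False, "modeled": False,
                     "access": ["engineers", "analysts", "scientists"]},
    "modeled":      {"mutable": True,  "modeled": True,
                     "access": ["engineers", "analysts"]},
    "presentation": {"mutable": True,  "modeled": True,
                     "access": ["everyone"]},
    "discovery":    {"mutable": True,  "modeled": False,
                     "access": ["analysts", "scientists"]},
}

def can_access(layer: str, role: str) -> bool:
    """Check a role against a layer's written-down access policy."""
    allowed = LAYERS[layer]["access"]
    return role in allowed or "everyone" in allowed

print(can_access("presentation", "marketing"))  # True
print(can_access("modeled", "scientists"))      # False
```

Even a table this small answers the core questions from the conversation: what each layer is, whether it can change, and who can get at it, which is exactly what a governance committee reviewing things after the fact cannot give you.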
Veronika: Yeah, exactly. And I think it’s: define it, just use words. At the end of the day, name it whatever you want, but define it. And I also spy a book. I know this is an audio podcast, but The Phoenix Project. Isn’t that a delightful book to read?
Shane: Oh, I love it. What I loved about it is you’re learning lean manufacturing and DevOps without knowing it. They’re just telling you a story, and it’s like, oh wow, that’s lean.
Veronika: And relatable, isn’t it? It’s almost like, I remember that. Yeah, I remember that too.
Shane: Some good stories in there. You go, oh, I know, was it Steve? Yeah, I know. Oh no, Brent.
Veronika: Brent. Brent.
Shane: Yeah, I know. Lots of Brents, actually. Maybe I’ve even been a Brent. No, I haven’t, but I know lots of Brents.
Veronika: No, I think I was a Brent at some point in my career, for sure. Yeah.
Shane: Yeah. And the last thing for me, it’s really interesting that you came from biology. I’m always amazed at how many people in data, and data science particularly, have come out of biology. It seems to be a field a lot of people in this space have come out of in terms of their training and education.
Veronika: Probably because a lot of us had that naive dream of going for a medical degree and then realized that it’s a long journey, and you really have to love it to do it. And we realized that we probably don’t love it as much. Yeah.
Shane: I suppose data’s as messy as human bodies. We’ve just coined the Vault data house and the data buffet; you could just rename yourself the Doctor of Data. There you go.
Veronika: There we go. Yeah.
Shane: Excellent. Hey, look, that’s been great. I think we’ve managed to cover each of those layers, and I think it maps back to what you said in the Slack channel, so that’s even better. It’s been great to go through each of those layers, what they are and what they aren’t, and to come back to that key message.
Just define what yours is, put some words on it, put some boundaries on it, and then make sure that’s what you’re executing, or recognize that you’re changing it. Change is okay, but it comes with a cost and a consequence. Thanks for coming on the show, and I hope everybody has a simply magical day.
Veronika: Thanks for having me.