Merging Data Vault and Medallion Architecture Patterns with Patrick Cuba
Guests
Join Shane Gibson as he chats with Patrick Cuba on combining the Data Vault data modeling pattern with the Medallion Architecture pattern.
Listen on your favourite Podcast Platform
| Apple Podcast | Spotify | YouTube | Amazon Audible | TuneIn | iHeartRadio | PlayerFM | Listen Notes | Podchaser | Deezer | Podcast Addict |
Podcast Transcript
Read along you will
Shane: Welcome to the Agile Data Podcast. I’m Shane Gibson.
Patrick: Hi, I’m Patrick Cuba.
Shane: Hey Patrick, thanks for coming on the podcast. Today I want to talk about this term that you’ve coined, the modern data vault stack. What intrigues me, having looked at what you’ve written, is you’ve taken some Data Vault patterns, and you’ve taken the Medallion architecture patterns, and you’ve blended them together in a way that people can understand how to apply Data Vault modeling into Medallion architecture.
But before we do that, before we deep dive into that set of patterns why don’t you give the audience a bit of a background about yourself?
Patrick: No worries. I’ve been around in the data space for a long time, maybe too long.
My background is I was a hardcore SAS architect. Think of a SAS architecture and chances are I’ve heard of it or worked on it. And I actually came across Data Vault at a customer site.
We were a few contractors out in Brisbane, actually, and the information architect discovered this thing called Data Vault. As a bunch of contractors, we thought, all right, we should get trained on this stuff to better support our mutual customer.
We did, and we quickly realized how patterned Data Vault can be. Looking at each other, we thought, wait a minute, we could write or design a tool that could automate these patterns. Some way, somehow, you map what is the hub, what is a link, what is the satellite. And then you let the code take over as deployable templates and build a Data Vault model.
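The map-then-generate idea Patrick describes can be sketched in a few lines. This is a hypothetical illustration, not the tool they actually built; all table names, column names, and templates below are invented for the example.

```python
# Minimal sketch of template-driven Data Vault automation: you declare which
# source columns are business keys (hub) and which are descriptive attributes
# (satellite), and templated DDL is generated from that mapping.
# All names here are hypothetical, not from any real implementation.

HUB_TEMPLATE = ("CREATE TABLE hub_{name} (hk_{name} BINARY, {bk} VARCHAR, "
                "load_ts TIMESTAMP, rec_src VARCHAR);")
SAT_TEMPLATE = ("CREATE TABLE sat_{name} (hk_{parent} BINARY, hashdiff BINARY, "
                "{cols}, load_ts TIMESTAMP, rec_src VARCHAR);")

def generate_ddl(mapping: dict) -> list:
    """Turn a declarative hub/satellite mapping into deployable DDL strings."""
    ddl = []
    for hub, spec in mapping.items():
        ddl.append(HUB_TEMPLATE.format(name=hub, bk=spec["business_key"]))
        for sat, cols in spec.get("satellites", {}).items():
            col_defs = ", ".join(f"{c} VARCHAR" for c in cols)
            ddl.append(SAT_TEMPLATE.format(name=sat, parent=hub, cols=col_defs))
    return ddl

# Hypothetical mapping: one hub with one source-aligned satellite.
mapping = {
    "customer": {
        "business_key": "customer_id",
        "satellites": {"customer_crm": ["name", "email"]},
    }
}
for stmt in generate_ddl(mapping):
    print(stmt)
```

The point is the shape of the workflow: the model is declared once, and the repeatable pattern is stamped out by code rather than hand-written per table.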
Of course, us being contractors, on the first gig we made mistakes, but that’s the agile way to go. We learned through iteration, through collaboration and so on. And over the years we became more interested in what other people were doing with Data Vault.
So I met, in Brisbane at the time, Roelant Vos in person. He’s a great guy to talk to, to bounce ideas off, and he was pushing the Biml type of implementations for Data Vault. I subsequently joined a Data Vault consultancy and was doing that full time, delivering either WhereScape-based Data Vaults or advising customers on whatever they were doing in their space with regards to Data Vault.
That included a NoSQL platform, where my background in developing a Data Vault solution or tool helped in developing their Data Vault stack, because they decided, we don’t want to use a tool, we want to build our own. Which was really exciting, because at that time I didn’t know a whole lot about Scala and Spark and that kind of stuff.
But I learned a lot from working with those guys. I was learning from what their expertise was, and they were learning from what my expertise was. And I was super grateful for that engagement, because it was originally a two-month gig, but people can tell when you’re enthusiastic about something, and that gig turned into one and a half years of, oh yeah, okay, we can now also do this, we’ll do that, and so on.
And it built from there. I joined Snowflake nearly four years ago. And to be completely transparent, I was not intending to do any Data Vault work, because it was about getting customers onboarded onto Snowflake. I was a big fan of Snowflake already. But there is a need for, how do we do this on Snowflake, for example?
And common questions, which has led to me developing patterns, experimenting with things and creating blogs, which is something I enjoy doing. Webinars are around things like, okay, Snowflake doesn’t have referential integrity in terms of indexes and primary keys and so on. What do we do now?
When I joined there was a clear need for somebody to fill that void, and luckily for me, I did have the time to experiment and develop. And when Kent decided to part ways with Snowflake and semi-retired, marketing asked me to write some blogs about Data Vault.
At the time I thought, okay, Data Vault on Snowflake, how do I turn it into something that’s consumable? And I came back and said, great, I’ve got an idea for a series. It ended up being about 12 blogs in a series, which built up from the very basics: what’s partitioning, what’s pruning, all the essential things, which actually talks about big data in general, because everything’s file based now. Everything’s blob store. What does that mean on Snowflake, and how do you manage that? And what we were discussing just before the start of this podcast: it is the partitioning, of course it is; okay, let’s throw more memory at it, but that’s not always the right solution.
But the third important thing, to me at least, is what algorithm are you enabling by the data model that you’re using? And it’s not just Data Vault, it’s any type of model. Sure, you’re answering the business question. Sure, you’ve modeled, let’s say, a dimensional model, which in Data Vault is a satellite.
But how have you structured that so that you’re actually using the right algorithm? Just before the start of this podcast, we were talking about my history and stuff, and it’s not to be arrogant in my opinion, but a lot of this stuff is not documented very well, unfortunately, even though it’s been around for decades already.
And in my opinion, it seems to be forgotten because in today’s modern data architecture I think data engineers are asked to do a lot. A lot more than what they’re originally scoped to do, and they’re overburdened with learning and keeping up to date with all the new technology that’s coming out there, all the new tools.
Oh, this does this thing better, or that does that thing better. But at the core, you’re still running on a data platform, whether it’s Snowflake, Databricks, BigQuery, whatever. What is it that you need to do to take best advantage of that? And that’s the space that I found myself in at Snowflake.
Guys, let’s go back to first principles of why we’re using this database in the first place, whether it’s Data Vault or not.
Shane: Your background is interesting, because I’ve done a lot of work in SAS in previous lives as well, and it’s always intrigued me that Brisbane was the origin point for Data Vault in Australasia. It got adopted there early and then fanned out. It’s really weird if you track all the Data Vault implementations in Australasia over the years.
So the other thing you did do, though, was during COVID you wrote a book around Data Vault implementation, right? So there’s a whole lot of detailed knowledge there around Data Vault modeling and how to implement it, the actual technical implementation rather than just the modeling technique.
I just want to talk about this idea of Data Vault and Medallion architecture, because they are two terms and patterns that often get treated in different ways. So before we start, I just need to do some anchoring. When you talk about Data Vault, how would you describe what you’re talking about?
Patrick: I’ll start with, and I will read this portion because it’s very short, but it is the definition and I’ve used it in presentations to describe Data Vault. This comes from the inventor, Dan Linstedt, of course: it’s a system of business intelligence containing the necessary components needed to accomplish enterprise vision in data warehousing and information delivery.
Now, I’ve used that definition at multiple conferences last year, but I focused on the words enterprise vision, because there’s a lot packed into that short sentence, which is probably why I think it’s very valuable. Not only data warehousing, but information delivery and business intelligence. And the enterprise vision for me says, okay, what are we actually doing with this data model?
Let’s take away Data Vault for a second. If I was to model a third normal form, or an information mart as per Kimball, the three main things I’m always after in a data model are the business entities, what business entities this business entity is related to, as in interactions and transactions and business events and whatever, and then the state information of those entities as well as the relationships. So: hubs, links, and satellites.
Why pick Data Vault over another data modeling technique? So far, from my experience and what I’ve seen, nothing in the industry takes advantage of the OLAP platform’s capabilities quite like Data Vault does, and nothing embodies those three things as well as Data Vault does today, in my opinion.
There are variances in techniques and opinions. Of course there are, because people come from various backgrounds and they build their models and their architecture to suit their use cases and their consumers, so to speak. And I’ve spoken with quite a wide range of different implementers around the world, especially with my role in Snowflake.
And I bring up concepts like business key collision codes and that kind of stuff. Some of it resonates very well. Some of it does not. Some will ask, why are we doing that? And it’s a fair point, because maybe you’re not integrating 30 different data sources into a single canonical model.
That’s fine. Then maybe you shouldn’t consider Data Vault. I’m quite happy to say, looking at your use case, I think the complexity is not worth it. You should stick to Kimball modeling. It’s much easier for you. It doesn’t look like you’re going to be changing a lot of things in your data model anyway.
And that’s the gist of Data Vault in the first place: I can easily integrate by business key through those hub tables. And as my organization evolves and there are more data sources, or existing data sources change, the Data Vault model is designed to ingest those non-destructively.
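A minimal sketch of that hub integration idea, with invented source names and keys: the hub holds only distinct business keys plus load metadata, so a new source arriving is a non-destructive insert of keys the hub has not seen yet.

```python
# Hedged sketch: the hub is modelled as a dict keyed by business key.
# Loading is insert-only; existing keys are never updated or deleted,
# which is what "non-destructive" means here. Source names are illustrative.

def load_hub(hub: dict, records: list, key: str, source: str) -> int:
    """Insert unseen business keys; existing keys are left untouched."""
    inserted = 0
    for rec in records:
        bk = rec[key]
        if bk not in hub:
            hub[bk] = {"rec_src": source}  # first-seen source, kept for audit
            inserted += 1
    return inserted

hub_customer = {}
load_hub(hub_customer, [{"customer_id": "C1"}, {"customer_id": "C2"}],
         "customer_id", "salesforce")
# The same key arriving later from a second source is not duplicated:
added = load_hub(hub_customer, [{"customer_id": "C2"}, {"customer_id": "C3"}],
                 "customer_id", "ecommerce")
print(sorted(hub_customer))  # three distinct business keys across two sources
```

Adding a third source is just another `load_hub` call; the hub keeps growing without rewriting anything already loaded.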
Shane: I think the key is there’s the Data Vault modeling technique, the patterns around hubs, sats and links, PIT tables, bridge tables, the physical modeling, and then there’s everything else around it. So when people talk about Data Vault 2.0, and they talk about Dan’s approach, I’m like, shh, that is a way of working, a form of framework, methodology, a bunch of patterns that is more than just the modeling technique.
When I talk about Data Vault, I focus on the modeling technique, because that’s the piece that I use. I don’t use the rest of the framework. And that’s okay. I have my own way of working, of which Data Vault is a key part.
And I also follow Joe Reis’s approach of mixed model arts. So we will use One Big Table, Activity Schema, Data Vault, stars sometimes, where it makes sense, where it has value given the context. For today, I want to talk about the modeling patterns, because that’s what intrigues me about Medallion.
The second question then is, how do you describe Medallion architecture? If I said to you, what is Medallion architecture, what’s your anchoring statement? What do you use?
Patrick: We were talking before starting this call about things that haven’t changed in 20 years or longer. Medallion architecture is nothing new, in my opinion. It’s very well marketed, in my opinion. So everybody seems to go, oh, I do Medallion architecture. I don’t know.
What is it that you’re doing? I’m landing data, I’m transforming and cleaning it, and I’m making it available for presentation in some modelled form that’s easy to ingest. We’ve been doing that for a long time. And that’s my premise of Medallion architecture. It’s, okay, we’ve landed the data.
Okay, what are we going to call it? In the industry there have been various terms describing the same thing. But if you speak to somebody from a particular background, they’ll say, oh no, but it’s slightly different, we’re doing this. Say you’re landing raw data, then not doing anything with it, and then transforming it.
Okay, so that’s your bronze layer. Is that what it is? Yeah. With reference to customers I’ve worked with, especially in the last year, there was a big customer, actually New Zealand based, and they didn’t like the names medallion, or bronze and silver and gold.
So we used different naming: a curated zone, a coherent zone and an intelligent zone. I thought, actually, that’s a pretty good description. It’s doing the same thing, but it’s more descriptive of exactly what we’re doing with the data. When we landed the data, we were using the platform’s ability to do classifications and tagging and so on.
We call it the curated zone because it’s been curated, and we try to automate it as much as possible. For the coherent zone, I really liked the name they used instead of the gold layer or integration layer, because of what they defined as coherent data, which kind of aptly describes Data Vault in a sense.
They had a wonderful definition. They said coherent data is data that lives a connected life. Actually, that’s pretty good. I like that. Which exactly speaks to how Data Vault works in terms of integrating at the hub tables. Another customer called it a SAL, a source-aligned layer, and then the EDW is the integration layer. So it just maps very nicely. And we were mixing a topology of Data Mesh with Data Vault. But we were of the opinion that completely decentralized doesn’t work, particularly for this operating model where we wanted to enrich the existing data using existing or maturing patterns as we were defining them, and have consumer domains use that enriched data and bring their own business case. And if we already support certain source data that they need, they can already pinch and pull the data they need from the integration zone. For me, in a short sentence: it’s nothing new.
We’ve been doing it for years. I’ve spoken with colleagues who even describe Data Mesh in similar ways: aren’t we doing this already? Even data contracts, which is another topic, came up with another professional I was having a beer with in Europe.
Isn’t that an interface contract? Yeah, it is.
Shane: I agree. For me, I see Medallion architecture as a layered architecture. So what I try and do is go back to core patterns, and all it’s doing is describing layers or buckets, and you set a bunch of principles, policies or patterns to say, this is going to happen in this layer, this is going to happen in that layer, and this is going to happen in that layer. And I think you’re right, it’s been buzzwashed, marketected by a certain vendor, but this idea of a layered architecture has value. That is a valuable pattern. Compare it to the big data days of Hadoop, where you’d just stick all your data in one big bucket and not have any layered architecture.
We ended up layering the architecture in the code layer, right? Our code would logically layer the data however it was physically stored, right? We always layer it, because otherwise we get a big bucket of crap that we can’t deal with. Now, what’s interesting for me, though, is terminology has changed over time.
We used to talk about a persistent staging area. We used to talk about the EDW layer, the presentation layer, right? There have been lots of different words. Layer one, layer two, layer three was always a good one, because then you didn’t have to be clear about what each one did. And in our product, we talk about landing, history, design, and consume.
I can say landing and history kind of feel bronzy to me, but we treat them separately for a bunch of reasons. And then in your diagram, you’ve got Data Vault sitting in silver, right? You’re saying the core modeling technique for Vault is silver. Whereas there’s a book coming out on O’Reilly around implementing Medallion architecture,
and the designed layer, the conformed, coherent layer, seems to be in gold, and silver seems to be just a cleansed version of the source system data, maybe an ODS. And I’ve only scanned it, right? I haven’t read it in detail. One thing I love about patterns is they should be relatively clear, with a good boundary and a good description.
One thing I hate about marketecture is that it’s so bloody confusing it has no value, because your silver is not their silver and probably not my silver. So it’s around being very clear what it is and what it isn’t, so somebody can adopt it. You’ve done a really great Medium post, which I’ll put into the podcast notes, where you break out bronze, silver and gold and then overlay the data architectures and a whole lot of other stuff.
So I just want to talk through that, because you kind of map the aliases: bronze is a form of ODS, silver is a form of EDW, gold is a form of consumer, right? So is that how you see it? You’re basically taking patterns that you’ve used in the past and mapping them to the Medallion language, to say this happens in this layer, this happens in that layer, and this happens in that layer, to give you clarity when you’re doing an implementation.
Patrick: It’s actually excellent, the way you described it there, because in fact that diagram, or that article per se, is something I could walk back a few years. It started with a particular customer that wanted to do these layers. We had a few more layers than that, but within each layer there were sub-layers; you could almost call them sub-domains.
In fact, what I put out there was sitting in my backlog of things to talk about or describe for over a year. Since joining Snowflake, it comes to maybe November when everybody starts to quiet down and, I’m sure you’re the same, you’ve got this backlog of ideas.
Okay, I’ve got time to really think about these things, or mature them, or maybe something is worthy of having its own article that I’d like to get out there. Because, one, I’m not dictating how things should be, but this is what I’ve seen work. I’m happy to discuss it, happy for it to be challenged.
That diagram, as you put it so well, one person would say gold is this or silver is that. That architecture in particular has evolved through, let’s say, three or four years, because in my current role, when I join at a customer site, especially if they’re going greenfields or vanilla, the usual question is, what do we do?
What are we aiming towards? And we whiteboard. Whiteboarding is remarkable, by the way, I’ll throw that in there. With one of the first customers it was extremely helpful. It was just after COVID happened, and everybody got into the same room.
We knocked out our layers, what we want, in a one-hour workshop. And I’ve taken the same pattern since, and you always work with very smart people, very intelligent people who know what they want and then say, oh, I like that, but what if we do X, Y and Z? If you notice in that article, I tried to parameterize all the layers, and every single one of my layers starts with the letters lib underscore, right?
Before working with this one particular customer, I didn’t have that. I just went sal, cal, bal, which is another common pattern. But he was a data scientist, a head of some department, and he liked the terminology of having these locations as libraries. So LIB for library, and with a SAS background, you’ve done some work in SAS, everything was a library.
Actually, that’s a pretty good idea. I like that idea, because we also expanded what I had presented as sal, cal, bal. He said, I’d really like a lab area. Oh, actually, that’s a good idea. What’s it for? What do we do? And his requirement was, I want my lab to have access to everything. This is where my experience came in and said, you don’t really want that.
It’s a great idea. You do want access to production-value data, but you don’t want to just give access to all that data willy-nilly. We need to design this properly. So that’s how that pattern evolved as well. I’ve subsequently filed that away, and when a new customer has a similar request: what do you think about this pattern?
And the pattern has evolved since then, including with that NZ-based customer. When I joined, based on an existing blog, they already had some picture from my blog in one of their Miro boards. Hey, that’s mine! Yeah, yeah, we like some of these things. We don’t like this, but we like a bunch of things here.
And then I worked with them as well, and it has evolved again, which is awesome, because I think that’s one of the joys of consulting: you get to work not only in different industries, but also with very smart people around the world, data scientists, engineers, and whatever.
And you can see, okay, actually this pattern that you guys have suggested, I like it.
Shane: And you iterate, right? Like you said, I remember a diagram I did many, many years ago which had seven layers, because I wanted to be very clear about where something went: if you’re going to do that thing, that pattern goes in that layer, and it has to go in that layer. And it became this immutable architecture that didn’t work, because actually every now and again the rules had to be broken for whatever reason.
And then what I found was, as you explain it, it became too complex. It was, oh, you know, seven, that’s too many, I’ll just go work with somebody that does three. And I’m like, yeah, but there’s actually 12 in there, right? If I break it down.
Patrick: So, my layers had seven as well at one stage, by the way.

Shane: It’s being able to articulate and teach or coach or share a pattern in a way that’s explainable, and then actually being able to implement it. So I just want to go through yours, because with Data Vault and with Medallion, or layered architecture, there’s some intriguing stuff in there for me, because it aligns with the way I think and do it.
But it seems to be different from the way some other people describe it, especially with a Data Vault architecture. And so the first one is, in the bronze, I think you’re talking about a persistent staging area, right? So I’m going to describe it: you’re saying that the physical structure of the tables or files mirrors the source system as much as possible.
Is that true? So the structure is based on how we see it in the source. Okay. And so often in the Data Vault world, we’re told no persistent staging, that the raw vault, the designed, modelled data, is actually the initial layer, right? So it becomes the bronze layer, and then maybe there’s a transient landing layer where data kind of turns up briefly, we do something with it, and we delete it, because the raw vault becomes the immutable, trusted representation of the source data over all time.
So did I read that right? That in your current architecture, you’re saying there is a form of source-specific structures that are persistent?
Patrick: This is also my, let’s call it, pet peeve about Data Vault around the globe. There are multiple definitions of the same thing, which you’ve found and I’ve found. I’d label some of it as blog-based Data Vault: people read blogs and then don’t know which Data Vault they’re talking about or describing, which confuses a lot of what’s been put out there.
The way I have defined it, or at least consulted on it, is yes, there’s data landed as it looks from the source, in a, let’s say, landing zone, bronze zone, or whatever. And it’s got a whole bunch of things in there, perhaps things that should not be in there and perhaps things that are superfluous to our business case.
But sometimes we don’t know. A business case comes in and says, I need this data, and you get more than what you asked for. That’s fine. It gets curated, and then it goes to the next layer. Now, I split landing from raw vault, and raw vault is where hubs, links and satellites live.
And hubs, links and satellites in Data Vault 2.0, at least, do reflect the source, but it’s the only place where you have applied some conformity. So satellites look exactly like the source, in fact, but they have been conformed into a true-change structure. Your source could be snapshots, could be incremental, could be whatever.
Could be semi-structured, could even be unstructured, but when it gets to the raw vault, we’ve taken what we needed, because not all data is of equal value, and structured it into a raw vault satellite, which then only tracks true changes. I could get snapshots for ten days in a row, but if nothing’s changed, then you’re not going to get new records loaded into that satellite.
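That true-change satellite load can be illustrated roughly like this, using a hash diff over the descriptive columns. The record shapes and column names are invented for the sketch, not taken from any specific implementation.

```python
import hashlib

def hashdiff(rec: dict, cols: list) -> str:
    """Hash of the descriptive payload only, used to detect real change."""
    payload = "|".join(str(rec.get(c, "")) for c in cols)
    return hashlib.sha256(payload.encode()).hexdigest()

def load_satellite(sat: dict, rec: dict, key: str, cols: list, ts: str) -> None:
    """Append a row only when the payload actually changed since the last row."""
    hd = hashdiff(rec, cols)
    history = sat.setdefault(rec[key], [])
    if not history or history[-1]["hashdiff"] != hd:
        history.append({"hashdiff": hd, "load_ts": ts,
                        **{c: rec.get(c) for c in cols}})

sat = {}
cols = ["name", "email"]
load_satellite(sat, {"customer_id": "C1", "name": "Ana", "email": "a@x"},
               "customer_id", cols, "day1")
load_satellite(sat, {"customer_id": "C1", "name": "Ana", "email": "a@x"},
               "customer_id", cols, "day2")  # identical snapshot: skipped
load_satellite(sat, {"customer_id": "C1", "name": "Ana", "email": "a@y"},
               "customer_id", cols, "day3")  # real change: appended
print(len(sat["C1"]))  # 2
```

Ten identical daily snapshots would still produce a single satellite row, which is exactly the "only true changes" behaviour Patrick describes.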
We also load link tables, raw vault link tables, which depict the business process we’re interested in getting from the source. And we have, of course, modeled the hub tables. I’ve seen some implementations where they’ve gone, oh yeah, we’ve modeled 300-plus hub tables. What are you doing?
That doesn’t reflect the enterprise vision of your business. How can you have 300 hub tables? You should have hub_account or hub_customer, something that represents enterprise vision. So hubs become, let’s say, your pin that holds all your data platforms, or your software landscape, together.
And the links and satellites, you can have as many as needed, really. The other reason why we want to split it is what I said earlier, that not all the data landed in the landing zone is of equal value. You can easily truncate that, purge it, and archive it if needed.
The auditable data is in your Data Vault, and it’s only tracking the true changes. It also gives you the opportunity to split your identifiers into their own satellite table, a PII satellite. Should a GDPR Article 17 request come along, the right to be forgotten, you only have to manage that satellite,
and you still have the aggregated analytical value coming from the rest of the Data Vault model structure. And I’ve written an article to suggest, only a suggestion, because these things require input from much higher above my pay grade as to what is identifiable for a person, account or whatever, where I suggested: okay, look, you don’t delete that record per se, you obfuscate it permanently.
So it’s the one place where I suggest you take the original record and scrub it completely, but don’t change the hash diff. Because if a source system happens to have that same person arrive again, you can detect that, and it will not get loaded. You record that detection and say, hey, hold on, that source system still has information about that person we shouldn’t have data about anymore.
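That obfuscate-but-keep-the-hash-diff idea can be sketched as follows. This is a toy illustration of the suggestion, with invented column names; it is not the code from Patrick's article.

```python
import hashlib

def hd(payload: str) -> str:
    return hashlib.sha256(payload.encode()).hexdigest()

# One PII satellite row: the payload will be scrubbed, but the original
# hashdiff is retained. Column names are hypothetical.
row = {"hashdiff": hd("Ana|a@x"), "name": "Ana", "email": "a@x"}

def forget(row: dict) -> None:
    """GDPR-style erasure: obfuscate the attributes, keep the hashdiff intact."""
    for col in ("name", "email"):
        row[col] = "***REDACTED***"

forget(row)

# If the source later re-sends the same person, the incoming hashdiff matches
# the retained one, so the record is NOT reloaded, and we can raise a flag
# that the source still holds data it should not.
incoming = hd("Ana|a@x")
still_in_source = incoming == row["hashdiff"]
print(still_in_source)  # True
```

The design choice is that the hash is one-way: keeping it preserves change detection without retaining any readable personal data.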
And that becomes even more intelligence on top of the Data Vault intelligence per se. The other thing I emphasize in the Medallion architecture, and it’s the way I deliver Data Vaults as well, is when you take that raw vault as hubs, links, and satellites, which is an interpretation, of course, where does the business vault come in?
I’ve seen wildly different interpretations of business vault in the wild. The one that I feel works really well is having it sparsely modelled. So if you need to transform that raw data into something that’s not available in the source system, or some intelligence that you feel needs to be in your Data Vault, you simply expand it with sat underscore bv underscore whatever, with those persisted attributes from the business rule you’ve developed.
And that inherits the same auditability as the rest of the raw vault. So you end up with hubs, links and satellites which are purely raw-vaulted and reflect the source, but you also end up with extensions of that raw vault, being business vault links and business vault satellites with that intelligence, sparsely modelled.
And you can then use that in things like PITs and bridges and information marts, and deliver what you want from it.
Shane: It’s interesting, because Vault is so well described, right? And there is a set of immutable patterns you can’t break. A hub only holds the key and some extended metadata, and never holds context; that always sits in a sat. A link is always a relationship between hubs, right? Ideally around a core business process.
There’s a bit of vault shaming out there, where the other patterns, the ones that aren’t so immutable, are opinion based and they’re valuable, but you get shamed if you don’t follow them. And it’s going to be interesting with Medallion architecture, when it starts becoming more opinionated,
whether we start seeing medallion shaming. So let’s just go through a couple of those, because these are some of the ones I struggled with when I started my Data Vault career, when I started using it.

Patrick: How long ago was that, by the way?

Shane: Oh, probably only about a decade, maybe 12 years. So I’m still a newbie.
I was working in teams where we implemented it; I wasn’t physically implementing the code myself. So one of the benefits of Vault is this concept of satellites, right? It is a pattern, an immutable pattern within Data Vault. And one of the benefits we get from it is change detection, right?
It gives us historized records over all time, all changes, and that is an immutable pattern within the sat.
Patrick: I like to use Roelant’s definition. He calls it only true changes.
Shane: And so we get SCD2 for free, basically, because that’s what the pattern implements, right? I would find very rare reasons, very rare, to have a context where we don’t implement that pattern. In fact, even then, if I’m not going to implement that pattern, I’d say use a different modeling technique.
Use something other than Vault.
Patrick: You mentioned activity schema earlier. I did do some experimentation and I wrote an article on how to do that in Data Vault, because it’s pretty similar in the sense that it breaks up raw data into transformed data, your business vault, so it layers very nicely.
It’s not strictly vault, but you can actually reuse the same sort of flow and architecture for it.
Shane: If I want to rack and stack a bunch of core business processes, customer orders product, customer pays for order, store ships product or order, customer returns product, and I want to see the time between, how long does it take between the order being placed and being paid, I want to figure out numbers that move, then Activity Schema racks and stacks that data in a way where I can answer that question really quickly.
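A rough sketch of that activity-schema style question, with an invented activity stream: because every row is just (entity, activity, timestamp), "time between order placed and order paid" becomes a simple first-occurrence lookup per customer.

```python
from datetime import datetime

# Hypothetical activity stream: one narrow table of (entity, activity, ts).
stream = [
    ("C1", "order_placed", datetime(2024, 1, 1, 9, 0)),
    ("C1", "order_paid",   datetime(2024, 1, 1, 9, 45)),
    ("C2", "order_placed", datetime(2024, 1, 2, 8, 0)),
]

def time_between(stream, entity, first, second):
    """Elapsed time between an entity's first occurrence of two activities."""
    def first_ts(activity):
        return next((ts for e, a, ts in stream
                     if e == entity and a == activity), None)
    t1, t2 = first_ts(first), first_ts(second)
    return (t2 - t1) if t1 and t2 else None

delta = time_between(stream, "C1", "order_placed", "order_paid")
print(delta)  # 0:45:00
```

The same function answers the question for any pair of activities, which is why the single-stream layout makes these "numbers that move" queries repeatable.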
Patrick: And repeatedly. Like any pattern, that’s the selling point of activity schema.
Shane: And then I can apply that to feed off links, I can apply that to feed off hubs and sats, right? And that was where we’ve been experimenting: do we say we’re only going to feed off one of those, or do we make it optional? Because when we have choice, it comes with a cost.
It’s the same as going from hubs and sats to a dim. It’s a repeatable pattern. So one of the benefits of these modeling techniques is you should be able to write small blocks of code that get reused time and time again, right? Like a YAML template, effectively.
So then we get to some that become more context-sensitive, right? So, source-specific vaults, right? This is where, effectively, the hubs are bound to a system. Let’s have an example where I’ve got Salesforce as a CRM, and I’ve got an e-commerce ordering system that has customer, right? So Salesforce holds customer, and the e-commerce store holds customer.
Now, ideally, the Vault pattern says conform them as a single hub, right? Grab the keys, rack and stack them in that hub, because it has value. And then each of those source systems can have a sat where the raw version of that data sits against it. And there is value in having this one hub that holds a bag of keys, effectively.
And yes, we have things like key collision and conformity of the keys and a whole lot of complexity that comes when we, you know, Grab data from two systems that’s not aligned, where the customer ID in Salesforce is not the customer ID in the e commerce store. But racking and stacking them into that hub has value.
However, if our context says there’s more value right now in implementing a source-specific vault pattern, then that’s okay. You’ve just got to justify why that pattern has more value than the default pattern recommended for vault.
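One way the key-collision problem mentioned here is commonly handled, sketched as an assumption rather than anyone’s actual implementation, is to optionally salt the hub hash key with the source system, so identical raw key values from unaligned systems don’t collapse into one business entity:

```python
import hashlib

def hub_hash_key(business_key: str, key_source: str = "") -> str:
    """Build a hub hash key. Leaving key_source empty gives the conformed
    hub behaviour (the same key value from any system lands on one hub row);
    supplying it gives the source-specific behaviour, so Salesforce
    customer '42' and e-commerce customer '42' stay distinct entities."""
    # Upper-case and delimit parts so 'AB'+'C' and 'A'+'BC' never collide.
    raw = "||".join(p.upper() for p in (key_source, business_key) if p)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

The trade-off Shane and Patrick discuss is exactly which of these two behaviours you default to, and why.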
Patrick: That’s a good point, because John Giles has written excellent articles on top-down modeling, which is exactly what we always go for. And your source systems will change over time as you mature; we’ve seen that a lot. So I agree that there is some source vault modeling involved.
Then the question comes down to where you solve that complexity. And I’ve seen, I’ve worked with, and I’ve also designed different ways of doing it, depending on how complex the problem is and how much baggage you want to carry. Because a former colleague, maybe you’ve worked with him or spoken with him, Noel Zieberson,
I love some of his one-line analogies. One that sticks out is: you have to pay the Pied Piper somewhere, right? And I think it’s also something of a challenge sometimes working with data engineers in particular, because data engineers focus on automation.
Why do we need to do some of this data vault modeling complexity, right? And for me, it’s this: you have a choice. Either you’re paying for that, let’s call it technical debt, up front, or you’re paying for it later. And the way I see it, the sooner you deal with that complexity, the cheaper it’s going to be for you.
Because the later you try to solve this in your, let’s say, Medallion architecture, the more the problem is going to proliferate across your business. And there’s a whole bunch of problems with doing that. One is that the whole modern data stack world is moving to consumption-based pricing.
You don’t want to repeat the same complex problem over and over again; you want to solve it once. The second problem, of course, is the inverse: if you don’t solve it up front, then you’re leaving it up to the users to solve. And who’s to say they’re going to solve it the same way? So now you’ve got multiple versions of the truth, right?
As you say, with those raw vault key complexities, solving them up front, you’ve got choices. The way I used to describe the best choice actually uses Medallion sort of terminology. I said first prize, second prize, and I didn’t even call the last one last prize.
I called it the wooden spoon. It’s just no good; you’re going to have to pay the complexity all the time. So first off, can we solve it in the source? There’s a 50/50 chance we can’t, because let’s say your source system is SAP or Salesforce or something, they’re not going to change the way they do things for you, because they’ve got a huge customer base of their own that they need to look after. Okay, so what’s the next choice? Pre-staging. Can we solve this stuff before it even gets to staging? It’s a possibility. How complex is that going to be, and do we have the budget to solve it now? Is there some pattern we have to build, or is it something we can solve temporarily?
Then the next option is, okay, we solve it in business vault. We’ve brought everything into raw vault, it looks like the source, but we solve it in business vault. So now there’s a cost, there’s consumption happening, because you’re persisting business vault structures, but that’s the complexity you now have to take on.
The last place, the wooden spoon of course, is leaving it to users. Because they’re going to take that same logic and run it every time, and it’s just going to explode. So ideally, that to me is one of the key points of doing data modeling properly: it’s going to be cheaper for you in the long run. You need to solve that complexity up front, because not only are you removing the cost of running the same query over and over again, you’re also simplifying the definition of those things you’ve described,
whether that’s in the business glossary, defining business terms, or modeling the business concept model.
Shane: For me, it’s about once you’ve made those decisions, articulating what they are so anybody else can understand them as much as possible. So I’ll give you an example. In your diagram, we talk about bronze, bronze being persistent staging, right? Matching source system data is there.
We talk about silver being the coherence layer, or what I call designed. We’re going to model, design, conform, change the data, so it’s fit for purpose. And then the consumer domain, or intelligence zone, or gold, is around how people access it and making it fit for their consumption.
And so I can look at your design and go: my guess is, if we were going to conform customer, if we had customer keys in different systems that are either based on the same rules or not, we are not going to conform it in bronze, right? We’re going to do no work to match those keys in bronze.
We’re going to do all that work in silver, and we’re going to expose that work in gold so nobody else has to do it. And then we go, okay, so now it’s very clear for me that conforming keys, single view of customer, is silver. Now we come to the next level of complexity, right? Are we going to have a single hub that has all the customer keys, or, for whatever reason, are we going to have source-specific hubs to begin with
and then do the conformed hub later? Good practice is the conformed hub, but there may be a reason that I want a Salesforce customer and an e-commerce customer to begin with. But I have to describe why I’ve picked that pattern, because it’s not the standard pattern. And I should be able to be challenged on that, to say: actually, given that context, that’s not the right pattern.
But once we know what it is, then we can change it in the future, though change has a cost. The next one is business vault, because business vault is the one that’s kicked my ass so many times over many years. Good practice says the pattern is: raw holds the data, and then we create additional business vault objects where we need to infer, conform, do change,
where we have to touch the raw data and do bad things to it so it’s fit for purpose. We do that as a business vault object, but it’s just an extension of the raw vault structures. Whereas sometimes, and in the past I did this, I would go and copy the data from raw to business, for a bunch of reasons.
Patrick: Okay.
Shane: Now, I will still do that sometimes if I have to, right? If there was a reason why taking a copy of raw and representing it as business vault with the inferred data attached, where that raw-to-business copy is a direct copy of most of the raw things, if for some context it had value to me, if that was the best way of getting the work done, and I can live with the duplication of data, and I can live with automating it to make sure that raw and business records that are one-to-one don’t get out of sync, then that’s okay, right?
But as you can see, I’ve just added a whole lot of costs: increased storage, increased complexity of synchronization, the chance that it’s going to get out of sync, automation of that movement over and above the inferred data. So there’s a whole lot of complexity I’m introducing to adopt that pattern.
But that’s okay, as long as I describe why I’m willing to incur that complexity. For you, you’ve got raw and business sitting in silver, and I think later on you go through business being inferred objects, right?
Patrick: Okay. So, a couple of things. I avoid duplication at all costs, and that sparsely modeled business vault is deliberate. One of the principles I like to instill in customers, let’s talk about business vault in particular, is: what are we doing? We’re capturing some soft rule or some change that can’t be solved in the source system. And I say this just so that I can plant a seed in a customer’s mind: in a perfect world, there would be no business vault. Everything would be solved by the source. Think about it that way. So you need to keep your business vault as small and tiny, the smallest footprint, as possible. But then, why do we have any of this conformity in the first place? Raw sources are not always going to have the data you need in the form that you want.
So we do have business vault. But does that mean, in my mind, that you replicate raw vault, as it maps from the source, where this attribute’s called this and that attribute’s called that, into something that’s now called something else because it conforms to the way you’ve defined things?
Couldn’t you just do that in a view? So in that architecture pattern I’ve got there, I’ve got raw and business vault in the same layer, because I’m only extending what I need. And why I’ve got the business vault there is because I want to say to consumers, which have their own layer, their own domain, the gold layer: pick and choose what you want.
And here are shared business rule outcomes. Because the other point I always want to make before I elaborate on that one is: I like to say to customers, look, also think about business vault as a separation of business rules and their outcomes. When you’re creating the business rules, they should be declarative, they should be idempotent,
meaning that if I rerun a rule over and over again, I will always get the same result, no matter what. By separating the business vault from the raw vault, I could implement any rule I want in any language I want. It doesn’t matter, as long as it’s still declarative. So if a rule seems to be better written, or preferred to be written, in Python, for example, it shouldn’t limit how I use the data vault in any way or form.
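As a sketch of what "declarative and idempotent" means in practice (the names and the 90-day threshold are invented for illustration): the rule depends only on its explicit inputs, never on wall-clock time or run order, so replaying it over the same raw vault data always produces the same business vault rows.

```python
from datetime import date, timedelta

def churn_flags(orders: list[dict], as_of: date) -> dict[str, bool]:
    """Hypothetical soft business rule: a customer is 'churn risk' if
    their latest order is more than 90 days before as_of. Note as_of is
    an explicit input, not datetime.now() -- that's what keeps the rule
    idempotent: same inputs, same output, on every rerun."""
    latest: dict[str, date] = {}
    for o in orders:
        cid = o["customer_id"]
        latest[cid] = max(latest.get(cid, o["order_date"]), o["order_date"])
    return {cid: (as_of - d) > timedelta(days=90) for cid, d in latest.items()}
```

The outcome of a rule like this is what gets mapped into business vault satellites, while the rule itself can live in whatever language suits it.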
The outcome of that would just be mapped to business vault links and satellites, because hub tables come from source. Business keys are not defined in the data vault; they are mapped into the data vault. And that just extends what I’ve got in the middle, silver, layer. Now, I have the intelligence zone with its own business vault there.
That’s because you could reuse the same patterns and the same philosophy to extend the integrated layer with your own satellites and links that maybe shouldn’t be shared with other domains. So you’ve extended it into your own consumer domain. Define them as your business vault per se, but in essence, it’s still extending the existing raw vault.
So it’s not redefining hub tables, not redefining any links or satellites. It’s just taking what’s shared and saying: I need to have my own intelligence solved in this way. And there are good reasons for it. Because this is a pattern that I presented at that customer, and they’ve taken what they liked and thrown away what they didn’t like, of course.
They liked this concept of extending what we call their own private business vault. Because, of course, our goal is a single version of the truth. But what if you have a department, let’s say treasury, where the data should not be shared? Or I have transformations that, if you have access to them, may even pollute what you’re trying to get out of the same integrated vault.
And that’s why we’ve extended it that way.
Shane: And that’s where it’s key, right? Because when I look at the diagram without the context, I’ve got two business vaults, and business vault is a thing, so that’s really weird. But what you’re now saying is we’re using the business vault structure, so we’re using a business vault modeling technique, which is the same as raw vault hubs, sats and links.
We’re now saying that it follows an only-new approach, so where we need to change or infer data, that is the only thing that goes in that vault. And that can be in the silver business vault or the gold business vault. But now the demarcation between those two patterns is who can do the work to create the objects in that vault, gold versus silver, and who can consume it.
So now it becomes about data, team design, operating model and security.
Patrick: This pattern was based on a customer who was doing a hybrid form of data mesh. Yeah. So they wanted people to own their, let’s say, copy of the data. It’s not really a copy, but their business case with the data. Absolutely. So you mentioned the word stream-aligned.
Absolutely. That’s exactly what we were going for.
Shane: It’s about explaining those patterns. So now I know when I do work in the shared business vault in silver, or when I do work in the private business vault in gold, because you’ve given me the rules, right? And then I know where to do that. So another example is where we rename fields, where we rename attributes.
Again, it’s one of those things where you should be very clear where you expect it to happen. Do we rename the fields in bronze? Well, with the architecture you have and the architecture we run, probably not. Not very often; we do every now and again, when the name of the field breaks the tool that we land and store the data in.
Patrick: You could simply do it as a view, right? Because we’ve got multiple consumers, and they may have their own preference for certain things, which is fine. When I worked with this customer, again, I learned a lot, because at that time I had heard of domain-driven design, but I hadn’t really spent any time actually studying what it means.
So when I joined them, of course you want to deliver and consult as best as possible and still make sense of what you’re delivering. So I read Eric Evans’ book, and the strategic design perspective in particular was excellent. It actually gave me a lot of material
to work with and at least deliver a vault that they were expecting. And one of the funky names they use in domain-driven design is the anti-corruption layer, and I love that name, because it implies that whatever happens in this layer does not corrupt the downstream area, which is all around microservices, right?
So actually, if we’re going to serve all these consumer domains, why don’t we create anti-corruption layers? You see the data this way, and you don’t need to persist it as physical structures. You just use those OLAP patterns of star joins and deliver it as a view, because it will be available.
If it starts to prove that it doesn’t perform, then we might use things like dynamic tables or something like that. But mashing those two concepts together sold it really well with the customer. Another example of where domain-driven design, or at least the terminology, the linguistics of it, helped us a lot is when I was describing the hub table as well.
We talked earlier about conforming keys and so on. The way this team was working, they described source models and source systems and their relationships to themselves. They talked about different patterns, like provider and consumer;
there are different patterns in how you treat a source system, because maybe it’s a legacy system and so on. But after reading that book, I said: guys, for the hub table we should define a relationship, which comes from the book as well, called the shared kernel. What does that imply?
It means that with this artifact out of our data vault, you can independently build the links and satellites you need to solve your use case, but that artifact is the one thing we need to share and manage, crucially, for the platform’s success. Because, like we discussed, you will have account keys that maybe come from a source system but don’t map to the account key that the business will actually refer to. They’re not going to care about a Salesforce ID or KSafe ID; who cares about those digits? I want to know what the account number is, right? So where do we solve that complexity?
How do we manage the potential collision if two systems come together and have the same key value, but are actually two different business entities? That’s why the shared kernel concept became a crucial thing for us to manage within the platform.
Shane: For me, I do conceptual modeling based around some Data Vault-y stuff, some core concept stuff, some who-does-what from BEAM, right? That’s how I think about it.
Patrick: I use some of them as well, by the way. So, some good stuff in there.
Shane: Good patterns, right? And so I talk about core business concepts: customer, supplier, employee. But often what we’ll find is weak concepts.
Something where we’ll see a hub, but it’s not really one that’s used everywhere. Or we’ll see an admin concept, you know, a status change on an invoice. It’s a weak concept. If we talk about a process: somebody enters the invoice, somebody reviews the invoice, somebody approves the invoice.
Those are admin processes, and they create weak concepts, but they’re still valuable for reporting. And what you’re kind of talking about is master concepts, right? So this idea of master data management coming in as a pattern, which says that core concept of customer
should actually be mastered once and reused everywhere.
Patrick: I have an upcoming blog on it, by the way.
Shane: Excellent. So, you know, single view, single list, all those kinds of behaviors. Let’s go back to that renaming, though, because there are many reasons we rename. We’ll rename so it fits the data storage platform that we use, if we have to.
We’ll rename so that it gives us more context. And we may rename the physical column, if it’s physical storage, or we may rename the metadata or the alias or the description, if that’s what we use. We may rename to fit our conceptual model, our physical model, our business model.
Finance calls that thing something different to HR, and we’re okay renaming them. The way you’ve described it, I believe you’re saying you’ll pretty much do that in gold, right? Because now you’re doing domain-specific naming.
Patrick: There is a sub-layer within the integration area called CAL, the Common Access Layer, so there might be some conformity applied there, if you wanted to create a view as an anti-corruption layer where, okay, we’ve conformed these columns to these names, but it doesn’t impact the physical columns. Because, of course, a benefit of doing this renaming separately is: let’s say the source evolves a column that happens to have the same name as what you’ve called something else, then what do you do? I like the idea of using the anti-corruption layer we discussed, because we might also evolve what our column names look like at some stage. And we can have as many anti-corruption layers as we want. It gives us that flexibility, and the data source underneath each anti-corruption layer doesn’t change.
Shane: What I talk about is principles, policies and patterns, right? And how they’re different. I’ll give you an example. You said that one of the things you do is avoid duplication at all costs. That is a principle: I am not going to duplicate data unless I have to.
But I guarantee every now and again you have to. You apply that principle as much as you can, and you apply the alternative when you’re forced to. A policy is something that, if I don’t follow it, I get fired. The data stored in bronze will match the source system.
If I go into bronze and I start modeling and conceptually modeling and master-data-ing, I’m breaking that policy, right? And ideally what we want is computational governance: we want to write code that tells me when somebody breaks that rule in that system.
Maybe I write some code, which is useful anyway, that checks the schema of that table in bronze against the schema of the table we got given. If they don’t match, somebody’s inferring, conforming, doing bad shit in there. Flag it, send them an alert so they can fix it, right? But then we get things like renaming, and renaming is something that we actually use multiple times in multiple different places for multiple different reasons.
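A computational-governance check like the one Shane describes can be as small as a schema diff. This is a hypothetical sketch; in practice you’d pull both column lists from your platform’s information schema and wire the result into alerting.

```python
def schema_drift(bronze_cols: dict[str, str], source_cols: dict[str, str]) -> list[str]:
    """Compare the column-name -> type mapping of a bronze table against
    the source extract and return human-readable violations to alert on.
    An empty list means bronze still matches source, i.e. the policy holds."""
    issues = []
    for col, typ in source_cols.items():
        if col not in bronze_cols:
            issues.append(f"missing column in bronze: {col}")
        elif bronze_cols[col] != typ:
            issues.append(f"type changed for {col}: {typ} -> {bronze_cols[col]}")
    for col in bronze_cols.keys() - source_cols.keys():
        issues.append(f"unexpected column added in bronze: {col}")
    return issues
```

Run on a schedule, a check like this turns the "bronze matches source" policy from a document into an enforced rule.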
So our principle could be: rename as few times as possible, maybe. We’re probably not going to have a policy around renaming.
Patrick: I think it’s going to happen a lot. Have you seen SAP columns?
Shane: We’re not going to have a policy of only rename here because that’s just bullshit and you can’t enforce it because it won’t be true.
What we now have is renaming patterns. How do we describe those? Okay, so one of the things we want to do is rename columns or fields to give them a business term so that they’re available to be consumed by the user. Where is our pattern for that? Do we ideally want to do it in silver and surface it all the way through?
Do we ideally want to do it in gold? If we’re using Power BI versus Tableau versus Dataiku, does it matter? Does the consumption tool, the last-mile tool, affect how we rename? There’s a whole lot of decisions. So now what we have is a repository of patterns for renaming, and you can go: well, this is the problem I’ve got, and the pattern that most people use
is here. And what we know is humans are good citizens nine times out of ten, but they’re also lazy, in a good way, right? If I can find a pattern that somebody else has implemented, that has code, and I can reuse it, I’m going to use it, because it makes my job so much easier and I probably won’t get yelled at.
Renaming is one of those tasks that actually becomes fraught, and therefore warrants a large pattern base. The other one is data quality rules, the ability to clean data. That’s another core engineering practice applied in multiple different places across all the layers. So we just treat those as a set of patterns.
Do you normally put metrics?
Patrick: Metrics is a broad term. What do you mean?
Shane: Formulas where I calculate and infer a number, right? A divided by B, A plus B, percentage of change over time.
Patrick: It depends. If it’s something that’s repeatable, idempotent, declarative, it should be a logical asset in my mind, and the outcome is an extension of raw vault: a business vault satellite or link. I’ve got a few examples off the top of my head, but one of the most complex was an example I worked on with a customer before I even joined Snowflake.
This particular customer is in the home loan market and doesn’t see value, or doesn’t make a lot of money, in credit cards as a business. But by regulation, and I may be corrected here, when you offer home loans, you must also offer a credit card product.
The business said: we’re not going to spend a lot of money on automation software for credit cards, but we need the core things it has to do. It needs to be able to stand up to changing credit cards, as in a card gets lost or stolen and a new card is issued. And you can have a single, isolated card,
or, if a customer wants secondary cards to give to his family, it needs to support that too. So it needs to support those two patterns. They went to market and selected software that does that. Sounds great, solves the whole credit card thing, they’re in compliance.
However, more compliance came out later. One requirement is being able to say: what is the credit card exposure for an account? Not for a card, but for an account, because they need to report on the account to the regulators. But the source system has no concept of an account.
It only tracks cards. So what do we do? For this particular use case, the answer was right in front of us. The very first card issued to any customer is their account number. That never changes: whenever a card is replaced, that lineage, and all the cards associated with that lineage, is grouped into that account.
So the first card issued to a customer is the account. That’s how the source was modeled; the link table is purely on cards. But in our business vault, we needed to represent it the way it looks to the business. So we built a pretty complex business rule implementation.
It did recursion in the SQL, and the output was a table that said: okay, this is the first card, that’s our account number. And then we populated what’s not really a same-as link, but it looked like one, because it’s obviously referring to the same hub table, but one of the cards was assigned as the account number.
And that was built as part of the integration zone, because the rest of the business would benefit from that complexity we solved up front, right? So the renaming and all the business rules had to be in a zone that’s going to be shared by everyone. What they call it, whether it’s still cards or accounts, is really up to the view that renames it, which is, again, our anti-corruption layer.
Back then I didn’t even know much about anti-corruption layers, so we just called it a view.
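The "first card issued is the account" rule can be sketched without SQL recursion; here each card optionally points at the card it replaced, and walking every lineage back to its root yields the account number. The structure and names are assumptions for illustration, not the customer’s actual model.

```python
from typing import Optional

def account_numbers(replaces: dict[str, Optional[str]]) -> dict[str, str]:
    """Map every card to its account number, defined as the first card
    in its replacement lineage. replaces[card] is the card it replaced,
    or None if it was an original issue (and therefore the account)."""
    def root(card: str) -> str:
        prev = replaces.get(card)
        return card if prev is None else root(prev)
    return {card: root(card) for card in replaces}
```

The output is exactly the card-to-account grouping Patrick describes persisting as a same-as-style link in business vault.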
Shane: What I do is talk about facts, measures and metrics, because I want to be able to teach people what the difference is, and the fact word is the best I’ve got, but it does cause confusion for data people. So when I talk about a fact, that is a number that’s come from a system and that’s immutable, so quantity, amount. They are facts because they were given to us, and that is the fact of the number. We can infer something later, but that’s the fact. A measure is the things we do to it: we sum it, we average it, we count it. There are all those aggregation types, right?
Because then I can measure average amount, total quantity. And then metrics, for me, are formulas. I teach people it’s like Excel: this divided by that, that times that, that over that. Those metrics are where we’re inferring something based on a bunch of facts or a bunch of measures,
or a combination of both. So once I give them that pattern semantics, then it’s: okay, where are they going to be done? And those ones, again, are like renaming and data quality; they get done all over the place to a degree. So there’s value in having those metrics, those A-divided-by-Bs, late,
in the gold layer, because then we can dynamically calculate them based on the context of the user, right? If I go average amount by product versus average amount by channel.
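Shane’s fact / measure / metric split can be shown in a few lines (illustrative names only): the amount is the fact, sum and count are measures, and the average is a metric computed late so the grouping context, product versus channel, stays the user’s choice.

```python
def metric_avg_amount(rows: list[dict], by: str) -> dict[str, float]:
    """Average amount grouped by a caller-chosen context column."""
    sums: dict[str, float] = {}
    counts: dict[str, int] = {}
    for r in rows:
        key = r[by]
        sums[key] = sums.get(key, 0.0) + r["amount"]   # measure: sum of a fact
        counts[key] = counts.get(key, 0) + 1           # measure: count
    return {k: sums[k] / counts[k] for k in sums}      # metric: A divided by B
```

Because the formula is evaluated at query time, the same facts serve average-by-product and average-by-channel without persisting either result.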
Patrick: They’re also not super complex and may not require an auditable backend, so they maybe don’t need to be saved in a physical business vault satellite or business link, right?
Shane: But it could actually be a satellite table in business vault that’s virtualized.
Patrick: It can be. And that’s actually a good point, because in my consultations, even recently, I think just last week, the architect was saying: we’ll have virtual business vault as much as possible. Okay, maybe that’s a principle, maybe that’s a policy.
I don’t know, but think about it this way. If you are building a view over raw vault and you’re applying these calculations, what you’re saying is that the calculation is true for the whole history of that raw vault satellite. Now, if that rule evolves, if it changes, what do you do then? Especially if somebody is trying to query a historical value
from two years ago. Does that business rule outcome still hold, right? What’s the alternative? Are you going to start putting hard dates into your view definition? Now you’re starting to build technical debt into the views themselves. And every time you run a query against that satellite, you’re recalculating all of it.
Shane: Or are you going to build something really cool, where effectively, when you hit that view, it’s parametrized? That view is an aggregated view of a bunch of other views, and it’s basically got a query path
which says: hey, I’m querying this at this time.
So it applies the business logic that was valid as at that time against the data that was available at that time. But then, hey, let’s go one step further: let’s let them choose the period of time. So I want to see the data at that period of time, but I want to see it based on a business rule as at now, right?
Now, what we’re starting to do is we’ve started to build a massive amount of complexity.
Patrick: Absolutely.
Shane: But the question is: is it needed?
Patrick: And what happens if you want to change one of those views in the stack?
Shane: Yep. What breaks? And then what happens when we put a trillion rows in there every day? Or a billion rows.
At that stage, we’re going to want to materialize it, physicalize it, right? We’re going to want the machine, the expensive technology we’re using, to take care of some of that response time.
These are all the things that happen with patterns. They’re not immutable.
Patrick: I’m not saying I’m outlawing views as business vault artifacts. What I’m saying is you need to decide how you define what is physicalized and what’s a view. If it’s something cheap, this divided by that, and you’re okay with a view, go for it. If you don’t need historical tracking for all time for this value,
don’t do it. But if it’s complex, like the business rule I described, which involves recursion, calculating what the account is, and has to deal with new accounts and existing cards, whether they’re in the same account and so on, then you don’t want to be repeating that all the time.
Shane: And so, to answer your question, virtualizing the business vault is typically a principle. Ideally, we want to virtualize the business vault as much as possible; we want views, that’s our principle. And then we have patterns that say: here’s how you create a view in business vault, and here’s how you create a physical table when that view doesn’t perform or doesn’t deal with the context of the logic.
Because now what we’re saying is it’s a principle, not a policy. A policy says: if you ever physicalize business vault, you get fired. It’s immutable; it’s a policy. That’s how I think about it, because otherwise we get too dictatorial when we know it’s not going to scale, when we know there are always going to be use cases that break the majority of our rules, because they always happen.
And that’s okay, but we don’t want them to happen by default, right? We don’t want everybody writing their own dbt scripts that have no conformity across the organization, because when I end up with 5,000 of them, I’ve got a problem. Time-wise, we just need to close it out. So one of the things that’s interesting for me
is semantic language. Because you’ve worked in the data domain for a long time, like me, you switch terms interchangeably. You talked about landing, you talked about producer, you talked about intelligence, consumer, gold, silver.
Patrick: It’s the world we live in, mate.
Shane: And so the key thing we’ve got to remember as data people is that we know we can switch them in and out.
And because we’re talking patterns, I can keep up when you change the term for what you’re talking about. I could still see that diagram and go: oh yeah, he’s talking about that one. But people that haven’t been in the data world for a long time, or aren’t as experienced, can’t, right? So when we talk about these patterns, Data Vault, Medallion, whatever, we want to see if we can standardize our language as much as possible, to help other people come on the journey with us.
And then the other thing we want to do is use analogies. You used some great ones, including some from Noel. The ones I tend to use for Medallion, for layered architectures: I talk about kitchens. I talk about a storeroom where data comes in, a kitchen where we do all the heavy work, and then a servery, a consumption restaurant, where people use it.
And so for me, Medallion architecture, like you said, has been around for ages; we had layers. And I tend to model my data architectures like you do. I have a storeroom where I just land the data, and it matches source, and I rack and stack it and make it immutable. I have a kitchen where I go and design it and change it and make it fit for purpose, and do a whole lot of bad things to it because the source system didn’t do it properly.
Then I have a place where we can consume it, where I apply different modeling patterns and other patterns, and that’s where everybody else goes and gets it. And then I have some policies, right? Does everybody have to go to gold? Do they have to go to the servery, or are they allowed in the kitchen?
Patrick: With this New Zealand based customer, we actually said you don’t have to model everything into Data Vault. If your use case needs an answer now, go straight to consumption if you need to. But if you want to inherit the maturity that we’re building into the platform, then you should consider modeling it into the data vault.
Shane: And it’s context, right? If you want to go build it yourself and maintain it yourself, then feel free. We still needed them to have the conversation about whether they could see gold only, or silver, or bronze, as a policy, right? Who’s allowed to go into which layer? And to enforce that policy,
we need to know what’s in each layer and what isn’t.
If I’m a machine learning expert, potentially I want to hit bronze, because I want all the raw behavioral data; that’s what my ML model needs. I probably can’t use the consumer data in gold, because you’ve done a whole lot of stuff to it that will affect my model.
Potentially,
Patrick: Essentially, because I’ve had customers who want access to both, because the outcome they wanted to build was a business vault, right? In that diagram, we’ve got an area called the lab. Either each domain has their own lab in the intelligence area to do their own thing, or we had a shared lab zone. But ideally, once you’ve built this engine or rule, the outcome should meet the same business rule criteria. The principle is you shouldn’t run the same thing and get two different results, if that makes sense,
unless you’re an LLM.
But the outcome should be repeatable, so you can say, all right, we can either extend the shared business vault or shared data vault with our extensions, or this needs to be a private thing. Actually, one of the outcomes was a metric vault as well. Not metrics as in the Splunk type of logging stuff, but a metric store that’s used downstream for inference.
Shane: It’s a bunch of policies. You can say the policy is: a machine learning engineer is allowed to hit raw, allowed to hit bronze, but their results have to be persisted in silver or gold. Two policies, a policy of access and a policy of where you land that data, so it’s managed in the future.
We want to write rules, we want to write code, that tells us when that policy has been broken, right? So we can go and have a conversation and say, okay, that policy was broken. It’s going to be for one of two reasons. Either you didn’t follow the rules, so I’m going to give you a fine, or whatever the corporate culture does for telling somebody off. Or you’ve found a context that breaks our policies and this was the only way you could do it. What was the context? What are we missing? Because it means we’re missing something, right? We’re missing a use case, we’re missing a behavior, we’re missing a pattern. And then we update our pattern library and we update our policies.
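The two policies Shane describes, who may read which layer and where results must be persisted, plus the breach-detection code he mentions, can be sketched as a small lookup-and-audit routine. This is a minimal illustrative sketch, not anything from the episode: the role names, layer names, and function names are all hypothetical.

```python
# Hypothetical sketch of the two policies from the conversation:
# (1) which roles may read which Medallion layers,
# (2) which layers a role's results may be persisted (written) to.
# Role and layer names are illustrative assumptions, not from the episode.

READ_POLICY = {
    "ml_engineer": {"bronze", "silver"},   # ML wants raw behavioral data
    "analyst": {"gold"},                   # analysts stay in the servery
}
WRITE_POLICY = {
    "ml_engineer": {"silver", "gold"},     # results must land somewhere managed
    "analyst": {"gold"},
}

def check_access(role: str, layer: str, action: str) -> bool:
    """Return True if the action is allowed under the current policy."""
    policy = READ_POLICY if action == "read" else WRITE_POLICY
    return layer in policy.get(role, set())

def audit(events):
    """Yield events that break policy, so the team can ask:
    was a rule ignored, or is a pattern missing from the library?"""
    for role, layer, action in events:
        if not check_access(role, layer, action):
            yield (role, layer, action)
```

The point of `audit` yielding breaches rather than blocking them mirrors the conversation: a breach triggers either enforcement or an update to the pattern library and policies.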
Patrick: We definitely set principles up front, but we also said, as a principle on its own, that these can evolve. Think of them as guidelines instead. If you do have a use case that doesn’t fit our patterns, then what is it? Because it could come from different angles.
It could be an actual extension of the existing patterns. Or maybe you’ve interpreted something in a way where, no, we’ve already solved for it, but everyone’s human, hopefully. Everyone comes in with their own perspective from what they’ve done in the past, their experiences. And perhaps it does fit an existing pattern;
it just hasn’t been interpreted that way.
Shane: Yeah, and writing up patterns so somebody else can read them and understand what they are, the contexts in which they’re valuable, and how to apply them is bloody hard. Even just writing that article you wrote, about how you apply Data Vault models to a Medallion layered architecture, right?
The number of contexts that affect those decisions is massive, right? You talk about batch and streaming, and we haven’t even talked about where that happens. There’s a whole lot of lenses. And that’s why patterns, and clear articulation of the patterns you’re applying, are useful.
Look, we’re out of time, so thank you for that. If people want to read more of what you’ve written, or get hold of you, what’s the best way to find you?
Patrick: LinkedIn is the best. I respond on LinkedIn, if not immediately then within a few days. I still continuously push out articles and such, usually based not just on what I’ve observed but on experience with customers around the world. Because I’m
kind of filling that old Kent role, in terms of advising on how to build a data vault. I’m not prescribing a hundred percent that you must do it this way or that way, but at least giving guidance. So the best way, I think, is LinkedIn.
Shane: And then you write primarily on Medium, so people can go and find your Medium articles where you publish. And again, a shout-out: you’ve got a book on Amazon, for people who want to see technical implementations of Vault and how to take the concepts and
actually apply them.
Patrick: very technical. Yes. Yeah.
Shane: But it needed to be written, because there were lots of books on how to model it, but not how to implement it, so it filled
that gap.
Excellent. Thank you for your time. It’s been great, and I hope everybody has a simply magical day.