The patterns of Data Vault with Hans Hultgren
In this episode of the AgileData Podcast, we are joined by Hans Hultgren to discuss the patterns of data vault. Data vault is a data modeling technique that helps organisations to manage their data more effectively. Hans begins by explaining the basic concepts of data vault. He then goes on to discuss the different patterns that can be used in data vault, such as the fact that there are three different types of tables: hub tables, link tables, and satellite tables. Hans also discusses the benefits of using data vault, such as the fact that it can help to improve data quality, reduce data redundancy, and make it easier to perform data analysis.
Here are some of the key takeaways from the episode:
Automation: Data Vault is designed at its core for automation. It allows you to write code once and use it many times, hardening the code over time.
Immutability: Data becomes immutable in the Data Vault. Any occurrence in the data is stored forever, making the data highly reliable and history-rich.
Adaptability: Data Vault can adapt to changes. If something needs to change, a new link is created, or if the data changes, the Data Vault absorbs the change.
Re-engineering: New sats can be added with very little change to the rest of the model. New links can be created, and history can be repopulated from previous sats into new ones.
Incremental Approach: Data Vault allows for an incremental approach, delivering small bits of value fast. This aligns with agile patterns.
Dealing with Master Data: Data Vault handles master data, enabling a single view of customer data, for example. It doesn’t make the matching process easier, but once matched, storing and using the data is simplified.
Refactoring: You can always refactor with minimal re-engineering required. When building a new sat or refactoring links, the old ones are kept and a data migration is performed if needed.
Lower Total Cost of Ownership: Data Vault may initially seem to have a higher total cost of ownership because of its many tables, but both the initial and the total cost of ownership are lower. This is because you can start with one small component that, once finished, does not have to be re-engineered. Future additions are also easy to make, which keeps the process efficient and cost-effective in the long run.
If you are looking for a data modeling technique that can help you to improve the quality, reduce the redundancy, and make it easier to analyse your data, then data vault is a good option to consider.
Guests: Hans Hultgren, Shane Gibson
Read along you will
Shane: Welcome to the Agile Data Podcast. I’m Shane Gibson.
Hans: Hans Hultgren here.
Shane: Hey, Hans. Thanks for coming on the show. Today we are gonna talk about this thing that I’ve had a love of for quite a while, the concept of the data vault. Before we rip into that, why don’t you give the audience a bit of background about who you are, how you fell into this world of data vault, and what you do.
Hans: Awesome. Will do, Shane. Great to be on with you. We’ve known each other for quite a long time and I’m excited to be here with you today. So thanks for that. Yeah, I’m Hans Hultgren. My background is primarily in the IT world, starting probably the better part of 40 years ago. I was teaching at the University of Denver here for quite some time.
Started a program back in the day, the first Master of Science in data warehousing and business intelligence that we had here in the US. And then from there started with Vault in the late nineties, ’99, 2000. And we started a company, now Genesee Academy, primarily focused on training and certification in these patterns and methods around data vault.
I’ve been doing that now, obviously, for the better part of 20 years. That’s of course how I got to know Shane and spent some time down on your beautiful island.
Shane: I’ve been lucky enough to visit you in Denver and go up to Red Rocks and the amphitheater up there, and Evergreen. And I gotta say that part of the world’s got a special place in my heart now. A bit like Vault, there’s a few patterns in the world that I stumbled across when I was a consultant and had a consulting company.
With a few of them, I feel like an ad on TV over here years ago for a razor, I think it was a Gillette, and it was the owner of the company. He said, “I love this razor so much, I bought the company.” And for me, Vault is one of those things that I stumbled across.
I started to use it. I fell in love. And my consulting company used it a lot. We used to do a lot of work with you around the training to help people in New Zealand understand how to use this thing of beauty. And for our startup, Vault’s under the covers: in the backend of what we do for AgileData, Data Vault is fundamentally the data modeling technique we use at the core.
And why is that? Look, I’m not a data modeler, I’m not a technical person. My co-founder’s the techie. I can understand patterns, but I never did data modeling early in my career. I always left it to somebody else that did Kimball, lots of those horrible ERD diagrams that I love to hate.
But when I saw Vault, I saw an approach that was based on patterns. I could understand those patterns, I could apply them, and I could explain them. And for me, that’s why it enabled us to democratize data modeling out to a whole lot of people rather than what we had at the time, which was one specialist that sat in the cupboard, typically had a beard, scratched the beard for six months, and came out with this beautiful entity model for the whole organization, which we could never implement.
I’m sold and have been for years. What I’m seeing in the world is a lot of people don’t know what Vault is, don’t understand how it works, and don’t understand why it might be suitable given their context. So let’s go back to basics. When we talk about data vault as a modeling technique, what’s the core of the patterns?
What’s it about? How do you describe it when you teach somebody?
Hans: Yeah, so that’s a really good point. I think that I’m probably as surprised as you are that 20 plus years later there isn’t more of a presence of these things out there. I think part of it is honestly that it is as a starting point, a little bit counterintuitive.
So what we’ve been taught for the longest time, really if we go back to object-oriented, is this idea of encapsulation. When we talk about any kind of a concept, and even 40 years ago we were talking about person, place, thing, event, no matter what it is we model that as one cohesive concept, which then almost naturally moves to the form of one kind of a table.
And therein lies the rub, right? Because we know the reasons why we do that: it becomes an entity in third normal form, it becomes a dimension in dimensional modeling. Because it’s important to maintain the consistency and integrity of that concept, especially in an enterprise environment.
It’s important to keep that integrity by having it in one place under one key. And of course anyone who’s done any modeling understands that one of the predominant features of a modeling paradigm is this key dependency. And whether you’re technical or it’s business oriented, really what you’re saying is it needs to work by us having descriptors and relationships that relate to the instance of something that’s identifiable.
That’s pretty much it. That defines the heart of all modeling. So in a way, what we’ve done is we’ve funneled our thinking into this singular concept model, this encapsulated concept model, this object. And what we’re saying here is basically in our pattern, if you take the simplest approach to it, the pattern says: yes,
keep that consistency, that key dependency. Keep it all together. You’re only describing one instance of one record. Keep that together. However, allow yourself to have multiple tables. And this is logical and physical, right? Multiple tables that can then be added on later, so you can build incrementally, or they can be changed without impacting all the rest of it.
So this re-engineering issue becomes a non-starter now, because we can just add things on. And of course, if we do need to modify something, we’re only modifying a very small part. And so when you saw it, like you said, you fell in love. I dove into this head over heels, and the same thing.
We fell in love. And then when you go around the globe, you find that there’s other people that have arrived at the same logic and that it makes sense to do this. And there’s discernible pattern rules that are repeatable and predictable. It makes the whole thing work great.
I would say, circling back to your point, it hasn’t seen probably the adoption we would imagine it would have by now. However, if you are like me, when you do run a class, or if you’re in a conversation, or you do have a few hours with data engineers and you have that conversation, you show what you’re doing.
At that point, I would say the vast majority of those people look at it and can say, yeah, of course. Wow. That’s the way things should be.
Shane: And I think that’s one thing that took me a while to realize is that data vault is a form of modeling that we can articulate as ensemble modeling. And there are other forms of ensemble modeling in the world, right?
Hans: Yes, absolutely.
And I gotta tell you, one of the benefits was, when I was working on writing my book, I had enough time to really focus on writing it. But at the time I was living in Stockholm and I could take the time to travel and visit with people that were doing similar things.
And so at that time, I didn’t have to travel far to meet with Lars Rönnbäck, with Anchor Modeling, because of course he’s outta Stockholm and worked in a university there. So I got to meet with him, and Dani Schnider with Trivadis, who had Head & Version modeling. And of course things like Focal Point, and there’s various iterations of what people refer to as 6NF, because again, that can be interpreted in multiple ways.
You could look at temporal modeling from an agile temporal modeling perspective. So there were definitely a good number of other people doing this. Being part of our consortium, when we got these people together, I think what you would say is we embraced the commonalities
of what was similar about these patterns and found that truly they’re probably 85, 90% the same, at least from the beginning point of what we’re doing. And so in that regard, that is what we refer to as ensemble modeling, because it’s the essence of this: break things apart into multiple components, but retain that consistency through key dependency.
We, we say unified decomposition as a term for it, but that really is what drives pretty much all of the ensemble methods.
Shane: Okay, so let’s jump into explaining some of those patterns for people who haven’t heard them before. One of the reasons I loved it was as an implementation approach: we could build code libraries that applied the pattern 101 times from that same piece of code.
And that was really based around the idea that in the core Vault, there are three objects that we care about from a physical modeling point of view: hubs, sats, and links. So hubs, satellites, and links. And the reason that’s important is that in a basic Vault model, we only have six bits of code: create hub,
populate hub, create sat, populate sat, create link, populate link. Now, there’s lots of variations we can talk about, but we can effectively automate the building of a vault based on those six bits of code. We pass some parameters to the create hub code, it creates a hub. We pass some parameters to the load hub code, it will load that hub.
And we can do that a hundred or a thousand times off those six bits of code. And those bits of code become incredibly hardened. We use them so often that they become bulletproof, which is what we want in the agile DataOps type of CI/CD world. So if we think about that, let’s start off with the first two.
Let’s start off with hub and sat. So if we think about decomposition, we have a core business concept of customer and there’s some things we know about the customer. We know their name, we know their date of birth.
So we’ve got some things that we can use to describe that customer. When we’re modeling in Vault, how do we do decomposition for something like that simple core business concept?
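The six parameterized bits of code Shane lists (create hub, populate hub, create sat, populate sat, create link, populate link) could be sketched roughly as below. This is a minimal illustration, not the AgileData implementation: the use of Python with the standard-library `sqlite3` module, the table naming scheme (`hub_<concept>`, `sat_<concept>_<name>`, `link_<name>`), and the `load_ts` and `record_source` columns are all assumptions made for the example.

```python
import sqlite3

# Assumed conventions, not from the episode: table names hub_<concept>,
# sat_<concept>_<name>, link_<name>; TEXT columns; load_ts / record_source
# as technical columns. Identifiers are trusted parameters, not user input.

def create_hub(con, concept):
    con.execute(
        f"CREATE TABLE IF NOT EXISTS hub_{concept} ("
        f"{concept}_key TEXT PRIMARY KEY, load_ts TEXT, record_source TEXT)"
    )

def populate_hub(con, concept, keys, load_ts, source):
    # Only keys not seen before are inserted; hub rows are never updated or deleted.
    con.executemany(
        f"INSERT OR IGNORE INTO hub_{concept} VALUES (?, ?, ?)",
        [(key, load_ts, source) for key in keys],
    )

def create_sat(con, concept, name, attributes):
    cols = ", ".join(f"{a} TEXT" for a in attributes)
    con.execute(
        f"CREATE TABLE IF NOT EXISTS sat_{concept}_{name} ("
        f"{concept}_key TEXT, load_ts TEXT, {cols}, "
        f"PRIMARY KEY ({concept}_key, load_ts))"
    )

def populate_sat(con, concept, name, attributes, rows, load_ts):
    # Simplified: stacks every incoming row; change detection is left out here.
    marks = ", ".join("?" * (len(attributes) + 2))
    con.executemany(
        f"INSERT INTO sat_{concept}_{name} VALUES ({marks})",
        [(r[f"{concept}_key"], load_ts, *(r[a] for a in attributes)) for r in rows],
    )

def create_link(con, name, concepts):
    cols = ", ".join(f"{c}_key TEXT" for c in concepts)
    keys = ", ".join(f"{c}_key" for c in concepts)
    con.execute(
        f"CREATE TABLE IF NOT EXISTS link_{name} ("
        f"{cols}, load_ts TEXT, PRIMARY KEY ({keys}))"
    )

def populate_link(con, name, concepts, rows, load_ts):
    # A key combination is recorded once; repeats are ignored.
    marks = ", ".join("?" * (len(concepts) + 1))
    con.executemany(
        f"INSERT OR IGNORE INTO link_{name} VALUES ({marks})",
        [(*(r[f"{c}_key"] for c in concepts), load_ts) for r in rows],
    )

# The same six functions serve every hub, sat, and link; only parameters change.
con = sqlite3.connect(":memory:")
create_hub(con, "customer")
populate_hub(con, "customer", ["C1", "C2"], "2024-01-01", "hubspot")
populate_hub(con, "customer", ["C2", "C3"], "2024-01-02", "hubspot")  # C2 already known
create_sat(con, "customer", "main", ["name", "date_of_birth"])
populate_sat(con, "customer", "main", ["name", "date_of_birth"],
             [{"customer_key": "C1", "name": "Ann", "date_of_birth": "1990-05-01"}],
             "2024-01-01")
create_hub(con, "order")
create_link(con, "customer_order", ["customer", "order"])
order_rows = [{"customer_key": "C1", "order_key": "O1"}]
populate_link(con, "customer_order", ["customer", "order"], order_rows, "2024-01-01")
populate_link(con, "customer_order", ["customer", "order"], order_rows, "2024-01-02")
```

Because every table is produced by the same handful of functions, a fix to one function applies to every hub, sat, or link built from it, which is the hardening effect described above.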
Hans: So that’s a really good point. And I think we’re taking, like you said, customer to start with. And you would start really from any kind of a concept modeling perspective. You’d say, first of all, identify what it is, and like you said, in this case we say customer. Important.
Of course it’s business driven. We got a good one to work with. And now, like you say, you have a date of birth, you have a phone number, you have an address, something else that helps describe that customer. So in typical modeling, any kind of modeling really, again, dimensions and entities,
if you’re in 3NF or just concept modeling from a logical perspective, you would take the attribute sets and tuck ’em under that key for customer in a table. And what we’re doing here is we’re saying, yeah, keep that key dependency, but put ’em in a separate table. And really you can define as many tables as you want to have those attributes in them.
And so the hub, what we refer to as a hub, would just be the identifiable instance, the key of customer. So first of all, a small challenge, especially in enterprise warehousing, is: let’s establish a unique key that can be used throughout the enterprise, recognizing that it’s gonna be referenced by, and data will come from, multiple different divisions, departments, and functions of the organization, all towards a central concept.
So establishing that enterprise-wide key is the first challenge. And to be clear, that’s the exact same challenge that we have whether this is dimensional modeling or 3NF modeling or any form of modeling. You have to start off with square one,
which is, you have to have a uniquely identifiable key for that concept. And once you have that, you’re good. Now, whatever that is, however you got to it, that would become your hub. But there’s nothing else in there. It’s just that instance, so that, that becomes like this core, firm underpinning,
that it’s there forever. It, there’s nothing about it that can change. It exists. If you want to tell me later, it doesn’t exist. That would be context, that would be because it did exist at one time. If you’re saying it no longer exists, now you’re talking about history and context, but that hub just lives on its own as only basically one attribute.
Of course, as you mentioned, with loading patterns, we have a technical wrapper that goes around it just to say, Here’s how we identify it. Here’s where it came from, here’s when it came. Then the satellites would take all or any of those parts, of those attributes and form separate tables whose key is the hub.
So this way it’s still the same dependency, because the key of any satellite is just the key of the hub. That’s it. Now, because it’s a temporal, history-tracking environment, we’re also gonna couple that with a date/timestamp to make it a two-part key, in the same way we would for a type two dimension or historically tracking 3NF. So it’s a very simple pattern.
But as we all know, the way to interpret that is to say concurrently active records could only be one. That’s the whole theory behind a type two dimension as well: concurrently active, there’s only one record, but for a different time slice there could exist another record for the same key.
So we’re just stacking the history in there. That’s how satellites work. And typically what you end up with is 3, 5, 7, maybe eight or 10 satellites around a hub. And they’re logically grouped sets of attributes. So maybe all the things that rate or qualify your customer might be in one, stuff about geographic location and such might be in another, and your primary or main one, which may have the date of birth and phone numbers and things, may be in another.
Or again, they could all be in separate satellites, one attribute per sat, like Anchor, or it could be mostly just everything in one sat. Of course, you gain more of the benefit from this pattern if you logically split them, because then, as you’re building incrementally, if a new function of the business starts to deliver additional context about a customer,
it can put that additional context in a new satellite without any re-engineering. So you benefit greatly from a logical separation of the attributes.
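The structure Hans describes, a hub that holds only the key plus its technical wrapper, and satellites keyed by the hub key and a timestamp, might look like the sketch below. Table and column names (`hub_customer`, `sat_customer_main`, `load_ts`, `record_source`, and so on) and the sample values are illustrative assumptions, and SQLite via Python's `sqlite3` module stands in for whatever platform is actually used. The last step adds a new satellite and checks that no existing table definition changed.

```python
import sqlite3

def table_definitions(con):
    """Return {table_name: CREATE statement} for every table in the database."""
    return dict(con.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"))

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (
    customer_key  TEXT PRIMARY KEY,  -- the identifiable instance, nothing else
    load_ts       TEXT NOT NULL,     -- technical wrapper: when it came
    record_source TEXT NOT NULL      -- technical wrapper: where it came from
);
CREATE TABLE sat_customer_main (
    customer_key  TEXT NOT NULL REFERENCES hub_customer (customer_key),
    load_ts       TEXT NOT NULL,
    date_of_birth TEXT,
    phone         TEXT,
    PRIMARY KEY (customer_key, load_ts)  -- two-part key: hub key + timestamp
);
CREATE TABLE sat_customer_location (
    customer_key  TEXT NOT NULL REFERENCES hub_customer (customer_key),
    load_ts       TEXT NOT NULL,
    city          TEXT,
    country       TEXT,
    PRIMARY KEY (customer_key, load_ts)
);
""")
con.execute("INSERT INTO hub_customer VALUES ('C1', '2024-01-01', 'crm')")
# History is stacked: same customer key, a different time slice.
con.execute("INSERT INTO sat_customer_main VALUES ('C1', '2024-01-01', '1990-05-01', '555-0100')")
con.execute("INSERT INTO sat_customer_main VALUES ('C1', '2024-06-01', '1990-05-01', '555-0199')")
con.execute("INSERT INTO sat_customer_location VALUES ('C1', '2024-01-01', 'Denver', 'US')")

before = table_definitions(con)

# A later increment: another business function delivers new context, so it
# gets its own satellite. Nothing that already exists is altered.
con.execute("""
CREATE TABLE sat_customer_rating (
    customer_key TEXT NOT NULL REFERENCES hub_customer (customer_key),
    load_ts      TEXT NOT NULL,
    rating       TEXT,
    PRIMARY KEY (customer_key, load_ts)
)
""")
con.execute("INSERT INTO sat_customer_rating VALUES ('C1', '2024-07-01', 'gold')")

after = table_definitions(con)
unchanged = all(after[name] == sql for name, sql in before.items())
```

Here `unchanged` compares the stored `CREATE TABLE` text of the original tables before and after the new satellite is added, which is one simple way to make the "no re-engineering" claim concrete.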
Shane: All right. So there’s actually a lot of patterns embedded in there, and I’ll break it out into what I see as the conceptual patterns that we use and then some of the physical ones, because as technologists, even in the Vault community, we love to argue the detail,
and effective dating versus flagging is one of the really interesting technical ones we argue about. But we go back to that core, basic. So we have a hub. And a hub basically just holds a key. That’s what we’re saying. So we identify what the key for a customer is, a customer ID, whatever it is.
And that is a difficult problem. Let’s not underestimate how hard that is in an enterprise system to have a single view of customer and a single key. But let’s just say we start off with one system and we’re relatively lazy, cuz I am, and I do a source-specific customer key, and we’ll talk about that later.
So I have this key, which is customer ID, and it comes from Salesforce. Actually, let’s not use that. I usually use HubSpot because it’s a nicer system. So customer ID comes from HubSpot. So we create a hub customer and it holds a key, and the other pattern you talked about there is that hub is now immutable.
When we saw a customer ID turned up, it goes in the hub and we never delete it. Because we said at some point in time we’ve seen that customer with that id, we know it happened. And we want an immutable record, which is one of the benefits of Vault, baked in is everything’s immutable.
That’s one of the core patterns. And then in the sat, we hold the detail about the customer. We might hold their name, their date of birth, some demographics around gender, anything else. And again, a core pattern of Vault is those sats. They’re temporal, they are effectively SCD2, slowly changing dimension type two.
So what does that mean? It means every time we see a customer, and let’s say they change their name, we go: we found that customer. We have a record for them in the sat. It’s based on their name. We’ve seen a new name turn up. We insert a new record. It says at this point in time, the name was this.
And we keep doing that. We keep racking and stacking these temporal changes into the sat. And that is a core pattern of Vault. So we don’t have this argument we used to have in dimensional of SCD type one, type two, type 10, blah blah blah. It’s type two by default. That is one of the core patterns that you don’t break.
Now we may have some technical implementation choices. We might flag the previous record as being end dated. We may only have the effective date, where we saw the business-effective version of that record turn up. And we may write some code that windows to say, at this point in time, what was the active name for that customer?
There’s a whole lot of technical implementation that we may wanna do when we build the vault physically, but that’s just based on a set of patterns that you know, the type of technology that you use, and the way you think. There’s pros and cons to all of those. Don’t sweat those. Just pick one.
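One of the implementation choices mentioned here, storing only a start timestamp and deriving the rest at query time, could be sketched as follows. This is one possible variant under stated assumptions, not a recommended standard: the table `sat_customer_name`, the `load_ts` column, the helper names, and the assumption that loads arrive in timestamp order are all made up for the example. The "windowing" is done with a simple ordered lookup rather than SQL window functions to keep the sketch portable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sat_customer_name ("
    "customer_key TEXT, load_ts TEXT, name TEXT, "
    "PRIMARY KEY (customer_key, load_ts))"
)

def name_as_of(con, key, as_of="9999-12-31"):
    """Return the name that was active for this customer at the given time."""
    row = con.execute(
        "SELECT name FROM sat_customer_name "
        "WHERE customer_key = ? AND load_ts <= ? "
        "ORDER BY load_ts DESC LIMIT 1",
        (key, as_of),
    ).fetchone()
    return row[0] if row else None

def load_name(con, key, name, load_ts):
    """Insert-only load: add a row only when the name differs from the latest one.

    Assumes loads arrive in timestamp order. Existing rows are never updated.
    """
    if name_as_of(con, key) == name:
        return False
    con.execute("INSERT INTO sat_customer_name VALUES (?, ?, ?)", (key, load_ts, name))
    return True

def history_with_end_dates(con, key):
    """Derive (start, end, name) slices; the end date is the next row's start."""
    rows = con.execute(
        "SELECT load_ts, name FROM sat_customer_name "
        "WHERE customer_key = ? ORDER BY load_ts",
        (key,),
    ).fetchall()
    ends = [start for start, _ in rows[1:]] + [None]  # None marks the open slice
    return [(start, end, name) for (start, name), end in zip(rows, ends)]

load_name(con, "C1", "Ann Smith", "2024-01-01")
load_name(con, "C1", "Ann Smith", "2024-02-01")  # unchanged, nothing inserted
load_name(con, "C1", "Ann Jones", "2024-03-01")  # changed, new row stacked on top
```

The alternative Shane mentions, flagging or end-dating the previous record at load time, trades this query-time derivation for an update to existing rows; either can work as long as it is applied consistently.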
And the only thing you’ve gotta do is be consistent: just use the same pattern for the way you implement it. So that’s it, sat and hub. Really simple. Now, sometimes we break the sats out, like you said. So I can say I’m gonna have one hub, which holds my customer key, and I’m gonna have two satellites, and I might decide that I’m gonna physically model those sats based on fast-moving data and slow-moving data.
So maybe I’m holding a number of orders in a sat for some reason, and that’s a fast-moving piece of data. So I may wanna store that in a different satellite, so the rate of change of each satellite is different. Maybe I wanna do it based on numbers and text.
I have a whole lot of case notes and I wanna store that in a separate satellite for some reason. So they’re just choices. What I do is I start off with a core sat, and then when I get a new field coming through that I need to add, I create another sat. Why do I do that? Because I have less blast radius of change.
I can just create another sat with that new field, automatically get the ensemble back, and the things I’m touching are fewer. What’s the cost and consequence of that? Now what I’ve got is multiple sat tables, and that is one of the things that people don’t like about Vault: you get a plethora of tables.
So you end up with one hub and 5, 10, 20 sats over time for that. So multiple tables that describe that customer. There’s ways you can deal with that from a physical implementation point of view if you want to, so they’re just, again, choices. So hub holds a key, sat holds descriptions about that key.
What’s a link? Cuz that’s our third core pattern for Vault.
Hans: And let’s definitely jump into that. I would say also, just to piggyback off your sat conversation for a moment, you’re absolutely correct. I think that there’s a lot of flexibility afforded to us in how we do that. One thing I can tell you is, from more active deployments in the last 15, 18 year window, we really haven’t run into cases where people have been stumped
by satellite design. So even though, I think listeners can agree, it probably sounds a little soft, we’re saying, hey, you can do this, you can do that. But in the end, if you take into account that there’s a few variables, the rate of change of data for sure is one variable.
What kind of data, what class what function does it belong to? What does it mean? What’s logically grouped based on some kind of a logical grouping mechanism, whether it’s a, an ontology of some kind, whatever. What ends up happening is that people will design SATs and then ultimately they will turn out to be pretty good.
And of course if you need to refactor a sat, you can. It’s re-engineering, of course it’s work. Create a new one, eventually cut off the old one. Make the changes you gotta make. Maybe there’s a data migration, maybe there’s not. But the point is that generally we end up with a bell curve. Like I think even in your worst case scenario, we’re talking about 20.
I would say that we usually end up with somewhere between three and seven satellites on the foundational concepts that we model. And and certainly it’s a bell curve with the high end of that being, say a couple dozen, maybe it’s 24. And we’ve seen that in a couple of client sites where it is and upon review we go through and say, yeah, actually these are valid, logical, good 24 because they happen to have.
A discernible amount of those things around. In this case, customer would be a good example of that. That would be one client or customer. Another one might be if you’re in the in the medical field and, hospitals, whatever around the idea of a patient or around an appointment. There could be more than you would typically anticipate would be normal.
But getting beyond that couple dozen, I’ve never seen it. And like I said, you’re certainly within one, probably two standard deviations from the norm. In the first standard deviation, you’re in the three to seven for sure. And as you get beyond that, of course, it could always be just one, and then maybe it’s nine or 10.
It’s not gonna be crazy. So just to give some comfort in that area: in practical reality, it hasn’t really been much of a challenge. When it comes to a link, from a technical view it’s how we form the constraints of relationships between concepts.
And of course if you’re listening and you’re a 3NF modeler, you’re talking about a foreign key embedded into an entity, wherein it forms a constraint of a relationship with another concept. What ends up happening in all ensemble methods, Anchor, Focal, all of the ones out there, definitely Vault, is that we move those relationships out of the concept, which of course makes sense:
unified decomposition, and the hub’s only a key. So there’s nothing else in there. So the links are just relationships. All they are is a representation of a combination of keys that form a relationship. A really important thing about this, and this is where it’s probably worth it if you’re getting into Vault now, is to spend a little time understanding the dynamics of what it means to be a link, what it means to represent a relationship.
Because let’s put it this way: a combination of keys in a link, from a technical view, should approximate a foreign key constraint. So if you’re basically in a pattern of 3NF and you’re writing create constraint, whatever, entity customer, class ID with customer class ID, and you’re combining those two keys and creating that constraint, that line of code is effectively a link.
Now the thing is that the link is, however, designed. It’s not just that every foreign key gets its own link. We design them by the naturally correlated concepts. And of course, this is driven in multiple different theories from how we gain this knowledge from the business. There’s of course discussion around CBC and NBR, what we call core business concept and natural business relationship, but also the core business process that drives how those things get combined.
So let’s just say for arguments sake that just like you have to trust your instinct and how you design those satellites, which isn’t nearly as scary as it sounds, right when it comes to link relationships, you’re gonna end up with a table that’s just a set of foreign keys. That’s all.
It’s super simple, but the way you decide how to combine what keys is gonna be based on what we refer to as the natural business relationship. And so for example, if you have a sale, and that sale would of course require there to be a customer and an employee and a store, in other words, the customer, employee, and store are requisite components of a sale occurring,
that’s the definition of what it means for that event to occur. Then you’re gonna end up having a link, a singular link that ties the sale with also that customer and the employee and the store all in one link. And it would be the proper design to combine them all in one. And so that’s where the element of understanding of natural business relationship is what drives your decision as to which foreign keys belong together in a link.
Again, just to reiterate, when you step back, the link still doesn’t mean anything by itself. In other words, that link isn’t an instance of anything by itself; it’s the relationships of something. So in this example I just mentioned, that link that ties those four together is the header relationships of a sale.
That’s what it is. But by itself, it doesn’t mean anything. It just means, hey, I’m showing you what are the natural correlated header relationships for the sale. So there’s always gotta be for the fill in the blank, whatever that hub is, right. In order for it to work.
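The sale example can be sketched as a single link table that carries nothing but the four keys (plus an assumed load timestamp). The table names, the `load_ts` column, and the sample keys are illustrative assumptions; SQLite through Python's `sqlite3` module is used only so the sketch is runnable. Foreign key enforcement is switched on to show the link behaving like the foreign key constraints it replaces.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
con.executescript("""
CREATE TABLE hub_sale     (sale_key     TEXT PRIMARY KEY);
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY);
CREATE TABLE hub_employee (employee_key TEXT PRIMARY KEY);
CREATE TABLE hub_store    (store_key    TEXT PRIMARY KEY);

-- One link for the natural business relationship: a sale requires a customer,
-- an employee, and a store, so all four keys belong together.
CREATE TABLE link_sale_header (
    sale_key     TEXT NOT NULL REFERENCES hub_sale (sale_key),
    customer_key TEXT NOT NULL REFERENCES hub_customer (customer_key),
    employee_key TEXT NOT NULL REFERENCES hub_employee (employee_key),
    store_key    TEXT NOT NULL REFERENCES hub_store (store_key),
    load_ts      TEXT NOT NULL,
    PRIMARY KEY (sale_key, customer_key, employee_key, store_key)
);
""")

con.execute("INSERT INTO hub_sale VALUES ('SALE-1')")
con.execute("INSERT INTO hub_customer VALUES ('C1')")
con.execute("INSERT INTO hub_employee VALUES ('E7')")
con.execute("INSERT INTO hub_store VALUES ('S3')")

# The link row is only a combination of keys: the header relationships of the sale.
con.execute(
    "INSERT INTO link_sale_header VALUES (?, ?, ?, ?, ?)",
    ("SALE-1", "C1", "E7", "S3", "2024-01-01"),
)

def relationships_of_sale(con, sale_key):
    """Return the (customer, employee, store) keys tied to one sale."""
    return con.execute(
        "SELECT customer_key, employee_key, store_key "
        "FROM link_sale_header WHERE sale_key = ?",
        (sale_key,),
    ).fetchall()
```

Splitting this into three two-key links (sale to customer, sale to employee, sale to store) would also be possible, but under the reasoning above the single four-key link is the design that mirrors how the sale actually occurs.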
Shane: The way I think about it is as core business process, and I think it was on one of your courses that you talked about Peter the Fly. The way you described it was you’re in the organization, you’re standing there and you’re Peter the Fly,
you’re stuck on the wall and you’re watching these things happen. What does he see? And I still use that. So let’s use a simple example: customer orders product. We see that happen. So we know there’s a customer ID cuz there’s a customer involved.
We know there’s an order id cuz there’s a, they’ve ordered it and we know there’s a product id, a thing they’ve ordered. And we’ll come back to many in a minute. And then we might see another process, we might see customer pays for order, so we know there’s a customer.
Yeah, customer ID. We know there’s an order ID, and now we know there’s a payment, but there’s no product involved. We’re now paying for an order, we’re not paying for a product. Or are we? Good question. And the next one is store ships order. Okay, we’ve got a store now,
a new thing involved, new hub, new concept. There’s a shipment, so we’ve got another core concept coming through, and we’ve got that order. Same ID. Oh, actually, are they shipping orders? Are they shipping products? What if they ship half the order, two products off the order, but not the other two?
So maybe it’s actually store ships product for an order. Right, now we’ve got four keys involved. So this is why we do it, cause we’re having these conversations. And then we go, okay, customer returns something. What are they returning? Are they returning the order? Are they returning a product?
And so we can see that we have these conversations about core business processes, and by doing it, we’re identifying these concepts, concepts are a hub. The core business process are a bunch of relationships. They’re the links. So when I see customer orders product. I’m gonna have a link table that’s got three keys in it,
customer ID, an order ID, and a product ID. I’ll start off with that and see if it works. When I see customer pays for order, I’m gonna have a link table with a customer ID, a payment ID, and an order ID. But if I see customer pays for product on an order, then I’ll have a four-part link.
And one of the benefits of Vault is you can very quickly model it logically or conceptually, so you know what it should look like, then build it physically, cause it’s just a bunch of keys, and see whether the data actually fits what the business think they do. And that’s one of the values of it: it gives us that agility.
We can go, we think it’s customer orders product. We go and we quickly prototype it and then we go, actually it’s not, there’s a whole lot of extra complexity in that data, or in that core business process, that they didn’t tell us about, that we didn’t see. So for me that is part of the value:
it allows us to be very quick and agile and learn early, get feedback, rather than big design upfront, big build, and then, oh, that’s not what we wanted, or you got that wrong.
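The "prototype it and see whether the data fits" step can be as small as the sketch below. It builds a throwaway link for customer pays for order from source rows and then tests one stated belief, that each payment covers exactly one order. The function names, the sample rows, and the belief being tested are all hypothetical, chosen only to illustrate the idea in plain Python.

```python
from collections import defaultdict

def build_link(rows, keys):
    """A link is just the distinct combinations of the participating keys."""
    return sorted({tuple(row[k] for k in keys) for row in rows})

def one_to_many(link, keys, parent, child):
    """Return each parent key that is related to more than one child key."""
    p, c = keys.index(parent), keys.index(child)
    seen = defaultdict(set)
    for combo in link:
        seen[combo[p]].add(combo[c])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

# Hypothetical source rows for the process "customer pays for order".
payments = [
    {"customer_id": "C1", "payment_id": "P1", "order_id": "O1"},
    {"customer_id": "C1", "payment_id": "P1", "order_id": "O1"},  # repeated feed row
    {"customer_id": "C2", "payment_id": "P2", "order_id": "O2"},
    {"customer_id": "C2", "payment_id": "P2", "order_id": "O3"},  # one payment, two orders
]
keys = ["customer_id", "payment_id", "order_id"]

link = build_link(payments, keys)

# The business said "a payment is for an order". Does the data agree?
surprises = one_to_many(link, keys, "payment_id", "order_id")
```

In this made-up data, `surprises` shows payment `P2` covering orders `O2` and `O3`, the kind of early finding that prompts another conversation about the process before anything larger is built.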
Hans: I would say, just to piggyback off of that, Shane, one of the great things about this is that we have come kind of full circle to where a lot of the, let’s see, engineering, mulling over the case and trying to figure out what’s going on, how to model it,
A lot of that’s been taken away. Or to put it another way, we’ve been freed up to be able to rely on what’s actually happening. So when you get back to that, Peter, the fly thing. What’s beautiful about it is that, we can say for example, that, the customer is purchasing these products in a store from this employee, whatever, and model that in the most logical way.
Of course, here’s two different grains represented, two different levels represented one at the header and wanted the product of this one, one product. And whichever one we’re interacting with is the one we model to, so it’s just very simple. We just follow what Peter sees. When you have the payment or the shipment we do the same thing again, but from the perspective of that new event
So for shipment, it’s, what am I shipping? I’m shipping products. Now, they happen to be part of an order. Okay, great. Does the shipment know about the products it’s sending? If it is shipping at line item level, at product level, then it knows, and that’s what you map to. If it only knows about an order, then you map to the order.
And that’s because that’s just a mirror, a reflection of the exact business that you’re seeing. That’s it. And so what’s nice is we never have to worry, when we build the shipment, that it’s gonna impact or change what the sales model was, cuz the sales model’s already done. And remember we’re doing additive, incremental.
We’re bolting on. We’re just gonna bolt on to what was there before. So we leave that where it is, because when the shipment comes, that’s its own subject or its own feature, its own thing. It has its own relationships. They’re already there. We just have to identify them. But that’s not gonna change what we had before.
Cause what we had before was also a valid process. And so you’re gonna see many links between the same concepts that drive these different relationships. If you have your third normal form hat on, you’re gonna look at that and say, wow, circular relationships, bad. Rip it up, throw it away.
It’s not good. But that’s because you have your 3NF hat on. You gotta take that off for a minute and put on your agile incremental build hat, the one that says that when something is there, you don’t have to re-engineer it when you add something new. And the relationships that are unique to shipment are not the same relationships that are unique to sale.
You don’t have to change, modify, or impact them in any way. You just bolt on the new one. And I think the last point to consider, and you can tell I get excited, this is why the pattern works so great. The last thing in what you said that I think is good to consider is when you start to put the data in, if there’s a gap wherein the data you have can’t fill the model that you’ve now designed, it’s one of two things:
either the data is right and you need to change what you’re doing, or the business is right and the application that captures the data is deficient. In which case, now you can have a communication with the business and say, hey, this looks to be the way your business actually works. For example, your shipments know at the product level which products were delivered on that shipment.
But your data doesn’t capture it. And then your business might say, oh, shit, we should capture that. And then there might be a revision to the source system, because it’s not meeting the need of what the business is looking for. So when you have a gap between the two, it’s actually a very meaningful thing.
It’s either revamp the model, because something went wrong, or the source systems need to be adjusted to meet where the business requirement is. And that communication, you can imagine, would be hard pressed to happen if you didn’t have this catalyst to point it out to you.
Shane: Yeah. And I agree, so Vault gives us a hell of a lot of agility, it gives us the ability to adapt to change. That’s a perfect example. If we see customer orders product and we see store ships order, as people who have done data work for a while, we know there’s a problem there,
we just intrinsically know. But maybe they don’t. And often we talk about seeing is believing: show them the problem in the data and then they’ll go, yes. Or they know the business problem is there, they just can’t articulate it. So let’s use that example, where we go, actually, store ships order. There is no way for us to tell, if we’re Peter the fly, what products were actually shipped, because it’s not in the data. Somebody sees it, they put it in the box, but that’s not captured anywhere. So we can infer, we can join those links later on and say, when a customer orders a product and a store ships an order, we can infer that those products were shipped.
But we are making shit up, because Peter the fly and the data haven’t seen it. We have no data proof that the product was actually shipped, and that’s typically a business problem, because now what happens if they didn’t ship it? We inferred you shipped it. The customer said, I didn’t get everything.
We’re like, we sent you the order. Could you tell us which product you didn’t get? Now, that’s a shit business process, isn’t it? So why wouldn’t we change our business process to capture the products that we are shipping? Now let’s say that we’ve been running this for a year, so we’ve got this link, which is store ships order, and then we finally fix it,
we do a digital transformation, or they actually work in an agile way and they get to the backlog and they actually solve that problem, and now we have data about the product being shipped. What do we do in Vault? We create a new link, because we say there’s a new business process,
which has data that supports it. So now we create a four part link, we have a store which ships a product related to an order, so now there’s just four keys and we have a whole lot of technical choices about whether we have the data to backfill that for history or not.
But from day one, after we implement that process, we have a new link. It starts capturing that information and that data is now captured going forward. That other link, store ships order? We may retain it, it may still have value; we may retain it for a while and then decommission it. We have a whole lot of technical implementation choices, which is great.
But the key thing is we’ve just created the new link in 10 seconds and that data’s now available. And that’s the other core pattern of a link: it is a many-to-many relationship. So we say, okay, customer orders product, three part link. And there is no way that the customer can place an order with no product.
Never happens, never. Now we know that in the data world “never happens” is not true, so let’s just say there’s an agile release of the software and somehow they freak it out, and customers can actually now place orders that have no product lines. What’s gonna happen? In that link we are gonna see a customer ID and an order ID turn up
with a null product, there’s gonna be no product id. Now that table is just gonna absorb that, right? Now, we should have observability that says whenever we see a link that has missing keys, tell us about it. Cuz we care, there’s something that we thought should never happen that has happened. But it just absorbs it,
it says the data says this is what you did.
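The "absorb it, but tell us about it" idea could be sketched as a simple check over link rows. This is an assumption-laden illustration: the row layout, the column names and the `missing_key_report` helper are hypothetical, and a real implementation would more likely run as a query inside the warehouse.

```python
def missing_key_report(link_rows, key_names):
    """Return the rows with at least one missing (None) key, and their share of all rows."""
    flagged = [
        row for row in link_rows
        if any(row.get(key) is None for key in key_names)
    ]
    share = len(flagged) / len(link_rows) if link_rows else 0.0
    return flagged, share


# Hypothetical rows in a "customer orders product" link. The second order
# arrived with no product line; the link stores it as-is rather than rejecting it.
link_customer_order_product = [
    {"customer_id": "C1", "order_id": "O1", "product_id": "P1"},
    {"customer_id": "C2", "order_id": "O2", "product_id": None},
    {"customer_id": "C3", "order_id": "O3", "product_id": "P7"},
    {"customer_id": "C3", "order_id": "O3", "product_id": "P8"},
]

flagged, share = missing_key_report(
    link_customer_order_product, ["customer_id", "order_id", "product_id"]
)
print(f"{len(flagged)} of {len(link_customer_order_product)} rows have a missing key ({share:.0%})")
```

The load pattern itself is unchanged; the check sits beside it and turns "this should never happen" into a number that can be taken back to the business.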
Hans: I think that gets back down to a philosophy too. Most of us are in this discussion around agile data for the purposes of enterprise integration, warehousing, something in that realm. And when you get into that realm, we are going to be copying in data that’s already been captured,
we’re not the system of record. We’re taking data that’s already captured, and it should have been subject to data capture rules that prevent that kind of an error from occurring. But now we have the benefit, and also the duty, to become the feedback loop for the business, to say, hey, look, we’d like to point out to you that there is a potential issue here.
And it looks like the potential issue is that 1.8% of your sales here don’t have any product on them, or your orders or shipments don’t have a product on them. And they say, oh, that’s not right. And then you can go and investigate that and find out what’s causing it. And hopefully it’s as simple as a data error, and the actual sources are accurately capturing it, but we’re just not receiving it.
Or, who knows? But it could also be that it’s right at the point of the actual source system’s data capture, the system of record, that it’s not adequate and there’s an error and it needs to be fixed, and then they can prioritize that. And so we’re helping them out in a way, almost auditing and feeding back to them what could be a problem with their business processes as they’re deployed in their sources.
Shane: And the key thing is it adapts to change. The pattern of how we load that link doesn’t change, it just adapts to it and says, the data told me this is what happened. Now, whether that’s the business process you expect to happen or not, that’s a different conversation,
and a valuable one. So if I think about the early days, when I had a team doing star schemas, with this modeling technique where, really, I felt dumb cause I couldn’t quite understand the intricacies of fact tables and grain and a whole lot of stuff, in my head I kind of did a mapping,
and I go, if you’ve got a dimension, an SCD Type 2 dimension, you effectively take the key of the dimension and that becomes your hub, and you create everything else out of the dim and that becomes your sat or sats. And in my head, that’s all we do, we just decompose the dim down to those two things.
And then you’ve got the first part of your data vault ensemble. And then if I think about a fact, we take the fact and the link almost becomes the fact, it’s holding the relationship. So rather than foreign keys between the fact table and the dims, we’re effectively just getting a long, wide table of keys.
And that link effectively allows us to reconstitute the fact. So again, we don’t hold the value of the fact in the link, we hold it in a hub and sat combo, but I think about it that way: if you’ve got a fact table, the equivalent in Vault is a link, in my head.
Hans: I would say, just to clarify, you can say from the kind of backward engineering perspective, yeah, that a dimension would be basically a hub. Its key would be a hub, and all its attributes could be split out to the different satellites. If it had any rolled-in outliers,
where there was a break from the true key dependency, which you do a lot with dims, you would then probably have alternate outlying hubs and link relationships to those, like store to a region or something that would be further out, one level out. But really super important is that a fact table, first of all, doesn’t really map very well to the vault at all.
But if you were gonna follow that pattern, keep in mind that whether it’s a dim or a fact, the first thing is it’s gotta have a hub. So the first thing that’s gonna come out of that fact table is a hub representing whatever the event is that drew those things together. And then its link would be one starting point for the link relationships between them.
But of course, the thing with the fact table is it’s got a lot of mixed grains and nulls and redundancy. It’s probably gonna end up being multiple links in effect when it becomes a business model. But I think you’re right that the event-based hubs, and the links that glue together the concepts, are more or less in the fact table area. And of course the kind of foundational concepts, not event based but person, place, thing
hubs, and their satellites of context, will be in the dimensions, like I said, with the exception of some of those links to outlier hubs. But what’s really interesting is you’re looking at it from the fact and dim to the vault model.
What I think is really interesting here is, if we model the vault model correctly, you can just take that thinking and flip it. If we model those hubs as the foundational core business concepts, which would be the element like your BEAM core, basically, dimension,
then the satellites in combination would basically be a fully flattened dim. And so pulling from a hub that’s, say, customer or product, folding in those attributes and then binding it to a dimension should be a trivial process. And of course, if you’re in memory, you don’t do the whole thing.
You just select the attributes that are pertinent to your query at the time, to the question you’re asking. And of course, interestingly enough, even if you have a hundred-plus attributes for a customer, for example, the average number of attributes that you ever use for downstream reporting in a dimension is somewhere in the three to seven range.
And for example, if I’m doing regional product analysis, like which products are popular in what regions, I need to know the product sales associated with the instances of a sale record by a customer. And from customer, all I really need to know is their postal code in order to establish their region.
So for that particular query, all I need is one attribute, so folding that into a dimension is super easy. And of course, you might bring some demographic attributes as well, and hence you get to seven, but the number of cases where you ever need to pull into a virtual mart delivery a dimension with all 120 or whatever attributes has gotta be less than half of 1% of the time.
So that’s what makes this whole thing agile, and why, I think, we’re talking on AgileData.
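The "fold the hub and its satellites into a dimension, but only with the attributes the question needs" idea might look like this in a minimal in-memory sketch. All names and values here (`hub_customer`, the two sats, `build_dimension`) are hypothetical, and each sat is reduced to its current values only, to keep the example short.

```python
# Hub: just the business keys.
hub_customer = ["C1", "C2"]

# Satellites: context keyed by the hub key (current values only, for brevity).
sat_customer_address = {
    "C1": {"postal_code": "1010", "city": "Northtown"},
    "C2": {"postal_code": "2020", "city": "Southtown"},
}
sat_customer_demographics = {
    "C1": {"age_band": "30-39"},
    "C2": {"age_band": "40-49"},
}


def build_dimension(hub_keys, sats, wanted):
    """Flatten a hub and its sats into dimension rows holding only the wanted attributes."""
    dimension = []
    for key in hub_keys:
        row = {"customer_key": key}
        row.update({name: None for name in wanted})  # keep columns consistent
        for sat in sats:
            for name, value in sat.get(key, {}).items():
                if name in wanted:
                    row[name] = value
        dimension.append(row)
    return dimension


# Regional product analysis only needs the postal code from customer.
dim_customer_region = build_dimension(
    hub_customer,
    [sat_customer_address, sat_customer_demographics],
    wanted=["postal_code"],
)
```

Asking for `wanted=["postal_code", "age_band"]` instead gives the slightly wider dimension without touching the hub or the sats.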
Shane: If we think about common patterns, we often see a three tier data architecture: source data comes in and is historized, immutable and raw. Then some form of middle area where we model heavily, to bring in the business concepts or to combine data or to change data or to clean data.
And then some way we consume it, some presentation or consumer layer. And I crack up that we go through these waves of what’s cool. Databricks just came out with the medallion architecture of bronze, silver, and gold, which is pretty much raw, stuff you’ve fiddled with, and stuff you use,
and I reckon, and again, I don’t have the numbers, but I reckon if we went and actually did a survey of all the vault implementations in the world, we would see the use of dimensional models or star schemas in that consumption layer, in that third layer on top of the vault, a lot of the time.
They may be physical, they may be virtualized, cuz we can create them as views. But we see turning it back into a dimensional model for consumption by tools is very common. And that’s because a lot of the tools are written to use star schemas. Or people tell me all the time that analysts, everybody, knows how to use a star schema,
so it’s the best tool. And I call bollocks on that, because you’ve gotta learn it; once you’ve learned it, it’s easy. I actually say one big table: everybody knows how to use one big table. Unless of course you want to count customers, because then you’ve gotta do count distinct.
There’s nothing wrong with going into a vault and then putting a star schema on top of the vault for the way you consume it. And when we talk about patterns, now we can codify that as a code pattern for physical implementation. We know the relationship between a hub and its sats,
it’s an immutable law. We can’t break that. So we can write a bit of code that rehydrates that hub and those sats back to a dim; it’s just a piece of code we use a hundred times. We know that a link holds a bunch of keys to the hubs. And we know the hubs hold the keys to the sats that describe those things.
So now we can write a piece of code that picks up those keys from the links and goes back to the hub and then goes back to the sats, and then builds a view: dimensional, one big table, snowflake, activity schema. They’re just patterns of how you consume it. But they can all be mapped programmatically back to the vault model.
Which means we can automate it, we can make it easy. One of the things we talk about as a problem of vault is the number of tables, there’s just lots of them. And in the past we cared about that for two reasons. One was storage, cuz we used to have a really expensive on-prem database.
That’s gone now, we’ve got the cloud analytics databases. Storage is relatively cheap. They can hold thousands of tables and we really don’t care anymore. And then the second one is how hard it is to use as a consumer, and again, that’s why we have to bring this consumer layer in.
We have to make it easy for them and we have to automate that. But Vault allows us to do that. So would I ever suggest that somebody builds a data vault and then has their analysts hit the hub, sat and link tables whenever they want to do anything? Hell no. Be nice to them, give them a lens, give them a view, give them a nice pane of glass to make their life easier, which is what we should be about.
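Since the hub-to-sat relationship is fixed, the consumer view can be generated from a little metadata rather than hand-written. The sketch below builds the text of such a view; it is illustrative only. The function name, the table and column names, and the `is_current` flag on each sat are all assumptions, and the SQL is generic rather than tuned to any particular database.

```python
def dimension_view_sql(view_name, hub_table, hub_key, sats):
    """Build a CREATE VIEW statement that joins a hub to the current rows of its sats.

    `sats` maps each sat table name to the list of attribute columns to expose.
    Assumes (hypothetically) that every sat carries the hub key and a boolean
    `is_current` column marking its latest row per key.
    """
    select_cols = [f"h.{hub_key}"]
    joins = []
    for index, (sat_table, columns) in enumerate(sats.items()):
        alias = f"s{index}"
        select_cols.extend(f"{alias}.{column}" for column in columns)
        joins.append(
            f"LEFT JOIN {sat_table} {alias} "
            f"ON {alias}.{hub_key} = h.{hub_key} AND {alias}.is_current"
        )
    lines = [
        f"CREATE VIEW {view_name} AS",
        "SELECT " + ", ".join(select_cols),
        f"FROM {hub_table} h",
        *joins,
    ]
    return "\n".join(lines)


sql = dimension_view_sql(
    "dim_customer",
    "hub_customer",
    "customer_key",
    {
        "sat_customer_details": ["customer_name"],
        "sat_customer_address": ["postal_code"],
    },
)
print(sql)
```

The same function serves every hub; only the metadata passed in changes, which is the "write it once, use it a hundred times" point.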
Hans: I agree. I think, though, one of the interesting things that in my 20 years with this has come full circle is that we’ve gone from, I guess, taking it for granted that, yeah, it’s more complex, to what we’ve found now, which is that literally it’s easier and less complex. And this is what’s really bizarre about the whole paradigm, because you actually end up, in third normal form and in other forms of modeling, creating table structures that are confusing or get cumbersome in their usage, or you don’t understand why they’re there.
Whereas in our forms of ensemble modeling right now, there isn’t a hub out there that isn’t a core business concept, or, let’s say, a line item, some kind of a key for a major relationship that you have. And so when a business reviews that, without looking at all the three to seven satellites, but just hubs and links, and they’re named properly, that actually kind of mimics, in a way, the way our brain works.
It’s very simple and easy for people in the business world to see their model represented in these hubs and links, CBCs and NBRs. So it actually, from that perspective, makes it easier. And what’s also interesting with the satellites is, if I’m in one business area and I’m looking for a certain set of context, and we’ve designed those satellites logically based on business function or type of data or rate of change, they’re gonna spot their satellites
to see what they’re looking at, and it’s gonna make a hundred percent sense to them, versus looking at those 120 attributes where, you know, 85 of them mean nothing to them, and it just confuses them. This is where they can focus in on what they’re looking at. And of course, now being able to select only the pieces you want makes it easier.
So I would say that communications about the model with business users and with everyone is easier this way. I still agree with you, however, that I would put this lens over the top of it for any specific set of business users to view it in what is typically a dimensional view, because that is how they’re used to it, and it is for them cleaner and simpler to visualize in that manner,
when they’re looking at the data itself, seeing it in that way is cleaner. One other last interesting thing is the amount of space it takes to store every slice of history and all the data. As it turns out, on ensemble patterns it’ll take less space, on disk, if it was disk-based, and certainly on cloud.
I recognize it’s irrelevant now cause it’s all pretty much free. But it’s a little bit less than otherwise. But the table count does mean more joins on a general platform. Now, even if I only select one satellite outta seven, it’s still one extra join to navigate from the hub to the sat.
What ends up happening there is we’ve had platforms, as of the last five to 10 year window, that are space free but process expensive, and a big part of the process-expensive side is you’re petting the cat the wrong way with a lot of joins with these methods.
So it’s expensive. So I think, from our perspective, looking to the places where we can see how to make the repeatable pattern, and the selection of a subset of links and sats for a particular deployment, really helps mitigate any of that pain. And coincidentally, the current models of things we’re using today have actually gotten much more efficient in, I’d say mimicking joins, but in processing and working with joins, than they were before.
So I think we’re getting technology bailing us out anyway.
Shane: Under the covers we use BigQuery. BigQuery doesn’t like joins. It’s a columnar database. It does them, it doesn’t like them, but we’ve never had a problem with it not running them. We just try and remove them for the user as much as possible. And the other one is, that cost one’s a really interesting conversation: cost to compute.
If you look at BigQuery, Snowflake, you pay for what you are computing. They charge you different ways, but it’s always based on usage: you query more, you pay more, that’s the model. And so with the Vault stuff, you can think about it in clever ways.
If we have users that always go in and write a query which is select star, then we pay for scanning all that data, right, or our credits for the thing to run for the period of time to process all that data. But if we are monitoring, as we should in data ops, we should be monitoring who’s running what query, what the common queries are, what the common columns are that are getting hit regularly.
Then we can restructure our sats. Let’s put the commonly hit fields in a sat. Let’s give them a view or a physical table or a dimensional model that is customer and the most commonly used customer fields, and they can select star all they want and your cost goes down. And to be clear, we give them a second table or view or star schema, which is everything,
and we say to them, if you need everything, then go over there. We bring in this idea of friction, and for a reason: go hit this one first. It’s cheap, it’s easy. If you need everything, it’s there. There’s just a little bit more friction for you to go and use it, because there’s a cost and consequence of you doing that.
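Working out which fields deserve a place in the cheap, commonly used sat or view can start from a simple count over the query log. This is a rough sketch under stated assumptions: the log format (one list of referenced columns per query) and the `hot_columns` helper are hypothetical, and real query logs would need parsing first.

```python
from collections import Counter


def hot_columns(query_log, top_n):
    """Return the top_n most frequently queried columns, counting each column once per query."""
    counts = Counter(
        column
        for columns in query_log
        for column in dict.fromkeys(columns)  # de-duplicate within a query, keep order
    )
    return [column for column, _ in counts.most_common(top_n)]


# Hypothetical log: the customer columns each query touched.
query_log = [
    ["postal_code", "customer_name"],
    ["postal_code"],
    ["postal_code", "customer_name", "age_band"],
    ["postal_code", "loyalty_tier"],
]

common_fields = hot_columns(query_log, top_n=2)
```

The resulting short list is a candidate for the slim sat or view; the "everything" table stays available for the rare query that needs it.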
There’s many ways we can structure those satellites in the way we provide that data to be consumed, based on our context. It’s always about our context: what are we focused on? How does our organization work? What’s important to us? So if we go back, hubs, sats and links,
they’re the core immutable things within vault. There are some other patterns, there’s SALs and HALs and a few others, which we use for specific cases. I don’t wanna go through all of them because we will run outta time. Obviously people can come on your course and find out all the rest, but let’s just cover HALs and SALs, because they’re variations on some things we use on a regular basis.
So can you gimme an explanation of a SAL and then a HAL, or a HAL and then a SAL, up to you.
Hans: Sure. I’ll explain it in two ways. I think that, if you’re accustomed to third normal form, it’s effectively the ensemble embodiment of a recursive relationship. If you’re not as familiar with third normal form, what we’re saying is that there is some kind of a relationship wherein, from the same concept, you wanna relate it to itself.
For example, one would be for de-duping. You would say, I have three records in here that came from three different functions of the business. Even though I did a good job on my enterprise-wide business key, I’m not 100% confident that these three records are all the same. But I believe they probably could be.
So what you do is you say, let me form a relationship between, let’s say customer again, customer A, which is Hans Hultgren with an HH102 key, and customer B, which is Hans Hultgren with a 91724 key, some other key from a different place. Now I have reason to believe they’re the same, but there’s also a chance that they might be just two people named Hans,
it could be. So what I do is I say, let me form a relationship between those two records where I tell you that I believe these are the same. So that would be a same-as link, or a SAL. So you just have a link that says, I’m a two-way link, and I’m pointing both of my directions to the same hub, so I can grab two different records.
And I’m telling you that I believe they’re the same. What’s nice about that is that we can then further add a descriptor to that instance. If you look at it from an entity with a recursive relation, you would embed the foreign key in the entity pointing to itself,
that’s all recursions in 3NF. But what’s beautiful about this is that you could have five SALs outside of that customer hub, because there’s no weight on the customer concept at all. A SAL is floating on the outside, it’s floating around that instance of the hub key table for customers.
So it’s not embedded anywhere. So let’s say you throw one on there and say, oh, I’m gonna match first name, last name. Oh look, I got 50 matches. They’re all the same. Of course, if you’re in Sweden, I can tell you right now that’s not the same person, because that’s not enough information to make it match.
So then you have another SAL that says, look, both these systems may not have all these attributes, and even if they did, they may not be entered the same way. But I tell you what, if you did have all these attributes, and if they were entered the same way, I’m 99.9% sure it’s the same person. So now you can say, my algorithm is first name, middle name, last name, date of birth, city of origin, mother’s maiden name, highest level of education.
Okay, once you’ve got into that, you’re like, shit, that’s the same person. So now if you think about it for data science purposes, for reporting purposes, how are you gonna use the data downstream? It’s imperative to know what’s the strength of that de-duping logic, and it shouldn’t have to be either or.
So what we say now is, as many as you want. And of course if one of them is not useful, you toss it. There’s zero impact on the model, because it’s of course just the same-as link. Now the hierarchy link, or the HAL, actually works technically the same. It’s a single link with a key pointing to the hub twice,
so two different keys pointed to the same hub, but here it’s saying there’s a parent-child relation. And what’s really nice about that one is, in the link that’s called HAL, the hierarchical link, you name the first one parent and the second one child, and you load it with a parent-child relationship.
What’s beautiful about this one, again, is you didn’t do a recursive hierarchy for this like you would in 3NF, where you had to embed a foreign key. The benefit of that is that now, guess what? When you go enterprise-wide, there are multiple customer hierarchies. There are multiple product hierarchies,
so you end up having a product hierarchy that’s based on its marketing or sales or campaigns. You have another one that’s based on the kit that it was built from. You have another one that’s built on what regions it’s in. The hierarchies can be from different perspectives. You can have three HALs, five HALs on a product that each have their own specific meaning.
Plus, if you have a descriptor keyed instance on those, you can now accommodate, if you’re familiar with dimensional modeling, all the issues of the jagged and sparse hierarchies and different issues of jumping levels. Because we can describe each one individually, you can be succinctly outlining any form of hierarchy at any level, jagged, whatever.
And it embraces it and captures it instantly. So it’s a very flexible manner. It’s actually one of the exciting things, I think, about Vault: besides just the unified decomposition bringing us this kind of unprecedented agility, the whole realm of de-duping in warehousing, which is important, and the whole realm of how we capture varying hierarchies, has been like 10x simplified by using these elements.
What’s your opinion?
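Both patterns described here are links that point at the same hub twice, so a small sketch can show them side by side. It is a simplified illustration, not the method taught on the course: the table names, the product and kit keys, and the helper functions are hypothetical, and descriptors such as match confidence are left out.

```python
# Same-as links (SALs): each one records pairs of customer keys that a
# particular matching rule believes are the same. Several can coexist.
sal_customer_name_only = {("HH102", "91724"), ("HH102", "55501")}  # loose rule
sal_customer_strict = {("HH102", "91724")}                          # many-attribute rule

# Hierarchical links (HALs): (parent_key, child_key) pairs on the product hub,
# one link per hierarchy perspective.
hal_product_marketing = {("RANGE-A", "P1"), ("RANGE-A", "P2")}
hal_product_kit = {("KIT-9", "P1")}


def same_as(sal_rows, key):
    """Keys that this SAL asserts are the same as `key`, in either direction."""
    found = set()
    for left, right in sal_rows:
        if left == key:
            found.add(right)
        elif right == key:
            found.add(left)
    return found


def children(hal_rows, parent_key):
    """Child keys recorded under `parent_key` in this HAL."""
    return sorted(child for parent, child in hal_rows if parent == parent_key)


loose_matches = same_as(sal_customer_name_only, "HH102")
strict_matches = same_as(sal_customer_strict, "HH102")
marketing_children = children(hal_product_marketing, "RANGE-A")
```

Because each SAL and HAL sits outside the hub, dropping the loose name-only rule means deleting one structure; the customer hub and the strict SAL are untouched.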
Shane: Yeah, I agree. I like it because it takes a repeatable problem, or a few of them, and it gives us a repeatable pattern. So that one you mentioned about customer matching, that’s the perfect one. In vault we say a customer should only have one unique customer id,
we should have a single hub called customer. Peter the fly sees you walk into a store, you get your id, sees me walk into a store, I get my id. That’s nirvana, that’s the pattern that we really want. But what I really see is that, in an organization that has more than one system, we end up with customer ID in Salesforce and customer ID in HubSpot and customer ID in the eCommerce store, and nine times outta 10, or 11 times outta 10, customer ID in a Google Sheet or an Excel spreadsheet that somehow became a system of capture.
And so SALs allow us to do that. We say, like you said, we think these customers are the same, so let’s store that information and use it, that relationship of this might be the same as that. And same with products: we think here’s a product catalog, here’s another product catalog.
We think those are the same products, but the SKUs are different because they’re coming from two different things. We think they’re the same. And then hierarchical links, or HALs, the classic one for me is employee-manager. So you’ve got a table of employees and one of those employees is a manager and they have a team that reports to them.
So now I’ve got a link, where I go, this employee relates to this employee. And again that HAL, that hierarchical link, allows us to do that with simplicity. And it keeps track of changes over time. If I change that relationship, then that link’s gonna get a new record to say, hey, Shane no longer reports to Hans.
He’s been promoted and he reports to Remco, and we’re just gonna see that new relationship turn up. So I love them, I think they’re really great. I think one of the problems with Vault is you go and learn the three cores, hubs, sats, links, and then you get some ones added on, SALs and
HALs, and they make sense. And then we start going into some more complexity, bridge tables, a whole lot of other things that have value for certain contexts, and then we get confused. So again, start off small. Looking at the time, we could always talk about lots of use cases: header-detail modeling, is an address a concept or a descriptor.
But we’re outta time for that. So we might have to do an advanced data vault patterns podcast. But I just wanna come back to the core. One of the things you mentioned was the ability just to look at the data model and understand something. And a lot of that comes back to naming standards.
As technologists, we love to argue: is it hub underscore cust? Is it h underscore cust? Is it hyphen customer? Is it cust underscore h? Who cares, pick one and don’t change it. So if it’s hub underscore concept name, then make sure everything looks like that.
And same with sats, is it sat underscore cust underscore demographics, or is it something else? And the reason that’s important is, again, one of the values of vault is automation. So once you have standard namings, you can codify it so that it’s always created that way and it always looks that way.
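Codifying a naming standard can be as small as a few helper functions that every generated object goes through. The sketch below picks one of the conventions mentioned (`hub_`, `sat_`, `link_` prefixes) purely as an example; the function names are hypothetical and the choice of convention is an assumption, not a recommendation from the conversation.

```python
import re


def _slug(name):
    """Lower-case a business name and replace anything non-alphanumeric with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def hub_name(concept):
    return f"hub_{_slug(concept)}"


def sat_name(concept, context):
    return f"sat_{_slug(concept)}_{_slug(context)}"


def link_name(*concepts):
    return "link_" + "_".join(_slug(concept) for concept in concepts)


names = [
    hub_name("Customer"),
    sat_name("Customer", "Demographics"),
    link_name("Customer", "Order", "Product"),
]
```

Whichever convention is chosen, routing every table name through one place is what keeps it from drifting.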
One of the anti-patterns I see is if you are using Vault and you have a bunch of data engineers that are writing code to manually create and populate the hubs and links, then I suggest you have a look at different patterns, cause remember, there’s only six bits of code for those core objects.
So automate it, have some config or metadata. It’s a data modeling technique that’s designed at its core for automation. And that’s why we use it in AgileData, because we can write code once and use it many times. So let’s just go through the ones that I picked up talking to you:
given it’s based on some core patterns that are immutable, we can automate the creation of that data in those structures. Our code becomes hardened over time. So it’s the data ops write-once, use-many approach. The data becomes immutable. When we saw something happen in the data in the organization, we store that forever.
And we can make changes to the organization or put layers on top to represent that data in the way it should have happened. But that is immutable. Everything’s historized. Sats are SCD Type 2 by default, so we always have all history over all time. We can adapt to change,
so links are many to many, if something needs to change, we create a new link. Or if the data changes, it absorbs that change for us. We can re-engineer at will: we can add a new sat with very small change on the rest of the model. We can create new links.
We can, if we wanted to, if we ended up with a hundred sats, we could decide to rebuild a whole new ensemble with, instead of a hundred sats, five sats, and repopulate history from those previous sats into the new ones. And use that as our new consumable layer, run forward off that, because it makes us feel better.
And then we’ve still got the old sats sitting there archived with history, because that’s what we saw at that point of time. They’re all good choices. We can modify small parts, we can take an incremental approach. We can go and build customer orders product, deliver that value to our stakeholders early, and then say, right, now we’re gonna go and deal with shipment,
oh, shipment’s hard. It’s gonna take us some while to understand what the hell those systems are doing. And then they might say to us, could you do payment? Just give it a look. Yeah, no, that’s easy, it’s just customer pays for order. There’s no complexity there. Let’s pump that one out while we’re starting to think about the complexity of shipment.
So again, it gives us the ability to deliver incrementally, deliver small bits of value fast, which is what agile is about. It deals with master data. So this whole data mesh thing, that is really annoying me, because yes, I agree that democratizing and decentralizing back is valuable,
we get a whole lot of benefit out of that, but nobody’s talking about how the hell do we combine it again, because customer’s everywhere. How do we get a single view of customer? That’s what a data platform is about. And Vault takes care of that. It doesn’t make matching any easier,
it doesn’t actually make identifying that it’s the same customer any easier. But once we know how to do that, storing it and using it is at the core of what it does. Have I missed any other kind of core whys on Vault?
Hans: I think you really went through a great list there. Like you said: ability to automate, repeatable patterns, hardened over time. Immutable concepts, everything is historized. By default, that’s the way sats work. Super easy to adapt to changes. Re-engineer at will: sats, links. And like you said, you can always refactor, which does require some re-engineering, but it’s minimal comparatively.
And of course, you could do anything you want, but the pattern we would recommend is you build the new sat, you keep the old, you do a data migration if you need it. Same thing with links: if you’re refactoring them to make a keyed instance or something, keep the old one as well, put the new one in, and move forward from that day.
Constantly adapt. You can modify small parts without hitting the rest of it, without the impact. Incremental build, in any order, which was I think what you highlighted, which is, hey, let’s do shipments, wait, that’s complex, let’s do something else, purchasing first or whatever.
And what’s really nice about that, and we always try to point this out, is you treat them all as if they’re their own thing. It’s an easy bite-size nugget. And when they go into the main model, they don’t cause you to have to re-engineer. And if someone out there listening is a dimensional wizard, it’s just as if you’re adding new fact tables, new star schemas, into an existing federated mart environment.
You’re using the same dimensions, but you’re just borrowing those keys to pull them into a new fact. But it doesn’t impact any of the other existing facts or what they were before. It’s all additive, and that’s a huge benefit to the incremental build, which is really important.
And then, like I said, also master data. What’s nice about that is it’s a continual working effort, the more we learn about which ones are the same, or actually we find out they’re not or whatever, we’re never screwing up the core. We still have the data we can work with and represent it in whichever way.
And even if you have a master data initiative in the operational world, you can match it against your same-as links that have been doing their work to integrate from the warehouse side, and then learn from each other. Is this right? Did we get this correct? Both from our side reflecting what the business is saying and the business side seeing what we’re observing,
we can communicate with each other on the two. There’s one other quick thing I’d mention, which is that because of this incremental build, people sometimes look at vault and say, so many tables, the total cost of ownership will be higher. Or maybe the long-term total cost won’t be higher, but the initial will be. And as it turns out, what we found is the initial cost of ownership is lower as well as the total cost.
Because we are now able to realistically start with one small component that, once you finish it, does not have to be re-engineered again. So you’re actually able to start with a small increment, and then when you’re done with it, it’s done. It won’t need to get re-engineered when the other stuff is added on.
And then of course, everything else you add on is easy to add on. So for the portion of your team, the function of your team that’s dedicated to expanding the model, there’ll be less of an effort moving forward using this technique.
So that’s super cool.
Shane: And again, automate. Don’t treat it as a Mechanical Turk job where you’ve got people writing the code for a single hub in isolation, cuz that’s just an anti-pattern.
Hans: One quick thing. In the future, like you mentioned, if we need to get into some of these other topics around header-detail, link-to-link, multi-link satellites, if people are interested, we should do another session. We could probably spend an hour on those,
those features of the pattern, but then people listening would and should be already pretty much familiar with the foundations of vault modeling.
Shane: I think for those ones, maybe we’ll do an actual video, cuz I think we need examples where we can see the data, because you start getting into some of the complex use cases. So before we close out, here’s a question for you. If I look globally, I would say that the US is star schema or Kimball centric,
the majority of the work done in the US is around dimensional. If I look at Europe, it’s heavily ensemble or vault centric.
Is that true? From what you’ve seen of all the customers and all the people you work with, do you believe that Europe seems to be more vault centric than the USA? And if so, any ideas why, given that you’ve been there pretty much from the early days?
Hans: Yeah. I gotta say I’m surprised that is still true, and I think that you’re right to observe that. I think obviously you’re in the center of this. You have been for a long time. My involvement with New Zealand has been largely through you and colleagues in that group.
I see a good amount of uptick, because my lens to that environment is basically through your network and through people that we’ve met over the last 10 years down there. Same thing for Australia. So I think there’s a better percentage of people adopting these patterns, or they’re more open to it there than they are even in the US.
Europe and the Nordics pretty much lead that charge, I think, as far as adopting these techniques. Definitely the Netherlands and Sweden probably top those charts, and then all surrounding areas in Western Europe and the Nordics seem to be doing quite a bit of it.
The question mark as to why, I wish we could solve it. If you think of the pattern in technology advancements, a lot of them come from the US, and it’s not been an obscure reality. It has been true for real that these things may come from the US but they get adopted in Europe and the Nordics and Australia and New Zealand before they do in the US.
I think it’s a super strange thing. It’s not to say there aren’t a lot of people in the US doing it. There are, but if you look at it comparatively, it’s definitely much more of a pattern in these other places. The only thing I can say is that it seems to be cultural, that if you’re in Western Europe, if you’re in the Nordics, if you’re in New Zealand, you’re in Australia,
if you are in charge of a warehouse program and you’re the enterprise architect, it seems a little more of the authority and the weight and the responsibility of that rests with you. Whereas I think in the US, a little more of that authority actually resides in a person three levels above them. I think there’s a little bit more of that culturally here in how the larger organizations run, where they just say, okay, it’s the old nobody got fired for hiring IBM kind of thing.
And then it’s not really happening as much as it should be. And it’s too bad. I think we could all benefit more from it in the future. Interestingly enough, in the last two years we’ve had a much larger uptick in our certification training for people from Accenture, Ernst & Young, Deloitte, the bigger companies, and a lot in the US compared to prior years.
Still, all in all, the vast majority of things are happening with you guys and with Europe and the Nordics.
Shane: And even Australia’s funny, because if I look at that, Brisbane was the hotspot for vault, whereas Sydney and Melbourne weren’t. It was like a regional hub, and I could probably tie it back to a couple of people I know there who led the charge,
and they were doing great work and therefore were successful doing it. My view is, does Vault have problems? Yeah, there’s a whole lot of patterns in Vault that make things hard. Like any modeling pattern. Do I hope we get a new generation of modeling that retains the benefits of Vault and solves the problems with some other things?
Hell yeah. Haven’t seen it yet. I go and I look at unified star schema and I look at activity schema, I look at those and I go, are they the next thing that solves the problem for me? And they solve one problem, and then they bring in a whole lot of other problems. Again, it’s vault with star on top,
you probably end up with a data architecture that absorbs multiple data modeling patterns at certain points in time, given certain context. In our product we do one big table off vault often because there’s some value there. We’ll pretty much do one big table as a consumer view over and above a star schema if we can,
because there’s some benefit to us in the way we do it. There’s a consequence: like I said, you’ve gotta count distinct customer, or you’ve gotta hit the customer ensemble, you gotta go to the view that’s just about customer. Don’t count customers off the link or the event that is related to a purchase.
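The counting consequence described above is easy to see with a toy example. The rows and key values below are made up for illustration: a purchase-event table repeats the customer key once per purchase, while the customer hub holds each customer exactly once.

```python
# Hypothetical purchase events: one row per purchase, customer key repeated.
purchases = [
    {"customer_key": "c1", "order": "o1"},
    {"customer_key": "c1", "order": "o2"},
    {"customer_key": "c2", "order": "o3"},
    {"customer_key": "c2", "order": "o4"},
]

# Hypothetical customer hub: one entry per customer, including c3,
# who has never purchased anything.
hub_customer = ["c1", "c2", "c3"]

# Counting rows in the event table counts purchases, not customers.
event_row_count = len(purchases)

# Count distinct gives the customers who purchased, which is still
# not the same thing as all customers.
distinct_purchasers = len({p["customer_key"] for p in purchases})

# The customer hub is the place to ask "how many customers have we got?".
customer_count = len(hub_customer)
```

Here the three numbers come out as 4, 2, and 3, which is why the question of how many customers there are belongs to the customer table rather than to an event or a wide consumer table.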
Go at the table that’s called customer, funny enough, and ask how many customers we’ve got. But anyway, there’s costs and consequences of anything, there’s patterns and anti-patterns. If you think you’ve got a problem around immutable data, around historization by default, re-engineering, small changes, start early and deliver value fast, then have a look at vault,
and see if it fits for you. So if they wanted to do that, how do they get hold of you? Because I’ve been through your training. You’re not paying me to say this. It wasn’t bad, it was okay.
I use it, which meant I learned something. If people wanted to get ahold of you, what’s the best way to get in touch with you?
Hans: You can email me at any time you like. It’s Hans, h a n firstname.lastname@example.org, and it’s G-E-N-E-S-E-E, so a whole bunch of Es, geneseeacademy.com. And then of course you can just go to geneseeacademy.com and check out the website for classes that come up.
We used to do pretty much all of ours in person. With Covid we did a hundred percent online. Now we do probably 50/50. It used to be that more than half of our classes were public, open-enrollment classes. Now probably at least 85% of our classes are private in-company classes, because organizations that are using this technique want to get groups of people trained.
And so we do more private classes, but there’s still a good number of public classes available to take. So look on the website and definitely sign up. We run certification on it. We’re now issuing cool-looking badges so you can get your credentials on LinkedIn and get paid more, or whatever happens in that realm.
But if you come to the class, all I can say is bring an open mind and roll up your sleeves, cuz we’ll be modeling in these classes. This is a modeling-centric class. It’s not a, hey, let’s read this text and repeat it back to me. It’s an interactive modeling class.
And you learn a lot about the pattern by getting into it, and I think it’s worth doing that. And I think you can agree with that as well: when you take the class, getting into actually touching and feeling the pattern through modeling it is really the way to get ahold of it and to learn it.
Shane: I learn by doing. Like anything in data, it looks simple, and the pattern’s there: you’ve got a hub, it’s a bunch of keys, that’s customer. And then you get whacked with a source table that’s party entity, and holds customers and employees and suppliers and contractors. Okay,
how do you model that out? It’s simple, but again, going through that, learning by example is valuable, cuz then those kind of complex patterns that are gonna hit you turn up and you know how to deal with them. It still takes a brain, but you’ve got the toolkit.
I’ll put all the links in the show notes for anybody that wants to get a hold of you. Obviously you’ve got a book as well, so I’ll put that in there. Have a read of that if they wanna do that. And glad that you’ve got badges. Just the next thing you need to do is t-shirts,
cuz there’s no data vault t-shirt coming out from you guys. So I think that’s probably the next thing to add to your
Hans: I think so. And can I consult with you later about design? Because I was telling somebody the other day, I still have a batch of the most awesome t-shirts that I got from you. The OptimalBI ones, and there’s some great ones.
Shane: Yeah. With our new startup, there’s a new website called adt.style and there’s merch t-shirts, and I’m wearing one today. It’s the bring back data modeling one, so if anybody wants a return of the data model t-shirt, just go buy one.
You can walk around agreeing with us that data modeling is an art that’s been lost and it should come back. It has value. So regardless of what your flavor of model is, just make sure you do it right.
Hans: Actually that’s a really good way to wind this thing up. I agree with you. I think that right now, especially with a lot of the “lake, toss it all in there and it just automatically works” talk, I think we all know as data professionals that’s never gonna be the case.
And so probably the most important thing of all from this pattern-based podcast that you have is just do modeling. We need to get into modeling again. And there’s a lot of things that are effective and can work. And what we’re focused on here in these discussions is what might work better for certain purposes.
And that’s it. But either way, we should all be doing it. And I’m also like you, Shane, in that I have two sides to my vault story here. One is that I feel like now it’s so much easier and so much stronger than I thought it was 10 years ago. I think it’s awesome and I think it’s easy. I think it’s really the best way to go for a lot of things, but at the same time, I am ready.
As soon as the next best thing comes out, I will jump on it in a heartbeat. And I think we all need to be able to do that. We all need to be able to say, hey, I’ve done it this way for 15 years, but here’s something, let me look at it objectively. Yep, that is better for this purpose, and then please jump on it.
I will commit to doing the same. We should all be doing that. I’d hate to be the guy saying, oh, I’ve always done it this way. You know, you may as well just hang it up then, because that’s not gonna get us anywhere.
Shane: That’s right. And we learn new things as we touch new data, just bring it back into the pattern and get the pattern to absorb that new complexity when we can. Excellent. All right, hey, thanks for coming on the show, and I hope everybody has a simply magical day.
Hans: Thanks for having me.