The Hook data modeling pattern with Andrew Foad
Guests
Resources
Join Shane Gibson as he chats with Andrew Foad on his data modeling pattern “Hook”
Listen on your favourite Podcast Platform
| Apple Podcast | Spotify | YouTube | Amazon Audible | TuneIn | iHeartRadio | PlayerFM | Listen Notes | Podchaser | Deezer | Podcast Addict |
Podcast Transcript
Read along you will
Shane: Welcome to the Agile Data Podcast. I’m Shane Gibson.
Andrew: And I’m Andrew Foad.
Shane: Hey Andrew. Thanks for coming on the show. Today. We’re gonna go and have a conversation about hooks. I’m gonna suck outta your brain, what the hell it is and how we can use it. But before we do that, why don’t you give us a bit of background about yourself?
Andrew: Oh. Thank you Shane. Thanks for having me on. Right where to start? May maybe really quite early. ’cause I guess my data journey started very early on in my career. I worked in Manchester in the uk and I had a very enlightened team leader who introduced this concept of data modeling. Nobody was doing anything like this at the time in the early nineties, and it’s formed a foundation of my entire career.
So I’d use that in designing databases and that kind of stuff. I moved to Australia in 2008 and took a job as a Oracle developer. I’d done a lot of Oracle development, database design, that kind of stuff, and got made redundant from that job. First time that had ever happened in my career. Ended up taking a job as a business analyst, which was the worst possible thing I can could have done ’cause I’m really not a very good business analyst.
And then I got sick. I got really sick. And I was diagnosed with stage three Hodgkin’s lymphoma, and, and I ended up losing that job because of it. I went through chemo for about six months and I was outta work, so I got to a point where I was really bored, scratching around for a job and got a, offered a role for a small pharmaceutical company who were doing reports on data and they were doing stuff with Fox Pro and that kind of stuff.
And it was, it was a really good job, but they didn’t pay very well. But I had a great time and I actually built a solution for them that involved building an ETL tool. And this was before I even knew about data and analytics type stuff. So that can go on. That wasn’t sustainable. I couldn’t afford to stay there.
And I ended up applying for a role with a government role and they were asking for a data modeler. So I thought, I’m a data modeler. I can do that. And it turned out that they wanted a dimensional modeler, and I’d never done this before. Never done anything like this before. So this is the kind of how it fell into place.
So I applied for the job, didn’t have the credentials or the experience to do it, and got the role. Anyway, got the book, read the book Kimball and Margie Ross book and learned it, and that was my journey into data analytics. Off the back of that, it was all Kimball type stuff until I, I learned about this thing called data vault and it was like, oh, that’s interesting.
That’s an interesting thing, another type of modeling. So I tried to find projects in Brisbane. That’s where I live now, and there weren’t anything, there was nothing around about that time. 2013, 14 really wasn’t anything going on in that space at all. So eventually I managed to get a role with a consultancy.
This is five years later. It’s probably about 2018, 19, and that’s when I started getting into data vault projects and off the back of that and seeing the problems that there were with Data Vault. That’s where Hook came from. It is my attempt to solve the problems that I had encountered with the data vault, and that’s where I developed the ideas for Hook to have a simpler approach for data warehousing.
Shane: Yeah. It’s funny that Brisbane, for some reason was the ground zero for data vault in Australasia. It’s weird. Yeah. It seems to have the earliest data vault projects in this side of the world that I can find. Uh, it’s interesting. I’ve,
Andrew: but, uh, that’s, that’s.
Shane: Okay. Yeah, there’s a couple, I think a couple of PE key people.
I think Knowles, Sison and a few others. Yeah, I think they, yeah, I think they adopted it early in some Brisbane companies and then it just went from there. So it’s funny when you look at things and where they stick and start and get traction. Okay. Kimble to data vault to hook. Yes. What is Hook?
Andrew: Hopefully is my attempt to solve problems that I encountered with Data Vault.
So one, one of the key projects I worked on, and I’m might have to drop some names here ’cause I, I do that. I worked on a project while I was working with Deloitte, one of their clients, and John Giles was on that project. But I thought, this is great. This is one of my idols, absolute idols. He’d written literally one of the books on, on the subject, the Nimble Elephant, and I thought, I’m gonna learn something here.
Really. I’m gonna learn something. And the project, and I think John would probably agree to, this was an absolute disaster. It was awful. And then we had all sorts of issues. Now I won’t go into all the details of why it was a bit of a nightmare, but one of the problems is that Lloyd were put in there as a body to deliver a project.
They wanted to get to the end. Yeah. But they were thrown this bone that they had to deliver it in data vault. They wanted a data vault solution. Most of the guys in there weren’t really that well trained. But they did it all the same. And so they big consults, they never say no to anything. They never turned down.
They’ll never turn down work. So yeah, I’ve got involved in it. Stopped working with John and it was difficult. It was really difficult because we had three modelers trying to do modeling work. We had a team of engineers who were desperate for us to go and build stuff, and we were the bottleneck. We were always the bottleneck.
We had difficulties because we were working on different parts of the model and how to coordinate our efforts. So if somebody’s working on a. Dental model, somebody else is working on a physio model and they’re using different terminology. So it’s, are we talking about patients or clients or what’s the right language?
And because of that, we were just not getting stuff done. We were hitting bottlenecks. We haven’t had to have agreements to try and agree what terms are trying to bring models together. It just didn’t work. So it got to a point where I thought there has to be a better way. You know, it’s usually that question and some certain questions were like, why do we have to do the modeling upfront?
Why do we have to do this effort upfront to do the modeling so we’ve got a structure that we can target? Is there a way that we could perhaps not have to do that? So that was one of the main thoughts, and that came from a discussion that I’d had with one of the lead engineers. I said to him, it was a bit of a tongue in cheek question.
I said to him, I said, look, would it be acceptable if I gave you a satellite table that wasn’t connected to anything? No hub, no link. And he said, in all honesty, he said, yes, give me that because if you give me that, I can go and build the reports off the back of that. He didn’t care about the Volti stuff.
It was a hindrance. It was in the way. He wanted to get to the right hand side as quickly as he possibly could. So that got me thinking, okay, so if we maybe did that, presented the data in its raw format, at least the engineers could actually start to build something, and then could we layer the model over the top after the fact?
So that was the thinking behind hook, and that’s where it started to evolve from there.
Shane: So what is it though? So it’s a different modeling technique. So Hook is a bunch of modeling patterns, correct?
Andrew: I don’t like to say modeling ’cause I don’t think it really is modeling for me what modeling is when you take some data and you change it structure, that to me is what modeling is.
With Hook, you don’t do that. Whatever comes in is what comes out. The other end, all you’re doing is you are adding some additional information to it. So basically you are aligning those objects. Two business concepts. So business concepts that you would have for building hubs in, in data vault. It’s the same concept, but you’re not changing the underlying data.
You’re basically tagging those assets, aligning yourselves and formalizing the business keys.
Shane: Okay, so if we say though, that modeling is providing context to the data, and we may do that physically, we may do that logically with metadata. There’s different techniques on how we apply context and we may.
Change the data structure or we may not, we may change the data values or we may not. There’s a whole lot of techniques and patterns we can use to the way we apply a model. So let’s say that’s what we define for modeling for this conversation. And then I go back and I go with a dimensional model. I know I have a dimension object and I have a fact object.
So they’re physical objects. I can also say from a concept, from a context point of view. DIM is a core business concept, customer, product, employee, and it holds the attributes about that core business entity or core business concept. Next to it, it’s actually embedded physically in the same table, but it holds the key and it holds the attributes.
And then a fact table is the other object, and that basically holds kind of event concepts, order payment, those kind of things. And again, it’s holding all the attributes closely bound to it. I then go to Vault and I say we breaking the physical model out a little bit. We say instead of a dim, which holds the key and the attributes, we split it out into an ensemble.
Table to hold, the key table to hold the attribute. And then our link table, our relationship table, is effectively a weird form of effect. So if I can describe those in those terms, use that same. Pattern to describe hook, we
Andrew: only have really one object type, and it’s called a frame. I call it a frame. So basically you take your raw or your landed table and you wrap it, you frame it.
It’s a nice way to think about it. So the data isn’t transformed. It’s just a wrapper that you put around it. The additional things that you add to that are formalized business keys, which align to those business concepts. So you’re adding. Formal business keys and it’s the way that you build those business keys that makes hook what it is so that you get one table in, you’ll get one table out.
You don’t have to break out the model at all. All you’re doing is adding to it. So I guess the beauty of that is you could implement that as say a view. So you, you could create those hooks as views. Or you can have a physical table. You just add a column, calculate it, drop a column if you wanna get rid of it.
So it’s really agile in that respect. You don’t have to refactor the model at any point. You can add to it or you can drop things. If you wanna get rid of certain keys or you wanna refactor the keys or realign to different concepts, you’re not changing the overall structure. The granularity stays the same.
We are not doing anything complicated to it at all. We are just organizing it. The analogy I like is it’s like a library. You take a book, that’s your data set and you’re gonna put it on a shelf ’cause you want to go find it afterwards. So you put it in as a book. It’s on history. You put it on a section in the library, which is history.
Same kind of thing. If a data asset has information about customer, it’s got a customer key on there. You align it to customer, but you can also align it to order or whatever other concepts you have, you’ve defined. And that’s it. That it’s not complicated. It’s like data vault, I guess, but you’ve just collapsed all those business keys into the satellite.
That’s really all it is. Okay. Which effectively is a dumb right. No, because all the contextual information is there as well. Obviously with the Kimball model, you are having to restructure the data. You’re putting all your facts and your measures into fact tables. All your contextual information goes into the dimension you are juggling and moving things around.
With Hook, you don’t do that. It’s one for one. You don’t have to think about modeling. All you’ve gotta think about is where are the business keys? Identify those, formally define them, and that’s your model. And basically you’ve got a big bust matrix then concepts and assets or frames.
Shane: That’s it. So let’s do some simple stuff.
Yeah. Coming out of Shopify, I’ve got customer table and I’ve got orders table.
Andrew: Yep.
Shane: So we’re gonna grab the customer table from Shopify and the order table, we’re gonna bring that through effectively and any layer that we have. So let’s assume we have a layered architecture. Maybe we don’t, but let’s assume we do.
Each layer is effectively. Going to still have those tables. So when I’m at the consumer layer, the layer that the last mile tool hits to query, I’m gonna see a customer table and I’m gonna see an orders table. So that’s one of the core patterns that you’re talking about.
Andrew: Not quite. We need to talk about the scope of what Hook actually is.
So hook is about the data warehouse component. So it’s not about the ingestion. We are assuming that we’ve got patterns to ingest the data. It’s that data warehouse bit, which is, it’s the criteria, the subject oriented, integrated time variant and immutable. It’s that bit after that, you’ve got a consumption bit.
That’s when you have to do some modeling. But the idea is because you’ve organized things in the hook structure, what we’ve found is that you don’t need to do too much modeling. If you give that information to a business person, they’ve got the business keys identified in those tables. They can figure out how to join things up themselves quite easily.
So in the example you gave that you’ve got customers and you’ve got orders. So on customer, you’ll have a customer business key, I assume. So you formalize that key and you have a. HK Hook customer field in that table. So you straightaway look at that and go, I know that’s got a business key for a customer.
On the orders table, you’ll have one for the order. You’ll also have one for the customer. You’ll have two business keys formally defined on that structure. So straightaway, because I’ve got HK customer on both those tables, I know I can join them together. So you can just give that, I won’t say self-serve ’cause I think that’s a bit of a broad term, but you can give that to an end user and they can figure it out.
Yeah. So if they can do a lot of that legwork themselves, then when you want to do some specific modeling for specific use cases, because there’s some kind of more complex logic that you want to implement, it’s much lighter. You don’t need to do too much. And we use the term jam. As just enough modeling and the way you look at it, it’s like a pyramid.
So at the bottom of the pyramid you’ve got, we’ve ingested some data, which probably meets most use cases. You could probably give that to an end user, ’cause they’re probably quite familiar with the source systems that they’ve ingested data from. You got source system experts, they could probably work it out.
Then you add the hooks and that gives another layer of interoperability between those data assets. And then if we’re still struggling to, to meet specific use cases, we do the little tip on the pyramid to actually do those very specific use cases for the users. So it’s modeling light because you’ve organized, you don’t need to do too much modeling.
That’s the idea. But those model assets in the tip of the pyramid, really, you don’t need to do too much with them. It’s a small subset of assets compared to the bulk of the data you’ve ingested.
Shane: Okay. Alright, so again, just anchoring. So if, if we anchored typical, probably traditional now three tier data architecture, so three layers, let’s use the terms raw, designed to consume.
What you are saying is raw is persistent staging or whatever you wanna call it. It comes in, we ingested data, it’s stored. Whichever way that needs to happen. For that. The design layer, what we used to call the EDW, that is where Hook comes into its own. And then the consumer layer, we do some other modeling techniques to make the data consumable and reportable.
Okay. And so in that design layer, when you talk about hook keys in my customer table, are you saying that I’m going to append a column called HK and that’s gonna hold the key? Or am I gonna tag the column that has the key already?
Andrew: You need the value so you can actually join on it. So yes, you’ve gotta put a, I won’t say a physical because it could be a view, but yes, you’ve got another column.
You’ve got an additional column in there, but it’s named a certain way. And it’s calculated a certain way. So the, uh, the name of it, so you’ve got HK customer needs to be one of your core business concepts. So you’ll have a list of core business concepts. You don’t need an enterprise data model. You need an an agreed list of what those core business concepts are.
And, and that’s the same for any of these kind of techniques, even with data vault. But the thing is, that list is probably quite short. It’s not as big as people think. It’s not hundreds and hundreds. My belief is that number needs to be. Probably around about a hundred. It really is as low as that. And that probably covers most organizations.
I think even complex organizations, that’s probably enough business concepts. And there are techniques within Hook that enables you to do things like subtype, using a, a structure, we call it a key set. So this is a basically a qualifier that you put at the start of the business key to go, this is what this key actually represents.
Shane: So let me play it back to you. What you’re saying is, rather than what lots of people do, which is create the data model and then reverse engineer the data model into the business glossary. Mm-hmm. You define the business glossary. First, and then you apply that glossary effectively against the data to give us the model?
Is that what you said?
Andrew: All we’re saying is you can’t create a, a hook or Yeah, a formalized business key unless you’ve got a definition for that concept in your glossary. Yes, that’s right. It doesn’t stop you loading the data. You can still load that data, but we can apply that afterwards. But yes, that’s one of the hard rules in hook is that.
It has to be in the glossary. And you really should have a definition for it as well.
Shane: Yeah, we should, but often we don’t. Often we give it a name and then we define it after the face. And then you argue about, yeah, that’s right. And so therefore, by having a separate field in terms of a view or a physical field, that’s gonna allow us to deal with things like concatenated, business keys and all those other complexities, but we wouldn’t surrogate it, would it?
So that you wouldn’t typically have a surrogate as a. H key. It’s typically gonna be a business key.
Andrew: No,
Shane: you wouldn’t have a
Andrew: surrogate. What we’ve found is it’s, it’s nice to have something that’s readable. So you can put it up on a screen, you actually see it. But yeah, you could, you could, for example, build that string and then turn it into a binary or hash it.
You could hash it. Not a huge fan of hash it, I have to say. But generally, business keys are shorter than a hash anyway, so you can, you, you just got the full thing and readable. Exactly, and that’s the thing. It is readable. So that’s what you find. If somebody’s just query the table and they’re looking at it, you can actually see, oh, I know exactly what that key’s for.
And because it’s got a prefix, which is a key set, and you can also look that up. It tells you basically where that key came from. You can define it that way. Okay, explain it to me. So the key set, so if you can imagine you’ve got a customer 1, 2, 3, 4. So on its own doesn’t mean. Bag of beans. Really. If you have two systems that have the customer IDs in them two separate systems, you’ve got customer 1, 2, 3, 4 on one, and 1, 2, 3, 4 on the other, they’re not necessarily the same customer because those keys were generated on separate systems.
So what you can do is you can apply a key set. It’s basically a, a predefined sequence of characters, which tells you where that key came from. So you can go, this came from system A and this came from system B. So it’s a bit like the BKCC that you have on data vault, but it’s not quite the same. You can, it’s a bit more than that.
It’s not just distinguishing between systems. It’s given us something that we can use to basically give a bit more context around that business key. So the key set prefixes,
Shane: yep. Are held in a reference table. Is that what you typically do? We would need
Andrew: that to store that as metadata. So each business concept would have a set of key sets?
Yeah. So for example, one of the tables on the system I’ve just worked on, there’s a customer table and there’s two keys on it. There’s a customer id, and then there’s a customer code or customer number. So one is a surrogate key. The other one is an actual thing that gets printed on the. Customer’s Bill, they’re both legitimate business keys, so why not put ’em both in?
So to distinguish between them, you put a different key set at the front to say what it is, even though they identify to the same business concept, they are different ones. So you can look at one and go, oh no, that’s the id. And that one is the number. You can distinguish between ’em just by looking at them, so they’ll have a different
Shane: Yep.
But they’ve still both go in the same HK column.
Andrew: No, you’d have two. You’d have HK customer under, under ID and HK customer under underscore number or code or whatever it happens to be.
Shane: And then when we look at the order table, we’re gonna bring through both those customer HQ keys into the order table,
Andrew: only the one that
Shane: we’ve got
Andrew: in the orders table.
Only the one because yeah, we’re not gonna do any lookups when we ingest data because obviously that’s gonna slow your ingestion down. It’s gonna limit your scalability when you’re bringing the data in. You just bring the data tables in one for one, so you kick in. That can down the road a little bit in the, once you wanna start joining stuff back up together, you need to know which one to use.
But that’s why you use the qualifier at the end to to give it a name, to inform the user that I, yeah, I need to use the id. But at the end of the day, if they try and join on the wrong one, they won’t get any results because the key sets are different. An result, and they’re easier bugs to find when you get no data than when you’re getting some data incorrectly.
Shane: Kimball modeling dimensional modeling. We get templates for SCD type two, so we’ll proven templates that allow us to deploy that object, which we should all be using rather than just recreating it from scratch in a YAML file and db t and thinking that’s a good idea. And then within Data Vault, one of the benefits is there’s really only six bits of code Create Hub creates that Crate Link load hub loads at load link, and then all the exceptions.
For Hook. I’m assuming that the meta model, this model that holds these key sets and these keys, the kind of, I dunno what you call it, scaffolding or the skeleton or the config, that’s the bit that you can automate. I’m assuming that’s where it’s templated, it’s reusable. It’s a small set of hardened code that gets used time and time again.
Is that right?
Andrew: Yeah, so I built a little tool that does this. It’s cool. I call it hook studio ’cause it sounds real fancy, but it’s basically a little net application that, uh, desktop application, which at some point I really would like it to be a proper kind of web-based tool. Uh, vibe, vibe,
Shane: coding. Now come on,
Andrew: ai,
Shane: lovable vibe, coding.
Get in there. You can have it, you can have it in production by tonight.
Andrew: No, I, there’s people who can do that kind of stuff. I’m not the guy, but yeah, happy to talk to people about that. That’s absolutely fine. But the meta itself is. Straightforward there. There’s only a few tables, probably about 15 tables in the whole meta model.
Um, you’ve got hook, you’ve got key sets, you’ve got concepts and frames that, that’s the main ones. And you’ve got a few little frilly bits around the outside where you can put codes and conditions and that kind of stuff. But other than that, it’s a very simple model. So you just tag stuff. And then there’s little templates and engine in there which says, how do you wanna spit this out?
You can generate SQL scripts. What we’ve done at the last place I worked was we created YAML files, which go into a framework which. Does all the stuff. So you just go, here’s the definition of these frames. It goes away and just builds it and no effort required. It’s just a case of change. Metadata generate the YA file and it just handles it.
So if you wanna drop columns or drop cooks or change rename them or whatever. It does that kind of, so those are the kind of tools and it’s pretty easy to build. I’ve spoken to a few people about possibility of creating these tools. Most of ’em are, data vault died in the wall type folks, and they’re like, nah, not interested in talking to you.
But I’m sure there’s somebody out there who will take a punt on this.
Shane: Let’s just go back to those terms again to make sure I’ve got them. So key set was one?
Andrew: Yep.
Shane: What were the other ones?
Andrew: So there are four concepts. You’ve got business keys and key sets. You’ve got frames, and then you’ve got hooks. And hooks are what align the frames to the business concepts.
That’s essentially how it all stacks up. I’ve got pictures, but obviously on a audio podcast,
Shane: so lemme play that back to you. I have a list of business concepts. So I have a customer, I have a product, I have an employee, I have a supplier. I, uh, have basically a bunch of key sets that allow me to put prefixes against the IDs for those things to hold multi IDs if I need to, but also to determine the source system that’s giving us that business key.
So that’s a reference table that holds a set of unique keys or key types. Effectively, the hook is effectively the hk. It’s the formalized business key. Yes, that’s right. Okay. So that’s the fields I’m appending to the table that holds the keys I’m gonna use. Yeah. And then A frame is what, then? The frame is
Andrew: just the structure that holds all that stuff together.
So it is the view, or it’s the copy of the data, but it has the hooks included in it. I think it’s easy to view it when it’s a view. You go, you select star from the underlying table, and then you, you form the business keys or the hooks in there as well, and that’s the frame.
Shane: Okay, so within Data Vault we would talk about a hub and set being an ensemble.
Yeah. Within Anchor we would talk about ensemble. So it’s The Frame is effectively a group of those objects that we put together give us, it’s a
Andrew: simple wrapper, it’s a one-to-one with a source table. You put a wrap around it, but the additional fields that you put within that wrapper are the hooks. Which point to the business concepts.
Shane: Effectively, it’s a boundary. It’s an ensemble that says if you’ve got all these things together, then you can do the work you need to do. Yes, correct. And in your case, it is the source stable and the H keys and the relationship to the key set and the concept. Okay, that makes sense. That’s into the ensemble together, gives us the thing that actually works.
Yep. Okay. There’s gonna be a whole lot of things that this isn’t good at. What are they?
Andrew: Oh, well, I say that I was very, I’ve been very clear about saying what the scope of it is, because like data vault tries to do everything. It’s got solutions to try and do everything, and I think that’s why it’s hard to learn.
So my, my perspective, it comes in that data warehouse bit. That’s what it’s good at. That’s where it lives. So getting data into the warehouse. Is an engineering task, and it’s a very skilled and difficult thing to do because all our source systems are different. Trying to put a single pattern over the top of all those systems I think is impossible.
So why are we trying to do that? That sounds like a lot of effort when you just need to get a really skilled data engineer to go, here’s a API, here’s a CD, C type solution. That’s something that. It’s a very specialized skill. And those guys let the engineers engineer that. That’s the I’m saying, so it’s really focusing on that bit.
In the middle, it’s that warehouse bit. So once, so it’s basically the organization structure, which then makes it easier downstream to build the other assets that we need. The modeled assets, which, so the patent, I like to the term I’ve just come up with, it’s not ELT, it’s ELO. Extract, load and organize.
You’re not restructuring. There’s no T involved. And then somebody came back and said, surely there must be a, there must be a T in there somewhere. You have to do some restructuring data. And I said, yes, but that’s after the fact. But it’s a lot lighter. So it’s ELO and then a small T, not a capital T, because it’s a small amount of modeling.
Shane: So the heavy transformation work has been moved to the layer after the design layer, the consume layer. But what you’re saying is the hook modeling capability, the way you lightly model in that middle layer reduces the amount of transformation in the consume layer. Exactly, exactly. And that’s what we found.
Yeah. And also would increase the agility, will increase the ability for you to adopt change at lower cost and faster time. I haven’t thought about the use cases, the horrible ones that we know breaks every modeling technique to ask you the really hard questions. This is what we found though.
Andrew: One of the first projects that we used this on was I parachuted it into the project and they’d already built the bulk of the, the platform.
So they were ingesting data that it was, this is the whole Databricks thing, and then, yeah, it was going into a raw layer and then at the raw layer they were building facts and dims straight off that. It. It was horrible. It really was horrible. So I said, oh, you might wanna try this hook thing, because I think that’s gonna make that down the track a lot easier.
So we started doing that. We basically laid it over and they were doing everything virtually. They wanna do everything with views, which was okay. And it worked. We were able to build the frames. They weren’t called frames. Then I changed the name. They used to be called bags. So if you ever hear that term, it was a bag.
I never liked it. So frames seems a much better word. And we started building that out and there was no real impact on the architecture. We, we could just layer over the top. It’s just a nice layer over the top. And our dimensional modeling was easier, much easier. But what we found is that when those assets, those facts and dims are getting pulled into the BI tool.
The bi modelers would just take that information and they would just join the things back up together. So if you had, say, a project table, we’d produce a project dimension, a project fact, it goes straight into the, and they’d just join ’em back up together again. So the question was, why? Why are we bothering modeling?
Why are we splitting these things up? So we then started just passing stuff straight through to them, and they were absolutely fine with it. It’s absolutely fine. There were edge cases. There were things that they just couldn’t do in the BI tool. They were difficult things. I’ll give you an example. They were trying to do for projects.
They were trying to do a burn down chart. Now, the problem with the burn down chart, they were just basically seeing which tasks had been completed within a project. Now the problem is if you wanna do a burn down chart, you’ve gotta count the same tasks over and over on a different timeline. So the logic within Power BI was just too difficult.
So the Power BI developer said, can’t do this. Can you create me an asset that does that? I went, yeah, fine. So we basically took what we had out the hook database and we just built another asset, which did that kind of. Multiple aggregations cumulative over time and just delivered that and it’s had the same hooks in there.
Everything was the same. It is just that we built that very specific asset for that specific use case. Pass that over job done. So we only model by exception. We don’t model absolutely everything. If we can take it out of, if you wanna call it a medallion, I know it’s a horrible thing, but if you wanna just take things straight outta silver into Power bi, absolutely fine.
If you can’t do that easily, we create a gold asset. Which does the specific logic that we need for that particular use case.
Shane: Okay, so when you’ve got SAP and that horrible unusable data model from a BI analytics point of view, then you’re gonna have quite heavy consumer layer because absolutely there’s heavy lifting that has to happen.
Whereas if you’ve got a simple system where everything’s been typed nicely in that system, there is a customer table, there is an order table, there is a product table, there’s no complexity. Then what you’re saying is just add the hook attributes. So give it context to give it a view of business and then let them hit that.
And you don’t need to do any heavy lifting. Apologies to whoever coined this ’cause I can’t remember who it was. But if you do actually listen to the podcast, let me know and I’ll add it into the show notes. The classification was data recording to source data causing your business data, recording to consumption in terms of a layered architecture.
So what you are saying is the raw layer, the first layer is data according to source.
Andrew: Yeah.
Shane: The design layer actually is physically structured as data according to source, but logically or conceptually structured as data According to the business. Organized, yes. Yep. I would say
Andrew: organized around
Shane: those terms.
Yes. Yep. So we’re putting business context against the raw physical structure. Yes. Okay. And then data according to consumption is, if you can use. The middle layer, if you can use the hook layer Yep. Then use it. But if not, apply some heavy lifting in the, in that last consumption layer to make it fit for purpose for reporting.
Absolutely.
Andrew: Right.
Shane: Okay. Interesting idea. And so then I can see that I can, consumption layer could be dimensional. If the tool prefers facts and dims, it could be one big table. ’cause we could hydrate multiple of those as physical views. We could blow them away, we could rebuild them. We don’t care. It could be activity schema, you know, because we could effectively, if we wanted to do the windowing functions and get the value outta that physical modeling technique for reporting, then we could do that.
We could hydrate multiple physical models and they all virtualized of those models in that consumption layer. Yeah. Interesting. So for purpose model.
Andrew: Sorry. One thing we did look at was the USS, the Unified Star schema. We looked at using that and actually we ran a few experiments and that was actually quite good.
It works quite well with Hook. All you’ve gotta do is you’ve gotta, on each of those assets, you need to create a unique identifier surrogate. So you do that and then you just basically joining things up together using that Papini bridge in, in the middle. Yeah, we tried that. That seemed to be okay. But yeah, there’s different ways.
It doesn’t have to be that end layer doesn’t have to be a dimensional or whatever. It can be a. Flat wide table. It could be, yeah, another kind of hook looking table. That’s fine.
Shane: Let’s go through some of the other things that the middle layer tends to have to deal with his data. Yes, typically we would end date records or we would put a is current flag on, or we would use the A date and then use a windowing function to be able to see which record is valid at that point in time.
How are we gonna deal with that within the design layer using Hook? That
Andrew: isn’t a hook thing, but that’s absolutely something that we’ve done. The place I’ve just left is we did that, the bronze we. Loaded into, and we put the hooks on there. We actually put ’em in the bronze layer, so we had an organizational structure on the bronze layer.
When we then moved it over into silver, that’s when we applied that ro, effective from ro, effective to RO is current. RO is deleted. We added those fields on, so using those same techniques, using the windowing functions, but it’s all patented. It’s all the same. I guess the only difficulty that we found is depending on the source is what.
What time field do you use to represent your time dimension? And we can have all sorts of arguments about whether it’s a business thing or a warehouse thing or what you’ve been there,
Shane: so Yeah. Yeah. But the key thing is it doesn’t matter which physical modeling technique we’ve, we are working with. We always have to deal with the temporal nature of the data and how do we know a change was a change and how do we sequence the changes so they don’t get outta sequence when the source systems do bad shit to our data?
Yeah, yeah. And what do you do with late
Andrew: arriving data and that kind of stuff? It’s like same problems, doesn’t matter. No technique in the world is gonna solve that. No way. No. So yeah, that, that’s the kind of pattern we use. So that was our silver layer. So anything in silver, it has those standard columns on, you can see them.
So you have that time dimensionality. And to be honest with you, a lot of our users. Still didn’t get it. It was like, you’ve got history, you can travel in time. ’cause they were so used to running their reports on a particular date because they were using current data. So when we moved over to the warehouse and suddenly we’re giving them history.
You say you can go back to a point in time and run the query as it was at that point in time and they don’t always get it.
Shane: Yeah. Actually what we do is we actually provide two consumer layers. So let’s take that order table. We will provide a current state view of the order table. And we will provide a ized view of the order table and the reason as data people, we know that ized data has value.
We know be able to say, what was the order value? I. At 31st of March compared to now, because we know it may have changed. We know that, and we know that’s hard if we don’t bake that in as a core pattern in the beginning to hold that history. We know trying to retrofit it is a really large piece of work for us, so we wanna bake it in early because we know it has value.
But 80% of the time, air quotes on the 80%. The end users, the key, the information. Consumers don’t give a shit. They’re like, just tell me right now. I have an edge case where I need to know what it was a month ago, but nine times outta 10, I just need to know what it is now. So why do we give them this complex consumer layer that they then have to filter?
Why don’t we just give them two? If you wanna know what it is, hit this table. It’s smaller, it’s faster, and it’s easier. And if you want to go back and figure out what it was. Then go hit this bigger one and do some more work. Because you’re asking a more complex question. And that’s my view now is that we shouldn’t try to do one thing to bore the ocean.
It’s just stupid.
Andrew: Yeah, no, absolutely. That’s the approach we took. It was just one asset. So you have that rowers current flag you just filter on that gives you the type one behavior. It’s the same kind of thing, but it’s just an educational case. They’re just, just telling these guys how to use it. And I think, um, it’s been simple for them.
’cause once you’ve got those few rules and you’ve got the hook keys in there. They can work out most of the cases themselves. It’s just like they can see what they can join to what We don’t have to think about a data model. You don’t have to think about it.
Shane: And then master data. So I’ve got a bunch of customer records coming from system one and bunch of records coming from system two and they don’t have a shared key.
So now I’ve gotta do a whole lot of heavy lifting to infer that 1, 2, 3 is the same as a, B, C. I’m assuming that all happens in the consume layer because that’s the heavy lifting, but is there anything in hook that makes that easier?
Andrew: Nothing other than the fact that the way that you formalize those keys, at the end of the day, you’ve gotta create an asset which does the mapping between one key to another key, and that’s just another asset.
So you can think of that like a, I always think of it like it’s just another source system. We fed information into it. It’s done. Its magic. It’s trying to figure out some things based on what information it has, and it spits out a mapping, you just ingest that. It’s got the two keys in there. You just define hooks over the top of those keys in exactly the same way.
You can then use that as a, a bridge if you like, like a same as link and data vault. Same kind of concept, but end of the day it’s just data. It’s the same thing. It’s just a set of business keys that Matt one to another.
Shane: And then when we’re inferring data, so let’s say we’re gonna create a table with a bunch of attributes, feature flags, for an analytical model, say customer segmentation.
Andrew: Mm-hmm.
Shane: So I want to go and infer some values. I wanna create some columns that go number of orders over the last month was less than 10. Again, all that type of work’s happening in the consume layer, not on the design hook layer. The kind of logical
Andrew: architecture I always had was that, yeah, you’ve got the data warehouse layer and then you’re either going into a consume layer or you have some business logic.
You have something that does some crunching, that spits some data out, and then you can reest it. So those kind of, at the end of the pipeline that gets fed back in again. It’s just another source. It’s just that it sourced this information from the data warehouse, but it is treated the same as any other source that that’s the way I view it.
Shane: Yeah. And then I tend to use a, a language now of fact measures and metrics, and the way I describe it is fact is a number that came out of our source system. It’s a number, it’s immutable. I can go into the source system and I can see a screen that said the order value was $100, and so that is a fact. So the problem with these terms is it goes against Kimball.
That’s why people are look weird at me. But if we think about it from a business point of view, I have a fact, it isn’t a source system. It is immutable. That number exists. I don’t need to do anything with it. That number is there. So order quantity, order value, and then I have the concept of a measure. I.
So a measure is a simple calculation. It is a sum, it is a count. It is a divided by B, A minus C. It’s those simple formulas that we use. And then metrics are where we bring in complex business logic. So A divided by C, divided by B minus F. So you know, take the total revenue and minus the total cost, and then divide it by the number of periods to give me the average value over time.
Right? So I can see the facts are gonna come from source into the first layer. Let’s call it raw again. Then I’m gonna inherit those facts directly into designed, and then I can consume them the measures and the metrics. Because we’re inferring or calculating things, they’re gonna be in that consume layer.
So that’s where we’re gonna push all that heavy lifting again. But what you’re saying is that sometimes you’ll actually have a four layered architecture, so you’ll go raw. Then you’ll go designed where you’re applying the hook logic to give it business context, and then you’re gonna go into another layer where you start augmenting it or inferring some data that’s useful and then you’ll hydrate it into consume so that it can be used by whatever tool in whichever way they want, and that will be quite volatile and just and disposable.
Andrew: It’s interesting because we’ve had long debates about what’s a measure, what’s a metric? And so the definitions you just gave there to me, they’re almost the same say the difference between is complexity, but at the end of the day, a measure is something that’s calculated from something else. So is a metric.
And whereas a fact
Shane: I know, but the problem is I wrote a course to teach some people and I had to teach them what things were things. And so. I did a hell of a lot of research to go, somebody’s done this. Somebody has a language for these things that is so simple that I can just pick it up and I could teach somebody.
That was my theory and, and this is pre-chat GBT, so actually I probably need to do it again, but I couldn’t find it. I went through all the training material I’d done over my many years as a person learning. I went through all the books that I had. I went through everything, and I was like, okay, I can’t find a simple description, so I invented one now.
I can argue that a measure is the same as a metric, but actually I can give you a clear delineation between them and that that factor is a number that turns up from the source system that we don’t touch. That’s a fact. A measure is a very simple calculation, so it’s a sum, it’s a count, it’s a minus, it’s a plus.
It’s a divided by, and a metric is everything else that is really complex. So
Andrew: for me, they work. Okay. That’s fine. That’s fine. Yeah. It doesn’t matter what language you use, as long as people understand it and we’re consistent about how we use that language, and I think that’s often the problem that we have in our business, and especially with business concepts.
Even in that, you can talk about the same thing and two people could be, have totally different pictures in their heads.
Shane: I had Remco on ’cause with his book about Elm M logical modeling and the example he used was he was working in a hospital environment and he was running a workshop and somebody wanted to count beds and he, oh, that’s easy.
We don’t even have to worry about defining that ’cause the beds, one of those physical things that people lie on. And somebody said, oh, not for us. And he goes, what do you mean? And they said, bed is actually the room where the patient is. And it may or may not have a physical bed because we move them in and out.
But that’s what we count as a bed. And somebody else says, holy shit, no way. A bed is a thing with four wheels that we wheel in with the patient on and it’s made of metal. And so again, even in that environment, they had a slightly different interpretation of bed. Absolutely.
Andrew: I had that exact same conversation with New South Wales Health and we were doing a data modeling session.
And it was exactly the same thing. They said a bed for them was a as a location. I says, should it be called that? Because a bed’s got like a barcode on, you can scan it, it moves. Whereas, so what is another name for it? And they were like, oh, it’s a bed. And oh, is there another? And eventually they arrived at bay.
It’s a different concept and it’s just getting everybody to use that same language and that is generally problematic. Have you read any anything from a guy called Ronald Ross? Have you read his book at all?
Shane: The business language? The Business Rules? Yeah.
Andrew: Very good book. And he talks about concepts. The fact that they don’t exist on paper, they exist in people’s heads, and two people will see the same thing.
With the same amount of information and perceive them totally differently because of their experiences. So how do you align people with a concept to think of the same thing? I’ve got an example, and again, this won’t be very visual. It is visual that I’ve done a a training course where I said, do you know what a truncated dote icosahedron is?
Truncated icosahedron, do you know what that is? No. No. So if I was to give you a picture of one. You’d probably have a pretty good idea. So geometric shape’s kind of spherical shape. So I give you a picture of that. You go, okay, I’ve get a vague feeling. What is, if I gave you a definition for it, it’s a big, long thing saying, oh, it’s got this many faces and edges and all this kind of stuff.
But if I gave you a picture of a soccer ball, you’d go, I know what that is now.
Shane: Right?
Andrew: Immediately by giving examples of things is a really good way of aligning people’s thoughts in terms of how they perceive a particular business concept. Yeah, it’s really important. Language is important, but obviously concepts and conceptual understanding is really important.
Shane: That’s why I’m a great fan of Lawrence CO’s beam methodology because we effectively start by using beam. When we use beam, we start by capturing the data values as examples, and then we go and get a conceptual term for those values because now we have something we can point at and say, what is that? And then can I find it?
Can I see it? That’s customer. Okay. If I went to look for one, what would they look like? With data? People, we love to argue. Semantics is semantics. So what is a semantic layer? What is a data product? We can’t even agree amongst ourselves, and yet we expect the stakeholders to agree conceptual terms that are quite hard to understand in a drop
Andrew: of a hat.
The idea is just to be really light on the modeling because that, to me is problematic. I think it can be a bottleneck in a lot of endeavors in in day of day warehousing. So the idea is to try and take that out the critical path, apply it after the case, rather than have to do that upfront and build a model that.
Your source data doesn’t align to, so you’ve then gotta go and change it anyway.
Shane: Okay. Anything else you wanna cover off about Hook before we close it out?
Andrew: No, not at all. I’ll say that I’ve been speaking to Peter Hanson and he’s asked me to do a talk in Brisbane next week, and then he may mention of maybe going across to Auckland and Wellington next month.
Oh, that’d be good. That might be happening if I am in town. I will say hello. Yeah,
Shane: definitely. I live an hour outta town if you’re coming to Wellington. Oh yeah. So gimme notice. But yeah, I think the date engineering meetup in Wellington is two days from now. So we’re at the 5th of May, so it would’ve happened by the time this goes out, which means that it’s probably gonna be first week of June.
Potentially he was saying towards the end actually he was suggesting Okay. Towards the
Andrew: end, but I dunno yet, he hasn’t given those details.
Shane: Okay. If you can get the details before I publish this, I’ll shove it in and then people can come along and see a presentation about Hook and ask you the hard, gnarly questions that I forgot to ask.
Andrew: Deflect and I, they’ll deflect some more.
Shane: Excellent. So if people wanna find out more about Hook and more about you, where do they go? Good. I’ve got a lot of
Andrew: articles and arguments I’ve had with people, especially data people online. So go and have a look there. I do have a substack. It’s a little bit outta date.
I do need to refresh it, and I’ve got. The guts of a book as well, a short introduction, which I’m hoping to get out freely in electronic format in the next few weeks. So yeah. Good luck with yours. I’ll see it on the shelf there. It’s all really good. Yeah.
Shane: Yeah. Actually, for some reason, 2025 seems to be the year of data books, so there are a large number of very good books, including mine coming out.
So it’s good. It’s good that actually it’s the year of knowledge sharing. Yeah. Lemme know when your one’s out and I’ll let everybody notes here and they can go have a look at it. That’ll be pretty cool. Yep. Excellent. Alright. Hey, thanks for coming on and going through it because I’ve now got a much better understanding than I had an hour ago, which is always a good thing.
Always good. Yes. It’s all about, yeah, sharing is caring and yeah, sharing is caring. Thank you very much. That’s true. I think that’s one. Excellent. It’s not my term, but it’s one that I use regularly and I got no idea where I got it from, but that’s okay. Excellent. I hope everybody has a simply magical day.