Conceptually Modeling Concepts, Details and Events in AgileData
Join Shane and Nigel as they discuss how and why we define a conceptual model of Concepts, Details and Events in AgileData and how we map these to a physical Data Vault model.
Concepts, Details, and Events: These are the three types of objects in AgileData. Concepts are things like customers or products, Details describe those concepts, and Events are relationships between concepts at a point in time.
Data Vault Modeling: This approach aligns well with traditional Data Vault modeling. Concepts are like Hubs, Details are like Sats (Satellites), and Events are like Links.
Dimensional Modeling: This approach also aligns well with traditional dimensional modeling. Concepts are like Dimension keys, Details are Dimensional attributes, and Events are key relationships off a fact table.
Combining Data from Multiple Sources: We discuss how to combine keys from different systems into a single concept or separate them based on rules. This can help in creating a single view of a customer or other entities.
Automation and Extension: Data vault modeling allows for easy automation and extension of tables, avoiding common problems in traditional data warehousing.
Agile Approach: By taking an agile approach, teams can model data quickly and adapt as needed. This allows for incremental value and the ability to tackle specific problems as they arise.
Factless Facts and Events: We explore the idea of events that are not driven by numerical values but by relationships, and how this can be modeled within the Data Vault paradigm.
Rules for Master Records: By applying specific rules, the complexity and ambiguity in source data can be removed, leading to unique or master records.
Business-Friendly Terminology: The use of terms like concepts, details, and events makes the data vault approach more accessible to business analysts and non-technical stakeholders.
Shane Gibson: Welcome to the “AgileData” podcast. I’m Shane Gibson.
Nigel Vining: And I’m Nigel Vining.
Shane Gibson: And today Nigel and I thought we would do a bit of a deep dive into what we call concepts, which are one of the three types of objects that we create and store data as within agiledata.io. The three objects that we store are concepts, details and events. The 30-second description I use is that a concept is a thing we care about, something we need to manage. So an example of a concept would be a customer, a product or an order. A detail is something that describes those concepts and provides some context. So it might be the customer’s name, the product’s type, or the order value. And the third object is an event. An event is where a group of concepts have a relationship at a point in time, and events are normally articulated as a core business process. So we would say there’s an event of a customer ordering a product, and we store data in a certain way when we see those events. So Nigel, under the covers, everything we do in the event layer is stored as either a concept, a detail or an event once we’ve defined it in config. Was that a lot of work? When we implemented it, was it months and months of programming to get the patterns for those objects?
Nigel Vining: Actually, the Data Vault modeling tends to be very nicely aligned to the actual code under the covers that makes these things work. I was going to preface this by saying my background is Kimball. So when I historically modeled data, I always thought in terms of dimensions and facts. But it was a very small learning curve and adjustment for me, because effectively a concept is the key of a dimension, a detail is effectively all the attributes of a dimension, and an event is effectively all those keys off a fact table. So for me to go from the facts and dims world to concepts, details and events was actually really easy. But what I love most about the Data Vault C’s, D’s and E’s is that they are wickedly easy to automate under the covers, because the pattern for C’s is really straightforward: it’s basically a table with a whole lot of keys in it. D’s are not much more complex: a table with a whole lot of attributes in it. And an event, again, is a table with multiple key columns in it. So under the covers, we can automate the snot out of creating and maintaining those, and we can even extend them on the fly really easily. We don’t end up with that situation known to developers all over: once you’ve built your dimension table and deployed it, a month later someone comes along and says, we’ve missed an attribute out of it, we need to put it in. What you end up with is dimension tables in mature warehouses which have your date columns out on the right, and then there’s another two, three, four, five, six columns after those because they’ve been added on after the fact. The Data Vault paradigm lets us effectively extend our detail tables at any point with no real hit.
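The three table patterns Nigel describes can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not AgileData's actual code: the function names, the `_concept`/`_detail`/`_event` table suffixes, the `loaded_at`, `effective_from` and `event_at` columns, and the use of `ALTER TABLE` to extend a detail table are all assumptions made for the example.

```python
import sqlite3

# Hypothetical helpers: names and column choices are assumptions for illustration.
# Table names come from trusted config here, so f-strings are used for identifiers.

def create_concept(db, name):
    # A concept is basically a table of keys.
    db.execute(f"CREATE TABLE {name}_concept ({name}_key TEXT PRIMARY KEY, loaded_at TEXT)")

def create_detail(db, name, attributes):
    # A detail is the concept key plus a set of descriptive attributes.
    cols = ", ".join(f"{a} TEXT" for a in attributes)
    db.execute(f"CREATE TABLE {name}_detail ({name}_key TEXT, effective_from TEXT, {cols})")

def create_event(db, name, concepts):
    # An event is a table with multiple concept key columns and a point in time.
    cols = ", ".join(f"{c}_key TEXT" for c in concepts)
    db.execute(f"CREATE TABLE {name}_event ({cols}, event_at TEXT)")

def extend_detail(db, name, attribute):
    # Adding an attribute later does not require rebuilding the table.
    db.execute(f"ALTER TABLE {name}_detail ADD COLUMN {attribute} TEXT")

def columns(db, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [row[1] for row in db.execute(f"PRAGMA table_info({table})")]

db = sqlite3.connect(":memory:")
create_concept(db, "customer")
create_concept(db, "product")
create_detail(db, "customer", ["name"])
create_event(db, "customer_orders_product", ["customer", "product"])
extend_detail(db, "customer", "email")  # the attribute someone asks for a month later
```

Because every concept, detail and event follows the same shape, the only inputs that change are the names, which is what makes the pattern easy to drive from config.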
Shane Gibson: I think it’s important to explain that under the covers we use the majority of the Data Vault modeling patterns, and why we use different terminology. Anybody that’s used to Data Vault will be familiar with hubs, sats and links, or hubs, satellites and links. For us a concept is effectively a hub, a detail is effectively a satellite, and an event is effectively a link. The reason we call them concepts, details and events, though, is that our audience is analysts and business people. When we talk about hubs, satellites and links, those are quite technical terms that really don’t make sense to them. But when we say we have a concept, which is something you manage, so you have a concept of a customer, you have details that describe that customer, and that customer is part of a business process, so there’s an event that happens at some stage involving them, the business people understand those terms. It was funny, I was on one of the Data Vault meet-up things that we do globally every year. It was meant to be in person this year, but unfortunately it was remote. We were having a bit of a chat about the software market at the moment, and how Data Vault has been around for a long, long time and lots of people are using it in their data platforms as a modeling technique, but the tools themselves are pretty immature or have low adoption. And my standing joke was: why? Because if you’re going to create a Data Vault, you only need six bits of code. Create concept, create detail and create event, where you create the table, and load concept, load detail and load event, where you populate it. So there are only six bits of code to make your whole data platform work with Data Vault.
Nigel Vining: Yeah, that’s exactly right. And if you get to a level of standardization under the covers, you can actually get that down to two, effectively, which is create your table and update your table. Because the patterns share some commonalities, you can smudge the lines a bit, and that simplifies it. It’s funny you say that. I had completely overlooked the hubs, sats and links, because I’ve been calling them concepts, details and events for so long. I’d actually forgotten where they came from.
Shane Gibson: Well, maybe we’ll come out with Data Vault three and we’ll rename them. Probably not. In the Data Vault world, there are lots of arguments on the detail. As I say, a group of architects is an argument of architects, and a group of Data Vault modelers is an endless argument about the particular use of a single field. So we have adopted some things that work for us, for the way we work with our customers, and for the way the Google platform that we leverage under the covers gets the best value out of that. So let’s talk about these concepts. Nigel, in the dimensional world, we have these things called dimensions. A concept is almost like a key on the dimension, but sometimes it’s not. I’ll give you an example: dates. We don’t have the concept of a date, and we don’t hold that as a concept. A date is not a thing we manage; the date is a piece of detail about something else. It’s the date that an order was placed, or it’s the date that the customer paid the invoice, so that’s bound to the invoice. So we don’t hold dates as concepts like we did in dimensional modeling. There are some slight tweaks in the way we use Data Vault modeling compared to the way we do dimensional modeling. But you’re right, it is fairly similar. And that also means, doesn’t it, that once we’ve done this modeling in the event layer, we can quite easily present all the data in a consumable way as dimensional star schemas again, if we’re using a tool that consumes star schemas by choice.
Nigel Vining: Exactly. It’s no effort whatsoever for us to put a C and a D together and call it a dim. The same with the event table: we can easily join an event table back to its C’s and D’s and call it a fact. So by utilizing an easy-to-implement pattern, we can turn C’s, D’s and E’s back into a collection of dims and facts, and the tool that sits on top of that, the BI tool of preference, will read those quite happily and not realize that under the covers it’s actually something else.
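A rough sketch of that presentation pattern, again using Python's sqlite3 module. The table, view and column names are hypothetical, and the detail tables hold a single current row per key, which is a simplification; the point is only that a concept joined to its detail reads like a dimension, and an event table reads like a fact.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer_concept (customer_key TEXT PRIMARY KEY);
CREATE TABLE customer_detail  (customer_key TEXT, name TEXT);
CREATE TABLE product_concept  (product_key TEXT PRIMARY KEY);
CREATE TABLE product_detail   (product_key TEXT, product_type TEXT);
CREATE TABLE customer_orders_product_event (
    customer_key TEXT, product_key TEXT, event_at TEXT);

INSERT INTO customer_concept VALUES ('C1');
INSERT INTO customer_detail  VALUES ('C1', 'Acme Ltd');
INSERT INTO product_concept  VALUES ('P1');
INSERT INTO product_detail   VALUES ('P1', 'Widget');
INSERT INTO customer_orders_product_event VALUES ('C1', 'P1', '2024-01-05');

-- A concept joined to its detail looks like a dimension.
CREATE VIEW dim_customer AS
SELECT c.customer_key, d.name
FROM customer_concept c
LEFT JOIN customer_detail d ON d.customer_key = c.customer_key;

CREATE VIEW dim_product AS
SELECT p.product_key, d.product_type
FROM product_concept p
LEFT JOIN product_detail d ON d.product_key = p.product_key;

-- The event table already has the shape of a fact: keys plus a point in time.
CREATE VIEW fact_customer_orders_product AS
SELECT customer_key, product_key, event_at
FROM customer_orders_product_event;
""")

# A BI tool would query the views as an ordinary star schema.
star_rows = db.execute("""
SELECT dc.name, dp.product_type, f.event_at
FROM fact_customer_orders_product f
JOIN dim_customer dc ON dc.customer_key = f.customer_key
JOIN dim_product  dp ON dp.product_key  = f.product_key
""").fetchall()
```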
Shane Gibson: And I think the other thing is, when I’ve worked on some of those dimensional-only data warehouse projects that didn’t go so well, we were commonly doing something called a factless fact. So what was it?
Nigel Vining: A factless fact. So with a factless fact, we were typically trying to join something together. I’m trying to come up with an example of one, you’ve put me on the spot. With a factless fact, there’s nothing to measure, but we needed to join two things together. So you threw that one in there?
Shane Gibson: Actually, I’m trying to think. I normally treat them as hierarchies, but that’s probably not a good example. So they’re an exception in the model. There is an exception to the fact modeling technique, to the dimensional modeling technique: the factless fact was something that was used where there was no value for that fact happening. So it wasn’t an anti-pattern, effectively. And the point I was trying to get to is that within the way we model, you can have an event that’s not driven by an order value. You can have a set of concepts that have a relationship at a point in time. And that event table we have, where we say these concepts have this relationship as at this point in time, is valid. It doesn’t have to be a numerical value driving the fact that that relationship happened. There just has to be a relationship, and the data to say so.
Nigel Vining: Absolutely. And it’s quite a nice real-world thing: something happened. These three things intersected at this point in time, a person and something they did. I can’t think of an example there either.
Shane Gibson: I’ve just Googled it: student attendance at a class.
Nigel Vining: So I guess you could count that and say it’s one attendance. But that’s basically as far as the metric attached to that fact would go. You could say a student attended a class, counted once. But otherwise, you’re right, it’s just three keys: a student, a class. I guess the attendance is actually the count that they attended.
Shane Gibson: I think with factless facts, what was weird was that there was a count of one, and the one didn’t exist in the data, so we had to make it up. Whereas for our events, student attends class is a classic event. We just have the student concept, the attendance concept and the class concept. And where we see a relationship where those three things happened in the source system, we store that as an event record to say that at this point in time these three things had a relationship; we saw them happen. So there are probably some complexities when we start talking about concepts, because our source systems are never clean. They’re never beautiful, and they don’t make our life easy. So in your experience, what are some of the more complex patterns, techniques or recipes we have to use when we’re creating these concepts?
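The student-attends-class event can be sketched as a record holding only keys and a point in time. This is an illustrative Python sketch with made-up keys and field names: there is no stored measure, and the count is derived by counting event records rather than by inventing a column that always holds 1.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class StudentAttendsClass:
    # Three concept keys plus the point in time; no numeric measure is stored.
    student_key: str
    attendance_key: str
    class_key: str
    event_at: str

# Hypothetical sample events as they might be observed in a source system.
events = [
    StudentAttendsClass("S1", "ATT-001", "MATH101", "2024-03-04T09:00"),
    StudentAttendsClass("S2", "ATT-002", "MATH101", "2024-03-04T09:00"),
    StudentAttendsClass("S1", "ATT-003", "HIST200", "2024-03-05T13:00"),
]

# The "count of one" is implied by each record existing, so we count records.
attendance_per_class = Counter(e.class_key for e in events)
attendance_per_student = Counter(e.student_key for e in events)
```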
Nigel Vining: I guess the one that springs to mind, what if we have multiple source systems feeding into our data warehouse? Are we combining those keys together into a single concept? Are we storing them separately? What do we do with keys coming from different places?
Shane Gibson: So the answer is, “Yes, one of those.” They’re just choices. What we can do is define source-specific concepts. So we could have a Salesforce CRM that has a customer record in it, and maybe we’ve got a Xero financial system that’s got a customer record. Salesforce is holding the events around communications and service calls, and Xero is holding the records around invoices and payments for those customers. So we can create two concepts: we could create a CRM or Salesforce customer concept and populate that with the Salesforce data, and we could create a Xero or financial customer concept and populate that with the Xero data. We would recommend that they have different names, so either financial customer and CRM customer, or Salesforce customer and Xero customer, so we know when we’re looking at the data which concept we’re looking at. Now, we may then want to have a single list of customers. So what we can do then is create another concept called customer, or single view of customer, or master customer. And then we can take the two concepts and slam them together. Now, if the key that identifies a unique customer in Salesforce and a unique customer in Xero is well managed and synchronized, so if you are customer 99 in Salesforce you are customer 99 in Xero, then just slamming those keys into that one concept is magic, and we’ll do it all for you. Nine times out of ten, that’s not true. Nine times out of ten, the way we identify those customers, sorry, identify rather than describe, is different. So then what we have to do is write a rule to say how we are going to deal with it. If the identifiers are unique for each system and they don’t overlap, there are no collisions. So if customers in Salesforce start with letters, A, B, C, D, E, F, G, and customers in Xero are 0, 1, 2, 3, 4, 5, we can send them to the same concept.
But we won’t get the de-duping, so McDonald’s in Salesforce and McDonald’s in Xero won’t give us back, as one customer, all the records and all the events from both systems. So what we have to do then is write a rule. That’s a matching rule that says, how do I know that McDonald’s in Salesforce is the same as McDonald’s in Xero, and then what do I do about that? We end up creating one record, one concept record, for McDonald’s. We may find out that, funnily enough, there are actually six McDonald’s customers in Salesforce, but they’re all the same customer because we didn’t use the system properly. So we’ll write a rule there to, again, create a single ID for McDonald’s in that single list, that concept.
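The matching rule described above can be sketched roughly as follows. The rule used here, comparing names after lowercasing and stripping punctuation and spaces, is a deliberately simple assumption chosen for illustration; real matching rules are usually richer. The sample keys, names and function names are hypothetical.

```python
import re

# Source-specific concepts: each system keeps its own customer keys.
salesforce_customers = {"A001": "McDonald's", "A002": "McDonalds ", "A003": "Burger King"}
xero_customers = {"001": "MCDONALD'S", "002": "Wendy's"}

def match_key(name):
    # Hypothetical matching rule: lowercase, then drop anything that is not a letter or digit.
    return re.sub(r"[^a-z0-9]", "", name.lower())

def build_master(sources):
    # Map each master key to every (system, source key) pair that the rule says is the same customer.
    master = {}
    for system, customers in sources.items():
        for source_key, name in customers.items():
            master.setdefault(match_key(name), []).append((system, source_key))
    return master

master_customer = build_master({"salesforce": salesforce_customers, "xero": xero_customers})
```

Keeping the source keys alongside the master key means the Salesforce customer and Xero customer concepts stay queryable on their own, while the master concept gives the single list, including the duplicates that exist within one system.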
Nigel Vining: Okay.
Shane Gibson: That’s just one technique or recipe that we use for that. The other one that’s really important is this idea of details. So we have details coming out of Salesforce for McDonald’s, and we have details coming out of Xero for McDonald’s. We may have an address, and the addresses may not align. So we will normally create a detail table with the Salesforce detail and a detail table with the Xero detail, so we can always see what the data looks like in the source system, or the data factory. And then we have to make some choices. What do we pick as the single address that is the master? And again, that’s just rules. Is it the last one that was updated? If we’ve got six McDonald’s, do we say the address that’s seen the most often is the winner? Do we say Salesforce is normally right and Xero is normally wrong, so if you see one in Salesforce, use it, and if you don’t, grab the one from Xero? So again, they’re just rules that we use to identify what’s important.
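The three address rules Shane lists can each be written as a small function. This is an illustrative sketch with made-up records and function names; which rule is right depends on the organization, and the source priority order shown is an assumption.

```python
from collections import Counter

# Hypothetical address details for one master customer, kept per source system.
addresses = [
    {"source": "salesforce", "address": "1 Queen St", "updated_at": "2023-03-01"},
    {"source": "salesforce", "address": "1 Queen Street", "updated_at": "2023-06-01"},
    {"source": "xero", "address": "1 Queen St", "updated_at": "2023-01-15"},
]

def latest_updated(records):
    # Rule 1: the most recently updated address wins (ISO dates sort as strings).
    return max(records, key=lambda r: r["updated_at"])["address"]

def most_frequent(records):
    # Rule 2: the address seen most often wins.
    return Counter(r["address"] for r in records).most_common(1)[0][0]

def prefer_source(records, priority=("salesforce", "xero")):
    # Rule 3: trust sources in priority order, falling back when a source has no address.
    for source in priority:
        candidates = [r for r in records if r["source"] == source]
        if candidates:
            return latest_updated(candidates)
    return None

xero_only = [r for r in addresses if r["source"] == "xero"]
```

On this sample data the rules disagree: the latest-updated and source-priority rules pick "1 Queen Street", while the most-frequent rule picks "1 Queen St", which is why the choice of rule has to be made explicitly.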
Nigel Vining: So you’re effectively using rules to remove the complexity and ambiguity in your source data. By applying the right rules, you’re effectively getting to what you call master records, and it’s just a rule.
Shane Gibson: Because everything we do is just a rule. And if you’re a consumer of the data, you typically don’t want to have to choose, because you don’t understand the rule you need applied. Normally, the work to figure out what’s right should be done by the data magician in the event layer before the data becomes consumable.
Nigel Vining: You just want to see the unique list of customers and the current address details.
Shane Gibson: And I always thought this wasn’t going to happen that often. But I read a research paper a little while ago, and what they were saying is that organizations with between zero and 50 employees have, on average, 7.8 source systems, 7.8 data factories, which blew me away. Over 50 employees, I can’t remember the number, but it was a lot more. So if you think about 7.8 systems, how many of those are storing information about the customer? It would be unusual not to have to combine this data at some stage.
Nigel Vining: Absolutely. So this is where the concept of concepts and rules would really be beneficial for these organizations, to clean up their data and present it back in, I guess, what you’d call a single view of customer.
Shane Gibson: And because we take an agile approach, we can do it incrementally. So if reporting on customers from Salesforce is the highest-value piece of information for the organization right now, just create the concept of the Salesforce customer, start reporting on that, and don’t worry about Xero. But if combining the customer records across two or three systems is your biggest problem, your biggest pain in the bum right now and the highest value, or de-duping the customers from one system is your problem, again, you just write rules that create concepts that deal with that problem, and we have the flexibility to change it later on. One of the other recipes we can use is that you can, from the beginning, just create one single customer concept and not do source-specific ones. When the keys come in, always write a rule to make sure that the identifiers for the customers are unique and are combined, and we don’t hold them separately. That’s okay as well. We tend not to do that ourselves, because we want to incrementally show value to our users. So creating the concept of a Salesforce customer and allowing them to count it and query it has value for them, and we can do that quickly. And then doing Xero on its own has value for them, and we can do that quickly. And they can use that data while we figure out the rules to create that single list. So that’s one of the things we’re really keen on from an agile way of working: you can chunk it down into small bits, and when you get a rule that takes you some time, there’s some value there already for the consumer.
Nigel Vining: Great.
Shane Gibson: So that’s the concept of a concept. We struggle with that one a little bit, because the term leads to phrases like the concept of a concept, so I was always wondering if we were going to come up with a different term. But the notion that we have a concept that we manage, that we want to count, and that has value, is pretty useful. And the cool thing is, I’ve worked with teams before where we have our analysts modeling our data for us, without falling back to that age-old pattern of the grumpy Enterprise Data Modeler who sits in his little window office for nine months to come out with that perfect model that nobody can implement and nobody can use. So for us, it’s about model fast, model well, and then change it when you need to.
Nigel Vining: Absolutely. Fantastic. So that’s concepts. Next time, we’ll dig into details, which are the second part of this equation.
Shane Gibson: Dig down into the details.
Nigel Vining: Detail of details.
Shane Gibson: Excellent. Well, thank you, and we’ll catch up soon.
Nigel Vining: Thank you.
Shane Gibson: Let’s make magic happen.