3 agile architecture things
Join Shane and guest Brian McMillian as they discuss the art of architecture in an agile data world. We discuss 3 things:
1. the 4x approach
2. data vault
3. everything is code
Podcast Transcript
Read along you will
PODCAST INTRO: Welcome to the “AgileData” podcast where we talk about the merging of Agile and data ways of working in a simply magical way.
Shane Gibson: Welcome to the AgileData podcast. I’m Shane Gibson.
Brian McMillian: And I’m Brian McMillian.
Shane Gibson: Hey, Brian, good to see you today, and thanks for coming on the show. In this one, we're going to talk about three things, and we're going to go on a really interesting journey. We're going to talk about the thing called 4x; Data Vault, and why it's the best data modeling process in the world, for now; and everything is code. But before we get into it, why don't you give us a bit of a background for people who don't know who you are, or your journey into this Agile and data world?
Brian McMillian: Thanks, Shane. My name is Brian McMillian, and I am a career Enterprise Architect; I've been practicing enterprise architecture for about 15 years or so. My area of expertise is really in the data and business intelligence space, and I've worked primarily in large companies, on everything from manufacturing to IT providers to the defense industry.
Shane Gibson: Excellent. Enterprise Architects in Agile and data, that's often seen as an oxymoron. It's seen as an anti-pattern, because a lot of enterprise architects tend to treat themselves as the cop at the end of the road. The work comes down, and they come in and go, oh, bad work, bad work, go back to the beginning. So it's really great to actually talk to somebody that's in the enterprise architecture space, from large organizations, but has more of an Agile mindset. So what's your experience with that? Do you find that a lot of the people with the role or the title of Enterprise Architect are actually bound to a more waterfall type process?
Brian McMillian: Well, I think so. Enterprise Architecture has been around for a long time, at least as long as Agile has. But most Enterprise Architecture in particular is geared towards large enterprise projects and products that are pretty well established, or so large in scope that we think we've got so many moving parts we need to coordinate them. And that lends itself to trying to plan out as much as you can, even though we know that seldom works. My experience with Agile and architecture comes from way back, probably the late 2000s, 2008 or so, where I was on a team building a data warehouse to store server performance metrics, ticketing metrics and call center metrics for Electronic Data Systems. We had just started to do the ticketing work, we had a very large enterprise customer we were onboarding, and their requirements were changing so fast we basically just had to say, "Stop. We can't work the way you're expecting us to work, and we're never going to be able to meet your needs." And that's where I got introduced to extreme programming, XP Agile. And that just lit a fire: we can do this in an Agile way, we can treat the architecture incrementally. And I think that's the big thing; if the architects you're dealing with are not treating their work incrementally, then you have a problem. You can't start from scratch and plan everything. You have to do a little bit of planning, a little bit of implementation, a little bit of planning, a little bit of implementation, and wind back the things that don't work. So that is really important. Now, a little bit about my background. I didn't come at the architecture job from the normal route, which is: I'm a computer scientist, I'm a programmer, I come from the application space, I'm a Systems Administrator, I'm a database person, I'm a DBA; that's where most architects come from. I came from the business side; I've got a degree in economics. On that warehouse team, I was the business guy, and did everything from sales support to working with the customer, to requirements intake, to doing demos for people. So I came at it from that side, and this idea of product lifecycle is critically important to figuring out how you should be behaving. So I'll make a controversial statement here: there's absolutely nothing wrong with waterfall, there's absolutely nothing wrong with Agile, there's absolutely nothing wrong with somewhere in between; it's only wrong if you're applying it at the wrong stage. And that's where this idea of 4x comes in. 4x is a little bit of a misnomer. Kent Beck, who was one of the original signatories of the Manifesto for Agile Software Development, wrote the original unit test framework, and has been very active in the IT community forever. He worked at Facebook a while ago, and one of his jobs was to evangelize the way that Facebook does their work. To hear him talk about it, and we can put some links to this in the notes, he came into Facebook and they didn't operate anywhere near like he expected them to be operating. So he set aside some time to try to learn how they were operating. And what they were doing was operating under this idea of a product lifecycle, which has been around forever; product lifecycle is nothing new. But he coined this idea called 3x.
So the product lifecycle falls into three stages, and it's a sigmoid curve, an 'S' curve. Down at the bottom, low value, basically everything is unknown. This is what he calls the explore phase: you're exploring, you're trying to figure out what's going to work, who your customer is, what they need. That's the explore phase, and that's where all these projects start. Then once you find a customer and you start to get some traction, you start to go into expand, and that's where the curve starts to ramp up exponentially, if you're lucky, or unlucky, depending on how you feel that day. In that expand phase, you're getting more customers, you're providing more value, and things are breaking all the time, so it's constant firefighting mode. Eventually you'll get a handle on your product, that will taper off and start to level out, and that's what he calls the extract phase, where you're extracting the value out of the system; you've reached a comfortable level of customers, a comfortable level of value you're delivering to your customers. And then the 4x piece is one that I think is critically important, which is that you need to exit from there; you need to think about how you shut things down. That isn't part of the original 3x model, it's an addition, but it's critically important, because every product you build is going to have to be turned off at some point. Think about a good example of that: I think the best one is a car manufacturer. They have two to three year cycles where they completely retool the car. If you bought a Honda Accord, let's say, four years ago, and you were looking to buy a new one now, it's a very different car; it has a lot of new features, it is not the same car. Lots of hard product companies, big capital intensive product companies, are really good at this. In the IT world, particularly the enterprise world, we're not very good at this for some reason. We just have things that run forever. They get ramped up, they get turned on, they work, they're stable, and we are terrified to turn them off. In the IT world, we need to start thinking more clearly about that.
Shane Gibson: Yeah, I like that. The exit was the one that piqued my interest, because we've all worked for organizations that have replaced or refreshed the data platform every 5 to 10 years because technology changed. They were on old on-prem servers, big expensive hardware, and they moved into the cloud, or before that they did the client server wave. But often you'll see an organization only move 80% of the workload off the old platform, and the last 20%, which is incredibly hard, still lives there. So now they're maintaining two platforms while getting benefits from the new one, but they just don't want to spend the money to kill the old one. Or think of the number of times we don't monitor the usage of the information products we produce, the dashboards, the reports, the APIs. We just build them, we leave them, and we let them go stagnant. We never monitor whether they're being used, and we never kill them. We never exit those products. They're probably low effort, low cost to maintain, but there's still a cost, there's still an effort, there's the mess it creates, the dependencies. So we need to do that tidy-up, we need to think about that exit. The other thing I find really interesting, apart from this idea of planning to exit the things we're building at the right time, is applying the 4x's to teams that are adopting a new Agile way of working, data and analytics teams. It's almost similar to the forming, storming, norming idea. We explore ways of working, we see what works for us and what doesn't, then we get good at it, and we want to scale, scale the team or scale the processes or scale what we're delivering, and we get to the stage where we've got to automate it, the whole DevOps, DataOps thing, to extract the value and give more value than the effort we're putting in. And then hopefully we don't exit. I was thinking of exit as getting rid of the team, and I wasn't quite sure how to apply that last 'X'?
Brian McMillian: Well, I think you don’t. My recommendation is, so when you start getting into expand and you start ramping things up, somewhere in the middle, so let’s wind it back. So think about processes in ways the team works. The people all work differently in each of these different phases. You have people who are, let’s say, more comfortable in one phase versus another. So let’s just think about like project management, you’re trying to figure out what should we be working on. So initially, and explore, you’re in what I call Pre-Agile, business model canvas, lean canvas type things, you’re just trying to find a customer who has something that you’re interested in, and then work with them, then at some point in later explore, you can start to apply more Agile ways of working. You can start to do a teeny bit of planning or whether it’s a sprint, or however you do that, or if you’re lucky, and you are just moving straight into just straight flow, that’s great. And that will carry you up into expand. Somewhere in expand, you’re going to start to get a handle on things and you’ll be able to prioritize what you need to work on, you’re looking for the constraint, you’re fixing constraints in the middle of that. And you can start to switch over to things like critical chain, project management, where you’re planning out the constraint, working around the constraint planning the work you have to do around that constraint. Everything else is much looser, because you know what’s expected, and you know how important it is that you deliver this feature at this particular time. And whether you’re on track or not, it’s not so much did you hit the date? It’s are you on track, at that middle point, the middle of expand is when you should start thinking about, “Okay, we’ve got a handle on this, what would the next version of this look like?” And you need to peel off some of the team and start them back and explore. And they start thinking about features and functionality that would be in the next platform. And they start doing their work. So you’ve got one ‘S’ curve that comes up. And then you have another ‘S’ curve that’s starting in the middle, and will eventually replace and ideally what would happen is, when the top of the extract hits the expand on the next big thing, you can start to transition your customers to the next big thing. And then that exit now becomes a matter of winding those customers down, winding the infrastructure down. And it’s the same process as an expand, but in reverse. You’re figuring out where your constraints are, and you’re working the things that are not the constraints and turning those off. So you’ve wind things down that way, but it’s the same team the work is just starting to change. And you need to start planning for that. That’s how you grow one version after another version after another version. So that it’s not a problem.
Shane Gibson: As you were talking, I thought about taking it back to enterprise architecture and this idea of Agile architecture. One of the podcasts I follow and love is "Catalog & Cocktails" from data.world, and someone on that show came up with a term a while ago that they use quite often, which I really love. They talk about brakes on a car: brakes aren't there to stop you moving or slow you down, brakes are actually there to make you safe when you go faster. From an Agile architecture point of view, that's the idea: the architecture skill set is not there to put barriers in place, to stop work and put in gates. It's there to put in guardrails. Architects actually have to be in front of the work and go: if we're going to go into that area, and it's new, we're exploring, or we've done some work and we're expanding and then going into extracting, what are the policies and processes that need to be in place to be safe? From that point of view, as architects, our goal should be to make ourselves redundant, not needed by the team for that piece of work. We've worked with them, we've got the guardrails in: you understand the things you need to do, the brakes you need to build to make yourself safe to go faster, so go for it. And then, as you said, there'll be another iteration where there's a whole lot of new unknowns, and the architect comes in again to help the team, or the skills in the team, figure out what the new guardrails are. For me, that's how we should be thinking as architects when we're working with teams that are data and analytics focused and moving into that Agile way of working.
Brian McMillian: Yeah, and each of those phases, again, here's where these phases come into play, each of those phases has different sets of guardrails, with different strengths. Another analogy, if you don't like guardrails, one that I happen to like, is channels and buoys. You're on a boat; a big huge boat needs to stay in the channel. If you're in a kayak, you can go wherever you want. Which is going to be able to carry the most people? Well, clearly the bigger boat. And as the value increases, one of the ways you manage the risk is you start to bring those guardrails in, those buoys in: stay within the buoys, and dig the channel a little bit deeper, so that you have the capabilities you need. One of the things that's really important for architects to keep in mind is that you need to be loose in explore, and very strategic in expand. You may have some set of guardrails, say a set of technologies you're going to work within, and you need to be very critical about where the constraints are that are popping up, and be ready to shift if needed. One of the sayings I'm rather fond of is: all software sucks; your job is to figure out where it's going to fail first, and work around it. It doesn't matter what product you pick, or what strategy or pattern you're following, at some point that pattern is going to break down for you, and you need to be ready to recognize it and decide what to do about it. But as you move up that curve, you need to be much more strict. Data governance is a huge deal when you've got a product in the extract phase: everybody's counting on you, the data better be there, it better have a good solid interface, and you should be paying attention to it. Observability is critically important. In the middle, you may have to loosen it up a little bit, because things are changing so quickly.
Shane Gibson: There's a term out there that people adopt around professional laziness, but I tend towards positive words, so I call it the happy path. It's this idea that if there are things a team can leverage that are fit for purpose, do the job they need them to do, and are readily available, they're going to adopt them. It's a happy path: that works, I don't need to worry about it, I can go solve some of the problems that haven't been solved. So again, from an Agile architecture point of view, what we want to do is help them build these things. As you said, they're the buoys that keep them in their lane, the guardrails, and the teams will adopt them because it's easier for them; it makes sense. They're not going to go and reinvent, well, let me rephrase that, most people won't go and reinvent. I have worked with developers whose first words on a job are typically: it doesn't matter what the code looks like, I'm going to rewrite 70% of it, because that's the mentality. But we just vote them off the island. So this idea that if things are in place and they're reusable, people will reuse them, because it just makes sense, and everybody is a good person doing what's seen as right for the organization and their teams. Let's move on from that one, because I was tempted to deep dive into data governance, but that's a whole podcast on its own, which I'm happy to have, because for me data governance today is still decisions by committee before we even get to the explore stage, and I hate that. But before we do a deep dive on data governance, let's move on to Data Vault, something I love. For me, there's been a wave of modeling techniques that have come out over the years, and Data Vault is currently the best technique I can find for managing data, for a whole raft of reasons. It has some problems, and hopefully somebody will come out with new modeling techniques that give us the flexibility of Data Vault and solve some of the problems it presents. But from your point of view: Data Vault, why, or why not? What's your view?
Brian McMillian: The Data Vault modeling technique, I just love. I think it solves a lot of really difficult problems. The first is that it gives you a framework to think about the data problem you have, the lifecycle of your data, from acquiring it to actually making it fully functional and usable to whoever the end user is. It provides a bunch of natural layers to cut the problem, and the cognitive load for a developer, down to bite sized pieces. And it does that in a couple of ways. If you're familiar with Data Vault, probably the first thing you think of are the hubs, links and satellites, the three basic table types. The hubs represent your entities: your customers, products, sites, whatever the entities are that are important to your business, they get expressed in a hub, and every customer gets a record in the hub. You can look at the hub table, run a count(*), and see how many customers you have, which is really useful metadata about the system. If you come into a Data Vault, the first thing you do is look at all the hub tables; now I know the domains that comprise this warehouse. Then, on the other end, you have the satellite tables, which are the details, and those vary; each domain is going to be different. The satellite isn't very prescriptive at all; the hub table structure is very prescriptive, the satellite table is not, except for how you link it and a couple of other neat little fields they require you to have. And then the link table is just a place for you to express relationships. Here's one that happens all the time: I've got two customer records, but they're the same customer. Well, I can express that "same as" relationship in a link table, so that when I go to query the information, I can say, I want the standardized customer name from the link table to join up with the detailed attributes that are in my satellite table. That's the basic structure. One of the great things for people who are developing against the Data Vault model, or consuming it, is they only have to learn those three. If you can learn the capabilities and purposes of those three tables, you're good to go; it's very easy to maneuver around the warehouse. Now, you may have some extended joins to do, which is a complaint about Data Vault. It's not like a fully Third Normal Form system, where you may have lots of data in lots of tables to join across; you basically have three basic table types, and you get the advantage of combining those tables up into a more denormalized shape, bridging different satellite tables across the link tables or the hub tables. But there are lots of joins, which is a valid complaint about Data Vault. Then again, if you're doing anything complicated, you're probably going to have lots of joins anyway, and the traditional star schema helps to eliminate that. If you've only interfaced with a star schema warehouse, Data Vault looks very intimidating and complicated, because there are lots of tables. But if you can't find the data you want in your fact table and your dimension tables in a traditional star schema warehouse, you have to go looking elsewhere, and it may involve building new fact tables or rebuilding the fact table you currently have. Those kinds of changes are really disruptive to the warehouse.
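For readers new to those three table types, here is a minimal sketch of what the DDL can look like. All table names, columns and types are illustrative assumptions rather than anything from the episode, and conventions vary between Data Vault builds.

```sql
-- Hub: one row per unique business key (the entity, e.g. customer).
CREATE TABLE hub_customer (
    customer_hk    CHAR(32)     NOT NULL,  -- hash of the business key
    customer_id    VARCHAR(50)  NOT NULL,  -- the business key itself
    load_dts       TIMESTAMP    NOT NULL,
    record_source  VARCHAR(50)  NOT NULL,
    PRIMARY KEY (customer_hk)
);

-- Satellite: descriptive detail for a hub, appended over time.
CREATE TABLE sat_customer_details (
    customer_hk    CHAR(32)     NOT NULL,
    load_dts       TIMESTAMP    NOT NULL,
    hash_diff      CHAR(32)     NOT NULL,  -- hash of all attributes, for change detection
    customer_name  VARCHAR(100),
    customer_email VARCHAR(100),
    record_source  VARCHAR(50)  NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);

-- Link: a relationship between hub keys, including the "same as"
-- relationship Brian mentions between two records that are one customer.
CREATE TABLE link_customer_same_as (
    link_hk           CHAR(32)    NOT NULL,  -- hash of the combined keys
    customer_hk       CHAR(32)    NOT NULL,
    same_customer_hk  CHAR(32)    NOT NULL,
    load_dts          TIMESTAMP   NOT NULL,
    record_source     VARCHAR(50) NOT NULL,
    PRIMARY KEY (link_hk)
);
```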
Modifying and migrating a fact table is a big undertaking, and if you've got the data broken up into smaller, more granular tables, it's much easier to make these changes. So you've got these three table types, very easy to learn, very easy to get your head wrapped around, and they're very flexible; it doesn't really matter what kind of data you have. That's one set of layers, or cross cuts. The next one, which I think is even more valuable, is that the Data Vault is split into layers. Whether they are schemas, or individual databases, frankly, they could even be different database engines, whatever is appropriate. Your raw and staging area could be in a file based system, whether it's a file share or something like HDFS; Data Vault doesn't care. Data Vault just says you need to have a staging area, and then an area called your raw vault, where you're storing the history of all those loads. Data comes in and gets loaded into the table in an append only format: you have a brand new record, it gets entered into the table; that record gets updated, it's a new row, you don't update the current one you have. That is really powerful, particularly when you're exploring your datasets, because everybody who has spent any time working with data knows there are things you just don't know until it's too late, usually. Being able to have a raw satellite table where you're stacking up your loads incrementally, just appending the change datasets, lets you go back; it lets you unwind the data. What did this data look like last month? That's really powerful from a reporting standpoint: I can tell you what the dataset looked like, that it doesn't look like that anymore, and that here are the three changes that happened over the last month. For troubleshooting it's also valuable, because you can see what changes have occurred. But that append only model is also very easy to work with; the type of SQL you have to write to load the table is really simple. You don't have to go look up keys, which is something else in Data Vault that really makes the difference between Data Vault 1 and Data Vault 2. One of the big changes was that they had to support NoSQL databases, which meant they went from sequence keys to hash keys. Figure out what your business key is, and hash that value. That means you can now load everything in parallel; you can load your hub and your sat in parallel, because you're calculating the key on load, so the rows just go in seamlessly, which speeds up your loads significantly and makes loading the database much easier. It also makes it easier for people who are not that comfortable doing the DBA, ETL work to get their heads wrapped around what they have to do: I need to load the table, I need to follow these simple patterns, and I can load my own table. Which brings us to the next layer. After that raw schema, you've got a business schema, which is a new thing introduced in Data Vault 2.0. The business schema is where you apply the business logic. In the raw schema, your tables, whether they're hubs, links or sats, have no business logic applied; it's the data you got from the source, as is, unmodified. But that's never really what you want.
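Here is a hedged sketch of the insert-only, hash-keyed loads Brian describes, reusing the illustrative tables above. The staging table name, the MD5-based hashing and the key standardization are assumptions, and the exact syntax varies by engine. Because the key is calculated from the data on load, neither statement waits on the other: the hub and sat loads can run in parallel.

```sql
-- Load the hub: insert only the business keys we have not seen before.
INSERT INTO hub_customer (customer_hk, customer_id, load_dts, record_source)
SELECT MD5(UPPER(TRIM(s.customer_id))), s.customer_id, CURRENT_TIMESTAMP, 'crm_system'
FROM stg_customer s
WHERE MD5(UPPER(TRIM(s.customer_id))) NOT IN (SELECT customer_hk FROM hub_customer);

-- Load the satellite, append only: a new row only where the attributes changed.
INSERT INTO sat_customer_details
    (customer_hk, load_dts, hash_diff, customer_name, customer_email, record_source)
SELECT MD5(UPPER(TRIM(s.customer_id))),
       CURRENT_TIMESTAMP,
       MD5(CONCAT(COALESCE(s.customer_name, ''), '|', COALESCE(s.customer_email, ''))),
       s.customer_name,
       s.customer_email,
       'crm_system'
FROM stg_customer s
LEFT JOIN (
    -- latest row per key currently in the sat
    SELECT customer_hk, hash_diff,
           ROW_NUMBER() OVER (PARTITION BY customer_hk ORDER BY load_dts DESC) AS rn
    FROM sat_customer_details
) cur
  ON cur.customer_hk = MD5(UPPER(TRIM(s.customer_id))) AND cur.rn = 1
WHERE cur.hash_diff IS NULL   -- brand new key
   OR cur.hash_diff <> MD5(CONCAT(COALESCE(s.customer_name, ''), '|', COALESCE(s.customer_email, '')));
```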
So you make those transformations from raw into the business schema and apply whatever business logic you need. This is where the tables kind of balloon, but it's worth it. You can have a hub, let's say you're a manufacturer and you have serialized and un-serialized parts. Every analyst querying that data has to write a common table expression, or put a view together, or something, to filter down to the parts that are serialized. Say they're doing some quality analysis, and they're only interested in the serialized parts, because those give a much more fine grained view of the data. It would be much easier if there was just a table called "Hub Part Serialized 001", the first version of that table, and a "Hub Part Serialized 002", a different version where something changed in the business logic. It's like an API, where you've got version one of the API, version two, version three. Now you have multiple tables, but they provide different business logic, and that reduces the risk of you making a change someone didn't anticipate. If you're consuming the first version of the table, and there's a second version, your things don't break; you get to make the choice. This is something we as data architects don't do a very good job of: we're really good at planning our migrations, but we don't think about what the simplest way would be. Just duplicate the data. Storage is really cheap, so duplicate the data: version one of the API, version two of the API. Maybe we should treat all of our tables as APIs, or as individual data products, and start thinking about the lifecycle of individual tables; that gets really interesting. So in that business schema you've got lots of tables, but they all have different business logic applied to them. It makes it much easier for an analyst to come in and go: I want this flavor of the data and that flavor of the data, and I know the business logic applied in those tables is clear and understandable, and the teams managing those tables stand behind them. I don't have to go write some custom code with some weird query that may be above my skill level; all I have to do is join the tables together and things get filtered out.
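A sketch of that "tables as versioned APIs" idea, with the serialized-parts example. The schema and column names (raw, biz, is_serialized, serial_scheme) are hypothetical, invented to illustrate the pattern.

```sql
-- Version 1: business logic baked in, so analysts stop re-writing the filter.
CREATE TABLE biz.hub_part_serialized_001 AS
SELECT p.part_hk, p.part_number, p.load_dts, p.record_source
FROM raw.hub_part p
JOIN raw.sat_part_details d ON d.part_hk = p.part_hk
WHERE d.is_serialized = TRUE;

-- Version 2: the business logic changed (say, a new serialization flag).
-- Version 1 stays in place; consumers migrate when they choose, like an API.
CREATE TABLE biz.hub_part_serialized_002 AS
SELECT p.part_hk, p.part_number, p.load_dts, p.record_source
FROM raw.hub_part p
JOIN raw.sat_part_details d ON d.part_hk = p.part_hk
WHERE d.is_serialized = TRUE
   OR d.serial_scheme IS NOT NULL;
```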
Shane Gibson: I think you're right there. As technologists, we love complexity. We love to make things of beauty. We love to take your problem and just beat the snot out of it with the most beautiful solution we've ever seen, when sometimes a simple solution would do. And I think that's the same with Data Vault. For me, Data Vault is a set of patterns, a set of patterns that have value because they are a set of good patterns we can use. But the Data Vault community has fallen into the same problem I've seen with the Scrum community. With Scrum, there are some people who say: here's the Scrum guide, and if you don't follow the Scrum guide to the letter, then you're not doing Scrum. And I don't agree with that. Well, I agree you're not doing Scrum, but I don't care. I'm saying there's a bunch of patterns embedded within the Scrum guide; they all have massive value, use them if they fit you, and if they don't, don't use them. Same with vault. There are a bunch of immutable patterns in vault: you have a hub, you have a sat, you have a link, and that's about it. Everything else really is a choice as far as I'm concerned. And the thing I like, as you articulated, is that with hubs, sats and links there are only three types of tables, so there are really only six patterns: create hub, load hub, create sat, load sat, create link, load link. That's a really simple coding pattern. Now, there's lots of complexity once you start throwing it at real data, where we're talking about bridge tables and same-as links and lots of other things we sometimes do to make it easier when we have complex data, but they're just choices. And so what happens is we get into these religious arguments. In our product, AgileData, under the covers we use vault-ish. We don't use the terms hubs, sats and links in the product, because what we found is our audience is an analyst, a data literate person on the business side, not the engineering or technical side, and those words lose them. So we call them concepts, which is a hub. I have a concept of a customer, a concept of a product; they're business concepts, and they can go and say, I can see a customer, I can see a product, I can see a store. We have detail about those concepts, which is a sat: the customer has a name, the customer has an age, the customer has a gender, the store has a name, the product has a SKU. And then we talk about events, rather than links, because we say: if you looked at your organization, what business events or core business processes can you see? Oh, I can see a customer ordering a product. Cool, that's three concepts, or three hubs, embedded in there. So for me that's really valuable; those patterns are just so simple, and so flexible. We deviate a little bit from the DV 2.0 method: we actually use a history layer that isn't vaulted, what you'd typically call a persistent staging area. We don't vault that data, we don't break it out into hubs and sats, and there's a whole lot of reasons we don't apply that pattern. But we're okay with that; it's just a choice we made, and it has costs and consequences. Same with the last layer: we spent a large amount of time on what we call the consume layer, what people typically call presentation. We said, okay, we don't want our analysts to have to understand how to join all these tables together.
So what we can do is create views or tables that make them consumable: we can denormalize a link out to its hubs, and denormalize that out again to include all the sats, and give them one big, wide, ugly table. Now, performance is an issue sometimes, but the cloud databases we have now pretty much eat anything up to a terabyte without even blinking, and when you denormalize, it's just a thing of beauty. So for me, Data Vault is a pattern. There are a small number of immutable patterns; if you're not doing those, then you aren't doing Data Vault, the modeling technique. And then there's a whole lot of ways-of-working patterns that have come out in 2.0, and from my point of view, if they have value, adopt them; if they don't, find something else that works for you. The thing people often say is: I prefer star schemas over Data Vault, because they're easier for the users to query. And I'm like, great, do a Data Vault, and then put a star schema on top of it.
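A minimal sketch of that consume-layer denormalization, assuming hub, sat and link tables along the lines of the earlier examples; all object names are illustrative. A real view would also filter each satellite to its latest row per key, which is omitted here for brevity.

```sql
-- One big wide table: a link denormalized out to its hubs and their sats.
CREATE VIEW consume.customer_orders AS
SELECT c.customer_id,
       cd.customer_name,
       p.product_id,
       pd.product_name,
       l.load_dts AS order_loaded_at
FROM biz.link_customer_product l
JOIN biz.hub_customer         c  ON c.customer_hk  = l.customer_hk
JOIN biz.sat_customer_details cd ON cd.customer_hk = c.customer_hk
JOIN biz.hub_product          p  ON p.product_hk   = l.product_hk
JOIN biz.sat_product_details  pd ON pd.product_hk  = p.product_hk;
```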
Brian McMillian: Yeah, and that's what the standard says. You've got your business vault, then you have your info vault, and in the info vault you'll have big wide tables if that's appropriate, or traditional dimensional star schema data marts, whatever is appropriate. And that's really important to know. I think that layering in the Data Vault standard is one of the things that's really important to break out. You've got your raw persistent staging area, which, a question for you on your history layer: are you doing an append only model there?
Shane Gibson: Yeah, so we mirror the source system tables. That's the structure of our history layer, and we do inserts only, so we're effectively inserting new change records as we see them. We're just not vaulting it; we're not breaking that table structure out into a hub and a sat like the doctrine says. But we're not doing it for a reason.
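A hedged sketch of that insert-only history (persistent staging) load: keep the source table's shape, and append a row only when something changed. The landing/history table names are assumptions, and it presumes a row_hash column computed over the source attributes at landing time.

```sql
INSERT INTO history.crm_customer
SELECT s.*, CURRENT_TIMESTAMP AS loaded_at
FROM landing.crm_customer s
LEFT JOIN (
    -- the most recent history row per business key
    SELECT customer_id, row_hash,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY loaded_at DESC) AS rn
    FROM history.crm_customer
) h ON h.customer_id = s.customer_id AND h.rn = 1
WHERE h.customer_id IS NULL        -- brand new record
   OR h.row_hash <> s.row_hash;    -- record changed since the last load
```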
Brian McMillian: I will agree with you, and I'll tell you, I've done the same thing, and in fact it's in the book. In the raw layer, that first table that gets staged and built, that first sat table, there are no hub and link tables built, because frankly, I don't see the need. In most cases, what you want to store is the history to go back to. Now, it probably makes sense to break out hubs and links in the business layer, because you're trying to apply some business value, and there may be some value in saying, I have to alias this entity so that we get the right product names. But it is optional. And I see almost no use, particularly for link tables, in the raw layer. I don't see a use for that.
Shane Gibson: I spent an inordinate amount of my time a while ago trying to understand raw versus business. And I don't see myself as a dummy. I'm not highly technical, but I can read other people's code, and I think I have an innate skill to see patterns. When I see a pattern, what I focus on is classifying what makes the pattern, and what makes an anti-pattern: when that pattern should be used, and when it shouldn't. And raw versus business killed me. Is raw its own database, and business its own database, and now we're duplicating data? If it is, I don't care, that's cool, we store it twice, and if it has value, I don't mind. But that's not it, because actually the answer is: business is just another set of hubs, sats and links next to raw, with business logic applied. So I'm like, okay, what's business logic? If I'm conforming customer from two different source systems, that's a business thing; that's a high level of complexity, and those are rules where we're making stuff up, we're inferring things that aren't seen in the data, so that's a business rule. But if I'm flagging a part to say in stock or out of stock, is that a raw thing? Is that a business thing? Well, I'm inferring it, so it's a business thing. And I got to the stage where I went: I don't actually care. All I care about when I'm working with a team is: determine what code you're writing and what it's doing, then determine how you identify the pattern of that code, and then how you want to classify it. If you want to classify these types of changes as raw changes, and those types of changes as business changes, good. As long as you stick to that pattern. So, for example, if you say that conforming customer is a business pattern, it's business code, then I should never see conforming of an entity or an identity or a concept in my raw layer, because now you've broken the pattern, now you're confusing me. If you say inferred flags are a business function, part of the business rules, then they should only be in the business layer; I should never see them in my raw vault, because now you're confusing me again. So set the pattern, write it down, agree with everybody, and follow the bloody things. That's where I got to. I've thrown away the raw versus business definition; for me it confuses people and has little value. We should focus on the type of change we're making, the size of that change, and how we describe that as a pattern. And then, when we see it as a pattern, in terms of the x's, how do we move it from an explore pattern to an extract pattern? How do we automate that pattern? Because that's the beauty of vault. As I said, six pieces of code: create or load, times three. The ability to automate that is amazing. That's where the value of the Data Vault pattern is.
Brian McMillian: Yeah, and also, I think one of the things you've been saying is that you don't always have to use all the objects. If you've got the consume layer, and that needs a different structure, a hub and link layer probably isn't useful at the consume layer; it's too complicated. So you've already made that decision. Talking about the 4x's again: if you're in explore, stage the data, start collecting history, and then start building out your info layer, that consume layer, straight away. Don't do anything in the middle; just do whatever it takes to get the data visible to your customer, so that you can start making judgments about the quality of the data you have, and what data you really have. Then you may fall back and do some intermediate modeling, which may have hubs and sats, and then go refactor the info layer, that consume layer. You have to have that flexibility. That's part of getting agility into your data architecture: knowing I've got these structures, these buoys I can steer around, these guardrails I have to stay within, and knowing what's appropriate for me to do right now. I don't have to use all the things, and I don't have to do them in this order. Once you break free of that, and this is the thing that is really spectacularly productive about Data Vault, once you break free from the sequencing of the objects, you get a lot of flexibility and a lot of agility. You can change your mind.
Shane Gibson: Yeah, its ability to absorb change is amazing. And that's why I go back to the Data Vault modeling pattern. The pattern of how to model using Data Vault, and store that data, and load and change that data, is amazing. It is a thing of beauty. Everything else that's been put around it with DV 2, I think, degrades the value of the model. I'll give you an example. DV 2 says: big modeling up front. It says if you don't understand the core business concepts, and you don't model the organization, you get a source specific vault, and that's bad. And I call bullshit on it.
Brian McMillian: I must have missed that chapter. No, it absolutely does say that. But my first experience building a Data Vault from scratch, with a team of people who were not database developers, went through this pattern: get the best data, the data the company is using, start putting it in the database, and start looking at it. We found out very quickly that it wasn't going to work, that the data was bad. The counts of parts that everybody in the company was using were wrong, because the system they were getting the data from had implemented some incorrect logic and was miscounting things. And that's just the way it is. So we scrapped everything. Then we got a new set of data that we thought was much better, and we found new problems, and we scrapped probably half of that. And then we got a third set of data that solved the problems we had, and we went forward with that. That kind of iterative development is really conducive to, forget the modeling pattern you're using, you have to start thinking about it incrementally, and you can't get attached to it. If you do a whole lot of upfront modeling, a whole lot of upfront design, you are going to be sorely disappointed when it doesn't work. The goal is to get to that point as quickly as you can. You'll hit the crappy chasm of doom, that inflection point from explore into expand. It's really painful, you learn lots of things in that area, and it feels like it's never going to get better. But it will get better; you might need to give it a few months. You need to be prepared to just jettison the work you've done, and no amount of planning will prepare you for that.
Shane Gibson: And I agree. You made the point earlier that as architects, when we design the beautiful architecture diagram, it's our baby, and we don't want that baby to get hit: don't you deviate from my diagram. We've got to get rid of that. It's a guess; as soon as we put it into the first engagement, it's going to change. Same with our models. Sitting in a room for six months doing the perfect model is not an Agile mindset; it's not an AgileData mindset. We want to get the data in front of our customers early, to get value. So if we can go and say, we have a concept of a customer, then populate that hub from a source system. Salesforce is where the first master of customer comes from? Grab that, get a hub for Salesforce with a customer ID, and a sat with the customer name. Whack that on a report and say to them: here we go, you can now count customers. As at today, you had this many; as at last week, you had that many. That has value. We underestimate how much value that really simple piece of information has: a single answer to the same question of how many customers do we have. That's massive value, and then we can incrementally build out. Now, should we stay with a source hub? No. We should then bring in another system that has customer, and we should conform those. We've got two different definitions; how do we get a single answer to the question of how many customers we have across the entire organization? That's a harder problem, but we build that next. We deliver the value first, and then expand on the value. So for me, vault is cool. It's a cool modeling pattern. It's simple, and it enables us to change, which is what it's all about. But it's just a set of patterns we can adopt the way we want, and very few of those patterns in vault are immutable: hubs, sats and links. My co-founder comes from a more dimensional background, and he's particularly Agile in his patterns, and he keeps saying to me: why can't we just have sats and no hubs? And I'm like, here are the five reasons we always have a hub. So there is a set of immutable patterns; we will never have sats and no hubs, for a whole raft of reasons. But we're running out of time, so let's go on to the last one: everything is code.
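A hedged sketch of the incremental build Shane just walked through, reusing the illustrative tables from earlier; the staging table names and source labels are hypothetical.

```sql
-- Step one: master customer from Salesforce alone.
INSERT INTO hub_customer (customer_hk, customer_id, load_dts, record_source)
SELECT MD5(sf.customer_id), sf.customer_id, CURRENT_TIMESTAMP, 'salesforce'
FROM stg_salesforce_customer sf
WHERE MD5(sf.customer_id) NOT IN (SELECT customer_hk FROM hub_customer);

-- The simple, valuable question now has a single answer:
SELECT COUNT(*) AS customer_count FROM hub_customer;

-- Step two, later: load the second system into the same hub, then conform
-- the two definitions (e.g. via a "same as" link, as sketched earlier).
INSERT INTO hub_customer (customer_hk, customer_id, load_dts, record_source)
SELECT MD5(erp.customer_code), erp.customer_code, CURRENT_TIMESTAMP, 'erp'
FROM stg_erp_customer erp
WHERE MD5(erp.customer_code) NOT IN (SELECT customer_hk FROM hub_customer);
```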
Brian McMillian: This is a good transition. So, let's get into it this way. We've been talking about upfront planning, and breaking free of that upfront planning process. I will contend that a lot of the upfront planning people do is forced upon them by the tools they are using. If you came from a traditional enterprise data warehousing background, let's say Oracle was your data warehouse platform and Informatica was your ETL tool. What does Informatica have going for it? It's got a nice drag and drop interface where you can grab an object, configure some parameters in a nice GUI, connect that to some other thing, connect that to some other thing, and build out your pipeline visually. What about your data model? Well, before you do the ETL work, you need to have a data model in place, because the tool is not going to work otherwise; it needs something to connect to, it needs to get some metadata from your database in order to work the way it needs to work. So that means you need a data modeler. And that data modeler is going to do the ER diagrams in what tool?
Shane Gibson: Sparx EA.
Brian McMillian: Yeah, that works.
Shane Gibson: Because we're picking the big, ugly, hard tool.
Brian McMillian: So your data modeler is going to go in and model the data. How are they going to know what to do? They're going to work with the business analyst. So you start to build out this waterfall pattern, and a lot of it is reinforced by the tools: you have to model the database before you can use the ETL tool. That puts you in a waterfall pattern that makes it difficult for you to make changes. So let's take a different view; we're going to talk about treating everything as code. What would that look like if you treat those steps as code? We need to get some data into the database, so we're going to model that. Well, in your example, we're going to copy the source tables over from the source system into our staging area, and we're going to do an append only, insert only load. Like you said, the SQL code to do that is pretty easy. Your bulk loader may create the table for you automatically; you may just point it at the dataset and say bulk load this data in, and it may create the tables for you, it may even create the database for you, depending on the tool you have. So that becomes much easier, and that process didn't require you to do any modeling, because the table reflects the data you loaded. Now you have to write the load script; if it's a bulk insert, it's just going to naturally append the data, and then you might have to do some cleaning up afterwards. Then you want to move it to the next table. Depending on the type of movement pattern you have, load your sat, load your hub, whatever it is, that's another piece of SQL code, and it's pretty clear. So now I need to write the SQL to create a table, and the SQL to load the table. If I'm an ETL developer, it's pretty easy code to write. Even if I'm not very experienced and comfortable with it, I can look at the last table that somebody more skilled than me did, and copy that pattern. It's pretty straightforward. And that's a revelation to someone who's only ever queried databases, someone with good SQL querying skills: it is not a heavy lift to learn how to create a table and how to insert data into a table. It's really easy to do, and once you know how, it gives you a lot of power to do things like create your own tables, and it's very clear what's going on. And it doesn't require a third party tool; it's code. Now, the advantage of being code based: your ETL is in code, at least the tasks you're executing are in code, and the DDL statements creating the database tables are all in code. One of the features the ETL and data modeling tools do have going for them is that they're very good at automating the table migration process. They take the data from table one, put it in a temp table, restructure it into the new shape, and load it back in; you've migrated the table to the new structure. That's one reason why you have these ETL tools. But if you adopt the pattern that just says, I have version one of the table and version two of the table, it's pretty straightforward. The first thing you've got to do is move the data over from one to two, that transformation, but that's a one time job. And then I've got a job to load table one, and a job to load table two.
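A sketch of that "no in-place migrations" pattern: version two is a new table, the migration is a one-time insert, and both load jobs then live side by side. The table and column names, and the v1-to-v2 restructure shown, are hypothetical.

```sql
-- Version 2 has a new shape (say, it merges first/last name into one column).
CREATE TABLE customer_002 (
    customer_id VARCHAR(50)  NOT NULL,
    full_name   VARCHAR(200),
    loaded_at   TIMESTAMP    NOT NULL
);

-- One-time backfill from v1 into v2; after this, each version has its own
-- load job, and consumers of v1 keep working until they choose to move.
INSERT INTO customer_002 (customer_id, full_name, loaded_at)
SELECT customer_id,
       CONCAT(first_name, ' ', last_name),
       loaded_at
FROM customer_001;
```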
Again, it's all code. And because it's all code, and I didn't need a GUI tool to manage it, I can put it in source code control, which makes it easy for me to share. Try sharing your work if you're an Informatica user; it's not easy, and it's very easy to step on somebody else's project. But once you've got it in source code control, you can start to adopt software development practices like version control and collaborative development. Now, whether you do trunk based development or you go through the whole Git flow PR process, please use trunk for data work. Please put tests in place; those are all code too. You can start to leverage a lot of the software development tooling that's been in place for decades, you're not spending money on these visual tools, and you're not locked into them. It's much easier to do collaborative development, much easier to make incremental changes, and you move a lot faster. That's a big shift.
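A minimal sketch of what "tests are code too" can look like: assertion-style queries kept in source control alongside the load code, where each query should return zero rows and CI fails the build otherwise. The tables are the illustrative ones from earlier.

```sql
-- Test 1: hub keys are unique.
SELECT customer_hk, COUNT(*) AS dupes
FROM hub_customer
GROUP BY customer_hk
HAVING COUNT(*) > 1;

-- Test 2: every satellite row has a parent hub row (no orphans).
SELECT s.customer_hk
FROM sat_customer_details s
LEFT JOIN hub_customer h ON h.customer_hk = s.customer_hk
WHERE h.customer_hk IS NULL;
```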
Shane Gibson: So I’m going to disagree with you strongly and agree with you strongly in the same sentence.
Brian McMillian: Let’s see if I can guess what it’s going to be. Go ahead.
Shane Gibson: So, everything is code. I agree that everything we do should follow the patterns we follow when something is code. You've described some of those patterns. You've described that we should version things: we have a set of things, let's think of them as code, they're running and we want to make a change, and we should version those changes so we can go back, we can go forward, we can see what the differences are. That's a pattern. We should be able to collaborate: we should be able to see what our colleagues and our peers are doing, work together on the same thing, and potentially split off and do parts of it and bring it back together, although that's a pattern with high complexity. Again, I'm not technical, but I can read code, and I hate git merges. I hate merge conflicts, because I look at them and go: why did you let two of us hack the same thing and now make it my problem to fix? Give me a tool that makes my life easier, not harder. But that collaboration, where work is not locked away on a developer's desktop where nobody can see it, that's valuable. Testing: we've learned to test code over many years, but we hardly ever test our data. So again, testing is a pattern: automated testing, regression testing, validation, assertion based testing, those things are important, and we should do them with data; we should treat data as if it was code. So I agree with all that. What I don't agree with is that we can't have a low code interface that provides all of that. What we've got is a bunch of legacy products that haven't actually given us the rigor and all the tools and patterns we need, yet. So that's where I disagree: at the moment, we're stuck in a world where writing code is the only way we get those patterns and features we need, version control, collaboration, testing, the dbt world. I don't think everybody should have to write code, but everybody should get those features when they're working with data. So I agree strongly with the patterns, but I don't agree with the jump to say it requires us to write code. That's just the current state, and we're going to change that with AgileData.
Brian McMillian: I agree with that. There is a point, particularly if you're working from the front end back, where having the low code, no code type of use case is really powerful. I mean, there's a reason that Excel, and we'll just stick with Excel before we go to things like Tableau, there's a reason that Excel is the number one BI tool in the world: it strikes a good balance between giving you the power to do things in code, like functions, and, particularly if you think about the new Power BI features and all the data features now embedded into at least the Windows version of Excel, giving you a very nice low code environment to work with. That's really appropriate for a lot of use cases.
Shane Gibson: But if we look at our vis tools, the Tableaus and Power BIs and Qliks of the world, the previous wave of visualization tools, we don't get a lot of the features we should have if we're treating visualization as code. We don't have to write the code, but we should treat it as code. So it's hard to version a visualization.
Brian McMillian: You can get the XML out of them, that way.
Shane Gibson: Because those tools don't have the features we want, we're forced to extract the code to do what we want. So let's look at them. Can we version them? A little bit; it's got better. We have some ways of getting versions of those visualizations and rolling forward and back, but it's not great. Can we collaborate on them? Yes, typically now, well, apart from Power BI and its desktop thing, we've got toolsets that are server based, browser based, with some collaboration, so we can start working together. Can we test them? Hell, no. Show me a vis tool that allows me to write a test that says the number in that KPI widget is actually right. There are no testing frameworks for those vis tools.
Brian McMillian: Well, you can write a report that tests the data that underlies that dashboard.
Shane Gibson: I can write a test for the data that's going into the dashboard, but I can't write a test to confirm the number on that dashboard at a point in time is right, that nobody's applied a filter or inadvertently added another calculation to skew it. So again, if we say the pattern is everything is code, and that the rigor we bring to the way we develop software, the DevOps, all that stuff, should be brought to the data world, then I agree with you. At the moment we're in the chasm again. We're going to jump over it, and a bunch of new tools are going to turn up that adopt that. But again, my view is it doesn't actually have to force us to write code; it just has to give us all the features we get when we do write code.
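A hedged sketch of the closest thing available today, along the lines Brian suggests: assert that the measure feeding the dashboard widget matches an independent calculation over the underlying data. All object names are hypothetical, and, as Shane notes, this still cannot catch a filter applied inside the vis tool itself.

```sql
-- Returns a row (failing the test) only if the consume-layer KPI drifts
-- from a recomputation against the history layer.
SELECT 'revenue_kpi_mismatch' AS failed_test
FROM (SELECT SUM(order_amount) AS kpi
      FROM consume.revenue_summary) c
CROSS JOIN (SELECT SUM(amount) AS raw_total
            FROM history.orders
            WHERE status = 'complete') r
WHERE ABS(c.kpi - r.raw_total) > 0.01;
```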
Brian McMillian: If we can get there, that would be fantastic. You can envision a world where you get the best of both. And I think there are a lot of companies, if you look at some of the modern data stack companies, there are quite a few moving that way, particularly in the orchestration piece; I see this a lot. You've got the ability to build your data pipelines out visually, and there's code behind it, there's metadata behind it to do things. But that is a lot of work, and we've got a long way to go before we get there.
Shane Gibson: And just looking at the time, let's close it out by moving back to this idea of 4x and architects. What the modern data stack has given us is a whole lot of new, cool capabilities. It's great. But what it's also given us is an infinite amount of new complexity. In the old days, we used to buy one or two tools to rule them all. Now we have to take 15 to 20 different products, moving parts, open source, software as a service, platform as a service, and cobble them together to give us all the patterns we need. So it's kind of a great time and a bad time for data and platform architects in the data and analytics space, because our world's become more complex. But hey, what do we love doing? We love solving complex problems.
Brian McMillian: And we do love shiny things.
Shane Gibson: Yeah, and ideally we try to solve them with simplicity. And more importantly, think about the 4x's. We're at the explore stage, and that's cool, but how are we going to exit some of this stuff when the next wave of technology comes? Some of the things we're building will, in 5 to 10 years, have to be exited and replaced with things that have more value. Think about how tightly coupled, how interchangeable, the things we're designing are, because new stuff will turn up that has massive value, and we need to be able to exit the thing at the end of its lifecycle and introduce the new thing without rebuilding everything from scratch. So 4x, for me, is a nice set of patterns.
Brian McMillian: Great, thanks. It's really changed my whole view on how I think about problems. It's the first thing I introduce when I talk to someone: before we talk any more, we've got to talk about this 3x, 4x thing. And then other things will make sense from there, because that's the perspective I'm bringing.
Shane Gibson: And it gives us a shared language. Again, with Data Vault there are only three things you can talk about, so there's only a certain amount of language you can use, and that shared language means we're on the same page to begin with; then we deal with all the complexity and uncertainty we have. Using the 4x's as a baseline for a shared language, where are we at, which 'X' are we in right now, is a great way of thinking. So thanks for your time. It's been awesome.
Brian McMillian: Thank you.
Shane Gibson: And we’ll catch everybody later.
Brian McMillian: Thank you. I appreciate the opportunity.
PODCAST OUTRO: Data magicians, that was another AgileData podcast. If you'd like to learn more about applying an Agile way of working to your data and analytics, head over to agiledata.io.