Data Layers, the magic of the right data in the right place
Join Shane and Nigel as they discuss what data layers are and how we apply them as a core part of the AgileData.io Agile-Tecture.
Podcast Transcript
Read along you will
PODCAST INTRO: Welcome to the “AgileData” podcast where Shane and Nigel discuss the techniques they use to bring an Agile way of working into the data world in a simply magical way.
Shane Gibson: Welcome to the AgileData podcast. I’m Shane Gibson.
Nigel Vining: I’m Nigel Vining.
Shane Gibson: And today we thought we would have a chat about what we call areas in agiledata.io, or what would typically be called data layers in the world of data. So, in the old days, when we did dimensional data warehouses, there were some standard layers within your data platform that pretty much everybody knew, understood what they were doing, and used to build out the data warehouse. And then we went into the big data bollocks, and we got a whole lot of new layers and a whole lot of new ideas that were not really proven. And what we're seeing now is a move back to more of a managed layer approach within data platforms, with some areas of variation, where some of the agility you can get out of that data layer type of behavior has some value. So we thought today, what we'll do is have a bit of a chat about the areas or the layers that we use within the platform, and why we're doing it that way. And some words of wisdom from Nigel about how the hell he engineered and plumbed all this. Sound good?
Nigel Vining: Sounds good to me.
Shane Gibson: Alright. So for us, the first layer or area that data turns up in is a bucket. We made a decision that regardless of how we acquire the data, whether we're pulling it off a software-as-a-service application via an API, or we're hitting an internal database and bringing the data across, and whether we're using Change Data Capture or snapshots, landing the data using the same pattern, into the same area, has value for us. So for us, we land that data as either CSV or JSON into a bucket. And the first time we see it is when that data is actually dropped into that bucket.
Nigel Vining: And the benefit of the bucket is that it gives us a common space, so we can leverage a common archiving and error-handling pattern. Basically, files turn up in the bucket and we load them. If they load successfully, we say great, and we archive that file and put it aside. If it doesn't, we move that file off to an error bucket, where we can retry and reprocess. But it gives us a common place to put files and a common place to archive files, right at the start of the process.
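A minimal sketch of the load-then-archive-or-error flow Nigel describes here, assuming Google Cloud Storage and BigQuery purely for illustration (the episode never names the underlying platform); the bucket and table names are hypothetical, and only the CSV path is shown:

```python
# Illustrative only: bucket names, table name, and the platform choice are assumptions.
from google.cloud import bigquery, storage

LANDING_BUCKET = "acme-landing"   # hypothetical bucket names
ARCHIVE_BUCKET = "acme-archive"
ERROR_BUCKET = "acme-error"

def process_landing_files(landing_table: str) -> None:
    gcs = storage.Client()
    bq = bigquery.Client()
    landing = gcs.bucket(LANDING_BUCKET)

    for blob in landing.list_blobs():
        uri = f"gs://{LANDING_BUCKET}/{blob.name}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,   # JSON files would use a similar config
            skip_leading_rows=1,
            autodetect=True,   # first-cut structure; profiling happens later in the layers
        )
        try:
            # Load the file as-is; no data quality rules are applied at this point.
            bq.load_table_from_uri(uri, landing_table, job_config=job_config).result()
            dest = gcs.bucket(ARCHIVE_BUCKET)   # success: archive the file and put it aside
        except Exception:
            dest = gcs.bucket(ERROR_BUCKET)     # failure: park it for retry and reprocessing
        landing.copy_blob(blob, dest)
        blob.delete()
```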
Shane Gibson: And so when you talk about error handling, one of the things we say is load the data as much as we can; we try not to reject data. So we don't apply any data quality routines and (Inaudible 03:28) and put those records into an error-handling bucket that an analyst then has to go and do some work on before that data turns up. But there are scenarios where the data isn't fit for purpose for us, in terms of what I talk about as data mutation. So have you got an example of where we'd actually error a file or have it fall out?
Nigel Vining: So a file we would commonly error is where we've been happily receiving a daily customer extract, for example, and it's got 20 columns and the data types are all known. And then one day that customer file turns up, and now it's got an extra five columns, or some of those data types have changed. So the first thing the parser does is go, this doesn't look like what we've been getting; I'm going to put it aside and raise that we need to do something with this file.
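A small sketch of the kind of schema "mutation" check Nigel describes, comparing an incoming extract against the shape last seen and flagging it for the error path if it differs. The expected-schema dictionary, column names and file path are hypothetical, and the real AgileData check is not shown in the episode:

```python
# Illustrative only: EXPECTED and the column names are made up for the example.
import pandas as pd

EXPECTED = {"customer_id": "int64", "name": "object", "signup_date": "object"}  # ...the 20 known columns

def check_for_mutation(path: str) -> list[str]:
    sample = pd.read_csv(path, nrows=1000)   # a sample is enough to infer types
    issues: list[str] = []

    extra = set(sample.columns) - EXPECTED.keys()
    missing = EXPECTED.keys() - set(sample.columns)
    if extra:
        issues.append(f"new columns: {sorted(extra)}")
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")

    for col, expected_type in EXPECTED.items():
        if col in sample.columns and str(sample[col].dtype) != expected_type:
            issues.append(f"{col} changed type to {sample[col].dtype}")

    # A non-empty list means: put the file aside in the error bucket and raise it.
    return issues
```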
Shane Gibson: And that's actually one of the things we've got on our backlog: to bring some analytics and some algorithms in to automate that process. So when we see certain types of mutation, a mutation of the schema, how the hell can we get our algorithms to automatically inherit that mutation and bring it through, so that we remove that manual process and it just becomes magical. So those CSVs and those JSON files turn up. Why did we pick CSV and JSON?
Nigel Vining: Because the platform under the covers that we're using natively supports CSV and JSON files. CSV/txt is pretty much the one common format that every database and software-as-a-service product uses as a common file format to export and import data. And increasingly, JSON tends to be the more common one for our customers as well. But CSV/txt has been around forever, and it's still the most common pattern for moving data between platforms relatively easily.
Shane Gibson: And the cool thing was, because we could just land that CSV in and have it plumbed straight through, once we built the file drop capability, people that had a bunch of Excel files, which we know is still one of the largest data factories in an organization, could just drop that file as a CSV and we'd load it. So we'll probably look at maybe some Parquet stuff in the future, if we find there's a bunch of source systems or tools that want to publish data out in that format. But right now, CSV and JSON have got us through. So in true Agile form, only do what has value, and when you need something new, build it. So data comes in and gets put into that bucket. And then the next thing we do is we have a temporal data layer or area called landing, and this is actually hidden from anybody apart from ourselves. So the data is not persisted in there, and it's not actually visible to anyone else. It's really just a small area we use for some plumbing. So why do we do that? Why do we have this temporal little area that doesn't persist forever and that nobody really sees?
Nigel Vining: I guess the benefit of landing is that once we've successfully brought in our file or files from a customer, once they are in tables, then we get a whole lot of benefits around that: the data has got a little bit of structure to it, we can run profiling over a table a little bit easier than over a file, and we can basically have a first cut of what this data is before we move it up through the layers. So it really just gives us the pattern that the data is now in a table, so we can use it a little bit easier rather than having to read it straight out of a file.
Shane Gibson: And so for me, if I go back to my favorite analogy of a restaurant, we're kind of doing a pre-check. We've received the goods, and we want to check that they really are what we thought they were, before we move them into the storeroom and hold them forever.
Nigel Vining: So effectively we've taken the raw input into the kitchen and we've unpacked the box. And now we can actually see what we've got, because it's in a format that's a little bit more obvious.
Shane Gibson: And also, it allows us to apply some caution around data refreshes. So we may receive data, and it may come in every 5, 10, 15 minutes, but the customer may have a requirement where the dashboards or the data are only refreshed every half hour, or they want to deal with it slightly differently. So again, it gives us the ability to do some of those things on the way through.
Nigel Vining: So that's a good point, because what we automagically do under the covers is that every file that comes in, or multiple files, although they're going into the same table, actually goes into a separate partition in that table. So we're effectively logically stacking them up. So even if we don't process them straight away for some reason, basically all those files have been stacked in partitions in that table. So whether we unpack one partition or 20 partitions, an hour or two hours or 24 hours later, that data is effectively preserved how it arrived, and we have timestamped it. So that's another feature of pinning it to a table.
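A sketch of the "stack each file up in its own partition" idea Nigel describes, assuming BigQuery-flavoured SQL and illustrative table and column names; daily partitions are used here as a simplification, and the actual AgileData landing table is not shown in the episode:

```python
# Illustrative only: dataset, table and column names are assumptions.
from google.cloud import bigquery

bq = bigquery.Client()

bq.query("""
CREATE TABLE IF NOT EXISTS landing.customer_extract (
  loaded_at   TIMESTAMP,   -- when the file arrived; preserves arrival order
  source_file STRING,      -- which file these rows came from
  raw_row     JSON         -- the row as received, no reshaping yet
)
PARTITION BY DATE(loaded_at)
""").result()

# Each arriving file is appended with its own loaded_at / source_file values,
# so one partition or twenty can be unpacked hours later, preserved as received.
```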
Shane Gibson: And so in terms of the data layers, it's probably an unusual pattern that we structure the data that early. So we've got some backlog stuff where, if we could figure out a way of using a more variant data type as we bring it into the landing area, we may make it less structured, and then actually move the parsing of that data into structure further up the layers. But that's actually one of the key things: as we talked about in the last podcast about refactoring, we know that we may end up refactoring this particular feature further up the layers later, and that's okay, because it's designed using the PacMan analogy, so we can just move it and we're okay. So data comes in, we hold it temporarily, we profile it, we make sure that we understand the structure of it and that it's safe for us to move on. And then we move it into history, and history for us is persistent. It is the single source of corporate memory for all data. So every piece of data that comes in goes into history, and it stays there forever. We never delete data out of history. So let's have a chat about, first of all, what are we doing to structure that data in history? And then why do we keep it forever?
Nigel Vining: So first of all, history is our persistent store. We effectively move the data from landing into history. At that point, we identify the unique business key that holds that data together, and this is probably the most important thing we do in the whole layers process. Because once we have a unique key for the data, we can uniquely identify it anywhere in the process, and it's always safe: we will never accidentally change the grain of that data, and we will never overlook a change to that data, because we can uniquely identify it.
Shane Gibson: Yes, and that's again one of the areas we're probably going to refactor in the future; we're going to move that key further up the layers, to reduce some effort on the analyst in the history layer. But the key thing is that history is racking and stacking all changes over all time, and that enables us to do a couple of things. So at any time, we can say, what did that customer record look like on the 31st of January 2019? What did it look like on the 31st of January 2020? And what's the difference, what had changed? Had the customer's address changed, had the date of birth changed? An address change is probably okay; a date of birth change we probably want to start looking at and saying that the data has mutated. And it also means that at any time we can completely rebuild the upstream layers, which we'll talk about in a minute, the event and consume layers, because we have the original data that came in, the way it came in. And if you've got a data factory like Salesforce or one of those other tools, as you change those records, it may or may not record that change; it may just update and never actually be able to tell you what it looked like the week before or two weeks before. So by bringing that data in and storing all those changes in history, it means we have that immutable corporate memory. And if we need to, we can rebuild everything from scratch. So that gives us a whole lot of safety.
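A sketch of the point-in-time question Shane asks of the history layer, assuming a hypothetical history.customer table that stacks every change with a loaded_at timestamp against a customer_key business key; the real history structure is not shown in the episode:

```python
# Illustrative only: table and column names are assumptions.
# Running this for 2019-01-31 and again for 2020-01-31, then diffing the two
# rows, shows exactly which attributes (address, date of birth, ...) changed.
AS_AT_SQL = """
SELECT *
FROM history.customer
WHERE customer_key = @customer_key
  AND loaded_at <= @as_at
ORDER BY loaded_at DESC   -- the latest version known at that point in time
LIMIT 1
"""
```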
Nigel Vining: Yes, that's the point I was alluding to before: having all the data unlocks the ability to rebuild from history. Because we have it all, we can change any rules above it and rebuild all those layers with essentially no risk, because we're basically recreating layers on top of history, and we have all the history. So there's very little risk in doing that; we don't lose any granularity or changes, because we have all the data to rebuild from any point we want.
Shane Gibson: But if I think about it, okay, I've got some large pieces of data coming in, and they're coming in every day. And we're racking and stacking and storing limitless history. That table, that data, is just going to get bigger and bigger and bigger and bigger, so it's going to get really slow. So is there some engineering magic that you've done to stop my queries getting slower and slower every day?
Nigel Vining: Yes. So under the covers, we use a built-in feature of the platform we're using, which is partitioning of the tables. We effectively keep all the current data, what the data looks like right now in the source system, in the current partition. As those records change, we pack them into history partitions. They're still instantly available and queryable, but we've effectively pushed them off to the side, and the platform stores them in an optimized way so it can grab them later. But we're actually only dealing with the current-state data for 99% of our queries, because that's what users generally expect. So partitioning helps us reduce cost and improve performance massively, while cleanly storing our history off to the side with no real penalty for doing it.
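One possible way to realise the "current partition versus history partitions" split Nigel describes, not necessarily AgileData's exact scheme: partition the history table on its effective_to date and give current rows a far-future sentinel, so current-state queries prune to a single partition while point-in-time queries still work:

```python
# Illustrative only: the sentinel-date scheme and all names are assumptions.
CURRENT_STATE_SQL = """
SELECT customer_key, name, address
FROM history.customer
WHERE effective_to = DATE '9999-12-31'   -- only the current partition is scanned
"""

POINT_IN_TIME_SQL = """
SELECT customer_key, name, address
FROM history.customer
WHERE DATE '2020-01-31' BETWEEN effective_from AND effective_to
"""
```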
Shane Gibson: And in the old days of legacy data warehousing, we used to have a staging area or a persistent staging area. It's not a new pattern, it's not a new layer. I think back then the problem was our databases couldn't handle holding data for all time, so we used to have to archive it or back it up or delete it or move it off. So one of the benefits of the new cloud serverless platforms is they give us technology capabilities we haven't had before, which means we can go back to known, robust patterns but use them in a new way.
Nigel Vining: Exactly, and that comment is a nice segue, because effectively a partitioned table with your historical data partitioned off over time, under the covers, gets moved off to cheaper storage. If you haven't used that historical data within a set period of time, it effectively just moves off to the side. Again, it's still instantly queryable and can come straight back, but if you haven't touched your archived data, whether that's one year, two years, five years or 10 years old, it's effectively all off in cheaper storage, because you're not using it on a daily basis anymore. That's a nice feature the cloud gives us: we don't have to physically worry about temp space, data space or user space filling up on the database, because it's just not a consideration anymore.
Shane Gibson: And that's one of the many pieces of magic we use to keep our costs down and therefore provide our magical t-shirt pricing. So okay, we've got the data in history, we've got this corporate memory of our data over all time. And then we want to make it usable; we want to apply some context and understanding to that data. Because the data factories may store the data in tables that make sense to them; they're quick and easy for them, to allow you to update that information and search on their screens and find it. But when we bring it into the data platform, under the covers it's pretty ugly. It may have that horrible thing we call a party entity, where one table holds customers and employees and partners and vendors and everybody else, and then there's another horrible table that you've got to join to, to look up who's the customer and who's the supplier and who's an employee. And so to deal with that, we have a layer that sits right on top of history, which we call events. That's where we model the data. And as I said in the previous podcast, we only model with three types. There's a concept, which is a thing: a customer, a product, a supplier, an employee, an order. There's detail about those things: the customer's name, the customer's address, the supplier's store or warehouse, or the order value or the order quantity. And the third one is an event, where we know that customer purchased product, or customer ordered product, or customer paid for order, where there's a combination of concepts that gives us, at a point in time, a record that something happened. So the event layer is where we do the gnarly work to define those core business processes and those core business events, and to structure the data that way so that we can understand it.
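A sketch of the three modelling shapes Shane describes, concept, detail and event, as they might appear in the event layer. The table and column names are illustrative, and the pattern is shown loosely in the spirit of Data Vault hubs, satellites and links, which the next turn touches on; this is not the actual AgileData schema:

```python
# Illustrative only: dataset, table and column names are assumptions.
EVENTS_LAYER_DDL = """
CREATE TABLE events.concept_customer (              -- a concept: the thing itself
  customer_key  STRING NOT NULL,                    -- the unique business key
  first_seen_at TIMESTAMP
);

CREATE TABLE events.detail_customer (               -- detail: attributes about the thing, over time
  customer_key   STRING NOT NULL,
  name           STRING,
  address        STRING,
  effective_from TIMESTAMP
);

CREATE TABLE events.event_customer_orders_product ( -- event: which concepts did what, and when
  customer_key STRING NOT NULL,
  product_key  STRING NOT NULL,
  order_key    STRING NOT NULL,
  occurred_at  TIMESTAMP
);
"""
```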
Nigel Vining: Yes, and under the covers the events are quite a fun layer, I guess from a plumbing point of view; it's where our patterns really get used in anger for the first time, because effectively there's a lot of stuff happening in events. Up until now, landing into history is quite a vanilla data movement, because we're just picking up the data and moving it through as it is; we really aren't changing the structure or adding to it. Once we go from history to events, that's when we effectively break all the data down and split it out into the individual events. And it's where our plumbing patterns start to come into play: how we build those event objects, what's common to them around keys, versioning dates, basically what those tables look like. We use patterns to speed up how quickly we can break down a raw history table and split it out into value-add event parts.
Shane Gibson: And we're leveraging patterns that have been around for a while. So the concept, detail and event model leverages heavily off a pattern called Data Vault as a way of data modeling, which itself leverages off some event modeling and business analysis modeling that has been defined around who does what. So we're reusing patterns that have been around for a while to model that data in a safe way, in a way that matches the organization's business processes and their core business events. Yes, we've used this event pattern for a while; we've done a number of consulting gigs with it. But the challenge was, every time we did it as consultants, we had to build it in a new technology, because the customer always had a slightly different flavor of where they wanted to store the data: it might be Parquet on a file system, it might be Redshift or GPTW, they might want to run it on Spark, or they might run it on SQL, or they're using a database that doesn't allow you to use stored procedures, so you have to buy an ETL tool that does all the heavy lifting. And really, what we were doing was rebuilding those patterns time and time again with different technologies, just because the customer decided they wanted a technology that hadn't been done by us, at least, before. So now what we've done is taken those patterns, hardened them, and made them available as a service. So probably our last year was spent plumbing and hardening those patterns, to make them as magical as they are now.
Nigel Vining: Yes, exactly. And it was quite an exciting year actually, where we effectively condensed our last 10 years' worth of building these platforms. We took those patterns, we whiteboarded them, we condensed them down to a dozen or so, and we basically built them up as something we could use over and over again without having to start from scratch every time. And suddenly the economies of scale are massive, because if you always build an event table the same way, you're not reinventing the wheel every time; if your layers always exchange these Pac-Men between each other, you're not having to reinvent all those integration wheels every time. And that's where the cost and time to build those layers is usually massive, because there's always a lot of complexity in them. And we figured, why not just spend a couple of months and make each pattern something we can use over and over again. And it appears to be working quite well.
Shane Gibson: I think it has, because we haven't had to touch those for a while; we've spent most of our time working on the rule types and some of the other areas, and we haven't actually had to go back and refactor the core patterns of how we create and populate those concepts, details and events. So for me, there's been minimal technical debt and minimal need to refactor at the moment. So that was pretty cool.
Nigel Vining: Yes, great. And we've done that across multiple initial-stage customers, and the patterns actually fit probably a good 95% of what needed to be done for each of those customers. And what's more, they were diverse customers: different data types, different inputs, different outputs. But because the patterns work so well, they actually did pretty much everything we needed to do, with the exception of writing some additional rules, which we're always going to have to do anyway for the specific use cases.
Shane Gibson: Now, one of the problems with Data Vault-centric modeling, and it's well known out there, is that you end up with too many tables. You're an analyst, you go in there, and it really is horrible trying to put the data back together. You've got to understand: I've got this concept of customer, and then I've got all this detail about it, and the detail may be physically stored in different tables for certain reasons, and now I've got to go put that back together. And then I've got to go and hook that up to my event, and if that event involves a product, go grab the product concept and all the detail about the product. And I don't write code, and I don't know if I could write SQL that puts all that together. So for us, we wanted to make that simple. And so what we did was add one more layer on top, which is our last layer, which we call the "consume layer". And that's where, once we've defined these concepts, events and details, effectively we understand what they mean, because they've got business terms, and as the modeling is done, we have patterns that create them automagically. So what we do is reverse that: we basically put those events back together in consume, where we can have a table or a view that enables you, as an analyst, as a user, to very easily and quickly query that data without having to do all those horrible joins.
Nigel Vining: Exactly. And if there's one layer that comes close to a truly magical AgileData experience, it's consume, because as you say, effectively we undo all the complexity that we've created under the covers to make it simple. We present back to the users a nicely set up, effectively business, layer with all the business terms; the tables are nice and clean, and it's a straightforward experience for them. Doing that under the covers has a little bit of complexity, but it's where the Data Vault pattern actually helps us, because we know how all those objects go back together, so we can create rules that put them back together for us relatively easily. As Shane said, he doesn't write SQL and we don't expect him to; he just creates a rule that says, I want this, this and this together, and effectively a consume object turns up. It's at the correct grain, it's correctly joined at the correct point in time, and he can use it: there are no joins, and there's no complexity. That's the magic.
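A sketch of what a generated consume object might look like: one wide, denormalised view that re-joins the concepts, details and events so the analyst never writes the joins. All names are illustrative, the details are assumed to be pre-filtered to their current rows for simplicity, and the real generated SQL is not shown in the episode:

```python
# Illustrative only: dataset, table and column names are assumptions.
CONSUME_VIEW_SQL = """
CREATE OR REPLACE VIEW consume.customer_orders_product AS
SELECT
  e.occurred_at,
  c.customer_key, cd.name AS customer_name, cd.address AS customer_address,
  p.product_key,  pd.product_name,
  e.order_key,    od.order_value, od.order_quantity
FROM events.event_customer_orders_product AS e
JOIN events.concept_customer AS c  ON c.customer_key = e.customer_key
JOIN events.detail_customer  AS cd ON cd.customer_key = c.customer_key
JOIN events.concept_product  AS p  ON p.product_key = e.product_key
JOIN events.detail_product   AS pd ON pd.product_key = p.product_key
JOIN events.detail_order     AS od ON od.order_key = e.order_key
-- Simplification: details are assumed current; the generated pattern also
-- resolves the correct point-in-time version of each detail.
"""
```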
Shane Gibson: And for me, what that means is I can bring in some data from Salesforce that has customer in it. We automatically drop it in a bucket, it goes through landing and into history, where it's stored for all time. I'll then go and create a rule to say, that's the identifier for our customers, and this is the detail about them. And what that gives me in consume is a single table for customers with all the details. I'll then go back and grab some data from another data factory; I might grab something out of the financial system, where the payments live. Again, that data will come in and flow all the way through to history. Then in the event layer, I'll create a rule that says, this financial system has got this idea of a customer, here's the ID, the key, the business identifier for those customers, and here's the detail about them. And again, consume gets a unique view of those customers, so I can query them. And then I go, "Well, cool, but having a table of customers from Salesforce and a table of customers from my financial system is a bit of a pain in the ass, because I've got to put them back together again." So I'll create a third concept, where I put the two together. If the keys are the same, if the identifier for a customer is one in Salesforce and that customer has the identifier one in my financial system, then my rule is really simple: I don't need to write any rule logic, I don't need to apply any change to that data; it just automatically gives me a single table with all my customers' data from Salesforce and from my financial system. But if those identifiers are different, or if I want to de-dupe them or do some magic, then we have rule types that enable that. The key being, I then just go into consume, and I have one table that has my customers and all my details from all my data factories. The same thing happens for products, and more importantly, the same thing happens for events. So if we have an event of customer orders product, when I go into consume, I will see one table that has the customer and all the details, the order and all the details, and the product that was ordered and all the details, in one large table, what we call a denormalized table. And the reason for that is I don't want to have to join; I can just point my BI visualization tool at that table and make it look beautiful. But then again, same problem though, right Nigel: that's going to run like crap, isn't it? We're going to get tables with 300 columns and a billion rows, and it's just going to run like a dog.
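A sketch of the "third concept" Shane describes: when Salesforce and the finance system share the same customer identifier, combining them needs no rule logic, just a join on the business key. All table and column names are illustrative, not the actual AgileData objects:

```python
# Illustrative only: every name here is an assumption.
COMBINED_CUSTOMER_SQL = """
CREATE OR REPLACE VIEW consume.customer AS
SELECT
  COALESCE(sf.customer_key, fin.customer_key) AS customer_key,
  sf.name, sf.email,                      -- details sourced from Salesforce
  fin.billing_address, fin.credit_limit   -- details sourced from the finance system
FROM events.detail_customer_salesforce AS sf
FULL OUTER JOIN events.detail_customer_finance AS fin
  ON fin.customer_key = sf.customer_key
-- If the identifiers differed, a matching or de-duplication rule type would
-- be applied here instead of the simple key equality.
"""
```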
Nigel Vining: Well, in the old world that would be the case, and we wouldn't get away with those patterns because they would basically die and not run. But luckily, because we're leveraging the power of a columnar database, you can have as many columns as you like; it makes no difference at all to performance, because columnar databases love columns, funnily enough. Lots of rows, again, not a problem, because under the covers you're now leveraging what's generally called an MPP database, massively parallel processing. So the amount of data makes no difference anymore, because under the covers your query is chopped up into smaller pieces, farmed out across a nice big processing array, and the results are put back together and given to you, still within seconds. You don't know that it's gone off and done a round trip of a couple of hundred hours of elapsed compute under the covers; it still gives you the results back in a few seconds. We also leverage, what else are we leveraging? We still carry the partitioning in the event layer, so when we build up to consume, the queries are storage-optimized, because the store already knows the partitions of the data and the clustered keys, and that helps us. But that's all under the covers; the users don't see a lot of these technologies. All they know is that the query ran really fast. It's got lots of rows and lots of columns, but it still only took a few seconds, so I'm happy.
Shane Gibson: And there are a lot of technology patterns under the covers, in the technology we use, that we haven't even touched yet. So there are technical things like materialized views, which we haven't even needed to bring into the way we do the data patterns or the technology patterns yet. But we will get to stages where we have customers with data that gives us certain scenarios where we'll need to refactor some of those consume areas to optimize for that. One of the decisions we constantly go back and forth on is: should the consume area be physical or virtual? Do we deploy what we call a view into consume, which reaches back and just reads the data out of events, and we never persist it? Or do we persist the data in there, because there's some benefit in doing that? And eventually, where I think we'll end up is we'll have a form of optimization within agiledata.io where, depending on the size of the data, the response times and the type of data that's being surfaced in consume, the way we store it will be different. And the same is true for the BI tools, the visualization tools. Some visualization tools love big denormalized tables, some of the older tools still like star schemas and the idea of facts and dims, and some of the interesting tools still like relational data models. And so from our point of view, we really don't care: we can deploy the consume tables or views in any structural format that suits those BI tools, because we're effectively taking the data out of the event layer and surfacing it in a way that makes it easy for the analyst, the user and the tool itself.
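A sketch of the "physical or virtual consume" trade-off Shane mentions: the same generated SELECT can be deployed as a view, which always reads from events, or persisted as a table, chosen per consume object. The function, its parameters and the BigQuery flavour are illustrative assumptions, not how AgileData actually decides:

```python
# Illustrative only: the deployment helper and its naming are assumptions.
from google.cloud import bigquery

def deploy_consume(bq: bigquery.Client, name: str, select_sql: str, materialise: bool) -> None:
    # Persisting trades storage and refresh cost for faster, cheaper reads;
    # a view keeps consume virtual and always current with the event layer.
    object_type = "TABLE" if materialise else "VIEW"
    bq.query(f"CREATE OR REPLACE {object_type} consume.{name} AS {select_sql}").result()
```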
Nigel Vining: Yes, agreed. The layers and patterns give us the flexibility to do any of that stuff, essentially on the fly, as applicable. Consume doesn't have to be tables; it could be views, it could be a mixture, depending on that particular rule and the volumes and columns. It could even be driven by what tool the end customer wants to use over that layer. And as you say, if it needs to be a star, it's not a drama, because it's just a rule that says spit out consume as a star for your tool. It makes no difference on our part, because it's a pattern: if you want a star, it will produce a star; if you want a big wide table, it will produce that. It's just a pattern under the covers. That's the joy of the patterns.
Shane Gibson: And a good example of that is when we had to start delivering the data, the consumable data, via an API. Again, all we did was create a new feature, a new PacMan, that enabled you to call that data via an API. And the API was effectively calling the consume area. So again, just another pattern on top of another pattern, and it was all good. Worked like a thing of magic, a thing of beauty.
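A sketch of "the API effectively calling the consume area": a hypothetical read-only endpoint that serves rows from a consume table. FastAPI, the route, and the table name are illustrative choices, not how AgileData built its API:

```python
# Illustrative only: framework choice, route and table name are assumptions.
from fastapi import FastAPI
from google.cloud import bigquery

app = FastAPI()
bq = bigquery.Client()

@app.get("/customers/{customer_key}")
def get_customer(customer_key: str) -> list[dict]:
    # The API is just another pattern sitting on top of the consume layer.
    job = bq.query(
        "SELECT * FROM consume.customer WHERE customer_key = @key",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("key", "STRING", customer_key)]
        ),
    )
    return [dict(row) for row in job.result()]
```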
Nigel Vining: Yes, I'd forgotten the APIs were even there. Files, APIs, views, tables, you could have any of those under the covers, even a chatbot.
Shane Gibson: So in the new data world, there are people saying you don't actually need to have layers, that you can magically do without them. And potentially we'll get there, where our layers are all in code and we're not physically storing data after we land it into history, and every moving part is virtualized as a series of views driven off the rule types. I'm not sure that actually has as much value compared to the cost of us doing it, but it's something we can do if we need to. But the key thing for us is, we started off with a set of layers that we defined, and then we iterated on them. We're very clear about what those layers are, what goes in them, what the value is, and why we use them. And therefore when we make changes, we know which layer we're talking about. So when Nigel and I are talking about the next thing we want to build, pretty much the first thing he always does is validate with me: so we're talking about doing something in the event layer? And I'm like, yes, that's where I am. Because it gives some clarity between us, in terms of the conversation we're having, given the large number of moving parts we have to deal with the complexity of data that our customers have and make it so simple and so magical. So, if you're doing a data project, and the people you're working with have no clarity around the layers, if they say, just put it on the lake and you'll be good, I'm not sure I agree. I'd probably say, get a whiteboard out, or a piece of paper, and maybe just sketch out a couple of simple layers that have some value, have some use, and articulate what you can and can't do within those layers.
Nigel Vining: Yes, I think my final point is that I firmly believe, I've always believed, that layers actually keep you safe. They provide a layer of isolation, going all the way back to Ralph Kimball and the kitchen analogy, back of house and front of house. Layers keep you safe: they allow you to replumb under the covers, and the diners out in your restaurant are oblivious to it, but the data still comes out on the plate. And recently we replumbed the whole event layer for a customer. They had no idea, because the consume layer delivered exactly what they expected. We re-plumbed it under the covers for updates and new features. Isolated, safe, no impact to the end user, because it's on a different layer. Layers keep you safe. That's what I believe.
Shane Gibson: So, we've taken an age-old data pattern, layering the data through your platform, and applied it in a new and beautiful way. It's like new Asian-inspired cuisine: we've taken some old ingredients and we've made something that's modern and new and damn tasty.
PODCAST OUTRO: So, that's us on layers. That was another AgileData podcast from Nigel and Shane. If you want to learn more about how you can apply Agile ways of working to your data, head over to agiledata.io.