AgileData Patterns

Oct 7, 2020 | AgileData Podcast, Podcast

Join Shane and Nigel as they discuss what agile data patterns are and how you can apply them to speed up the delivery of data to your stakeholders.

Guests

Nigel Vining
Shane Gibson

Podcast Transcript

Read along you will

PODCAST INTRO: Welcome to the “AgileData” podcast where Shane and Nigel discuss the techniques they use to bring an Agile way of working to the data world in a simply magical way.

Shane Gibson: Welcome to the AgileData podcast. I’m Shane Gibson.

Nigel Vining: And I’m Nigel Vining.

Shane Gibson: So today Nigel and I are going to have a little chat about this thing we call patterns. And for me, patterns are one of the core things we’ve developed in our consulting gigs over the last 20 years. And it’s something that we leverage a lot in the way we deliver the magic within the AgileData product. So for me, a pattern really is something we know we can apply in a particular problem space or a particular use case that will solve that problem quicker than building something from scratch. Sometimes we can apply a pattern to the wrong thing and make it worse. But when we apply the right pattern to the right problem, we know that problem will be solved in a certain way. And that takes us less time, less effort and less risk to fix that problem. So Nigel, what’s your view of patterns? How would you describe them?

Nigel Vining: So for me, patterns are probably the way I think in terms of my consulting toolbox. In my toolbox, as it were, I know a whole lot of different ways to make something work, rightly or wrongly. Whenever a customer asks me to deliver a piece of functionality or a feature, I automatically go to my toolbox, identify a pattern that I’ve used before successfully, and apply that to their problem to achieve the desired outcome.

Shane Gibson: An example for me might be the file drop stuff that we built. So we have a pattern where somebody can grab a CSV file or JSON file and drop it in a place. When we see that it has turned up, we do something with it; in our case, we move it into history. So that’s a reusable pattern. We know that it allows users or analysts to upload the data quickly, and it turns up where we need it. So once we’ve built that pattern, we lock it, we load it, we make it hardened. And then we know that the question of “I have this data and it’s manual, how do I get it there?” is taken care of.

Nigel Vining: I agree with that. The file drop is a really good example. And the file drop pattern has evolved over time. When we started, that pattern was quite rudimentary. Then as we went along, we finessed it and finessed it. Now it’s a very hardened piece of logic. We know exactly what it does every time, it’s repeatable, and there’s no effort on our part to apply it.
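
The file drop pattern described above could be sketched roughly as follows. This is a minimal illustration and not the AgileData implementation: the function name `process_drop_folder`, the local directories standing in for the drop location and the history area, and the timestamp prefix on file names are all assumptions made for the example.

```python
from __future__ import annotations

import shutil
from datetime import datetime, timezone
from pathlib import Path

# Assumption: only the two file types mentioned in the conversation are accepted.
ACCEPTED_SUFFIXES = {".csv", ".json"}


def process_drop_folder(drop_dir: Path, history_dir: Path) -> list[Path]:
    """Move every accepted file from drop_dir into history_dir.

    Returns the new paths of the files that were moved.
    """
    history_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(drop_dir.iterdir()):
        if not path.is_file() or path.suffix.lower() not in ACCEPTED_SUFFIXES:
            continue  # anything else is left where it was dropped
        # Hypothetical naming scheme: prefix with a UTC timestamp so that
        # repeated drops of the same file name do not overwrite each other.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        target = history_dir / f"{stamp}_{path.name}"
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling `process_drop_folder(Path("drop"), Path("history"))` on a schedule, or from a file-arrival trigger, would give the "drop it and it turns up" behaviour; the hardening Nigel mentions (validation, retries, logging) is left out of the sketch.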

Shane Gibson: Probably another example is the data modeling techniques we use. So in the old days, we would have done star schemas, we would have done slowly changing dimensions type two. They were patterns that we got from our friend Kimball, and they allowed us to say: we have some data, we need to structure it in a certain way to make it useful. And that dimensional modeling pattern was the one that we used. These days, in terms of AgileData, we use more of a hybrid database model under the covers for a whole raft of reasons, but again, it’s a pattern. So for us, we talk about concepts, details, and events. So there are concepts, which relate to things that we may want to count or manage. So products, customers, employees, stores, orders, payments: they’re all concepts of things we need to manage. There’s some detail about those concepts. So it might be a person’s name, or a person’s age, or the product quantity, or the order value, or the payment amount, or the employee’s location. And then the third thing we care about is the pattern of an event. So when we say “a customer orders a product”, we know there’s a certain pattern that we can use to describe that event and reuse it. So for me, modeling in the data world is a bunch of patterns. I suppose one of the things we’ve seen is that with the big data bollocks, we lost those patterns. We started getting big data scientists that had no understanding of patterns. And so they would just write a raft of black-box code that had no repeatability in the way it was written, no repeatability in the way the data modeling was done, and chaos reigned, as it tends to do in this space.

Nigel Vining: Yeah, if we’re talking about the event modeling, that’s quite a nice pattern for us, because under the covers it’s got a really nice structure to it. So the actual coding and DevOps work that goes with modeling those events is quite a nice repeatable piece of plumbing for us to write, because we can write a pattern that deploys those models and then basically reuse it no matter what we throw at it, whether we’re throwing customer data, product data, or event data at it. The model automatically adjusts because it’s a pattern. I quite like that one personally.
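
To make the concepts, details, and events idea concrete, here is one possible way to express it in code. It is only a sketch under assumptions: the class names `Concept`, `Detail`, and `Event` mirror the terms used in the conversation, but the fields and the `record_event` helper are hypothetical and do not describe how AgileData stores its model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A thing we want to count or manage, identified by a business key."""
    kind: str  # e.g. "customer", "product", "order"
    key: str


@dataclass(frozen=True)
class Detail:
    """A piece of descriptive data about one concept."""
    concept: Concept
    name: str  # e.g. "name", "age", "location"
    value: object


@dataclass(frozen=True)
class Event:
    """Something that happened, linking the concepts that took part in it."""
    name: str  # e.g. "customer orders product"
    concepts: tuple[Concept, ...]
    occurred_at: str  # ISO 8601 timestamp, kept as text for simplicity


def record_event(name: str, occurred_at: str, **participants: str) -> Event:
    """Build an event from kind=key pairs; the same code serves any event."""
    concepts = tuple(Concept(kind, key) for kind, key in participants.items())
    return Event(name, concepts, occurred_at)
```

For example, `record_event("customer orders product", "2020-10-07T09:00:00Z", customer="C-1", product="P-9")` and an event linking an order to a payment go through the same function, which is the reuse Nigel is describing: the structure stays fixed while the data thrown at it changes.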

Shane Gibson: And I think also the concept we have of rules, and the way we use natural language as a way of describing our ETL, or ELT, or effectively writing it, is a form of a pattern. So we find a problem in some data, we determine the pattern we can use to fix that problem, and then we make that pattern available as a rule type, so that any analyst can quickly apply that pattern to data. So an example I had the other day was with some of the customer data we’re working with: we needed to validate cities. As we always know, when data is entered into a source system, or what we call a Data Factory, it gets manufactured at various levels of quality. So the state of the data in this field called ‘City’ was particularly unclean, and it was fraught. So what we did was we created a reference data set of all the cities in the world. And then we created a rule type where you could compare the data in the customer table to that reference data set, and it would come back and flag the ones that didn’t match. And then you could take some action: you could fix it in the source system, or you could determine some rules where you wanted to clean it up on the way through. So we’ve baked that in as a rule pattern, and next time you want to do that you say “validate cities”. What was interesting is that because New Zealand’s address structure is kind of funky, a lot of the data that was in the City field for the customer was actually a suburb or locality. So what we did is we extended the pattern for the validation out. We went and grabbed a list of all addresses in New Zealand and brought them in as a reference data set. We tagged each of those types of address information, such as suburb and locality, as a bunch of features. And then we updated that rule type to say: validate city, but as well as validating against the list of cities from the world, also include suburbs and localities out of that New Zealand reference data set. So again, it was a bit of work for us to bake that pattern. But next time I do it, I click on the button and I can say validate against world cities, or extend the validation out to this other reference data set. So again, I’ve got a problem, I have a reusable pattern for solving that problem, and I can apply that pattern in less than a minute and get the value from it.
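
A rule type of the kind Shane describes, comparing a column against one or more reference data sets and flagging what does not match, might look something like the sketch below. The function names, the normalisation step, and the tiny sample reference sets are assumptions for illustration only; the real rule type and reference data are not shown in the conversation.

```python
from typing import Iterable, List


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so ' karori ' matches 'Karori'."""
    return " ".join(value.strip().lower().split())


def validate_against_reference(
    values: Iterable[str], *reference_sets: Iterable[str]
) -> List[str]:
    """Return the values that appear in none of the reference sets."""
    allowed = set()
    for reference in reference_sets:
        allowed.update(normalise(entry) for entry in reference)
    return [value for value in values if normalise(value) not in allowed]


# Illustrative sample data, not the real reference data sets.
WORLD_CITIES = {"Wellington", "Auckland", "Sydney"}
NZ_SUBURBS_AND_LOCALITIES = {"Karori", "Ponsonby"}

customer_cities = ["Wellington", " karori ", "Wellingtn", "Ponsonby"]

# Validating against world cities only flags the suburbs as well as the typo.
flagged_world_only = validate_against_reference(customer_cities, WORLD_CITIES)

# Extending the rule with the New Zealand reference set leaves only the typo.
flagged_extended = validate_against_reference(
    customer_cities, WORLD_CITIES, NZ_SUBURBS_AND_LOCALITIES
)
```

Passing the extra reference set as another argument is one way to model "extend the validation out to this other reference data set" without changing the rule itself.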

Nigel Vining: I was just trying to think of an analogy, and that’s a good one: build once, reuse many. And that’s where, in my consulting background, patterns are nice, because you can quickly apply them to the problem at hand to solve something.

Shane Gibson: Yeah, I remember the really early days in the consulting market in New Zealand, when there were a bunch of experts. And it was back in the days when we used to use diskettes. So three and a half inch disks, which we used to carry around before USB drives were a real thing. And there were a bunch of people that had some code on those diskettes that they effectively kept in their top pocket. And so if you had a problem and you knew the person had a pattern to apply to that problem, you had to hire them, and they’d bring in their little diskette with the code that fixed the problem for you. There was always an argument at the end about whether the customer owned the fix for that. And so actually, the accessibility of those patterns was pretty limited. And we’ve seen a lot of change with open source now, with things like Airflow and those kinds of tools where a lot of the patterns have been shared. And so for us, what we want to do is remove the complexity of applying those patterns and make it a one-click magical experience. So: I have this problem, I click a button, it solves it for me.

Nigel Vining: I have to agree with that; I used to carry one. It was a CD at the time, and it was literally a CD filled with SQL scripts and stored procedure scripts for Oracle. At each gig you would effectively put your CD in, grab your patterns out, change the names, and apply them to the customer’s environment. As Shane said, patterns are now accessible to everyone. A lot of them are open source, and people freely share them. It’s very democratized how patterns get used now.

Shane Gibson: Also there’s the idea that every time we touch a piece of customer data, every time we work with a new customer, we get given a new problem we don’t have a pattern for, or one we haven’t actually made into a rule type. So that’s what’s really cool as you go. Actually, there are 101 problems we could solve. And if we guess what the priority is, we’ll probably pick ones that may or may not be common. So there’s a bunch of patterns that we have applied from day one, because we knew they had value. But now it’s a case of every time we strike a new problem with customers and their data, we look for the patterns that we can make reusable for everybody else. I think it’s also important to say that patterns aren’t only about technology and code. So we have this concept of an information product, which is a pattern of how we can engage with business owners or business stakeholders and very quickly understand the core business questions that they want answered. So how many products have I sold? How long does it take to go from selling it to getting paid? Things like that. And then also the events that make up that, which is the data that we need to be able to answer those questions. And so that is a pattern we use repeatedly when we’re talking to people to understand what they want. But it’s not code, it’s not technology, it’s just a way of solving the problem, which is: how do I suck out of the stakeholder’s brain the things that are really important to deliver first? Because from an Agile point of view, we want to deliver the highest value piece of work first.

Nigel Vining: It’s a people pattern. Effectively, it’s not code, but it’s a standard list of questions, like a checklist, that we work through with new customers to get the things we need to know up front and take them on the journey to elicit the responses that we know we’re going to need from them.

Shane Gibson: One of the other things to look at when you’re talking about patterns is the inappropriate use of patterns. So one hammer for every nail, because it is tempting when you have a certain way of working or a certain pattern that can solve many problems. And whenever you see a problem, it’s like, can I use that pattern on day one rather than go to all the effort of investing in a new pattern? So you’ve got to be a little bit careful about that. Again, data lakes as an example: data lakes have value. The ability to store structured and unstructured information quickly in one place, to be able to use it, is a pattern that has value. But the idea of using a data lake as the only way of managing your data is just ridiculous. We’ve been through this phase of one pattern to rule them all. So now we’re moving into the land of hybrids. I was watching and listening to one of the excellent webinars from Eckerson, one of my favorite analyst companies in terms of providing insight into where the market’s going and what’s happening. And they were talking about a data lakehouse. So the combination of using a data lake to get data landed very quickly, and then moving back into some of the more governed patterns of a data warehouse to fix the problem of ungoverned code, and ungoverned data, and lots of people writing their own stuff or using their own patterns, and then stakeholders getting 25 different answers to the same question. So we’ve got to be a little bit careful of overusing patterns where they don’t fit.

Nigel Vining: That’s an interesting point, actually. What came to mind for me was almost an anti-pattern. Recently, an analyst I was working with, who has a background in traditional relational databases, had come up with some beautiful code to produce an answer. And then what he had done is he’d taken his pattern and applied it to a modern, massively parallel database. It just literally didn’t work, because his paradigm was wrong. His code was great, but it was never going to run on this new platform, because it was effectively what I would call an anti-pattern. It didn’t take much to explain to him why it didn’t work and how he should adjust it, and he did. And it worked fine. It flew. He got his results in literally minutes, whereas previously it ran for over an hour to achieve nothing. So that’s where you could say a pattern in one environment becomes an anti-pattern in another.

Shane Gibson: And for me, I’ve kind of got to the stage now where I talk about technique. So from an AgileData point of view, I really like the analogy of cooking and food and restaurants and the skill in doing that. So if we talk about it that way, you have some technology, and that’s really the utensils and equipment that you have in a kitchen. And you have a bunch of ingredients, which is typically the data that we have to play with to produce a meal, or produce an answer to a question. Then our ability to use techniques with both those ingredients and those tools or equipment is the combination that gives us the value. So if I had a cast iron pot and I was trying to make a sauce, it’s going to be very different than if I’ve got a wok, or an aluminium pot, or a barbecue. So if I had excellent technique, which I don’t, I probably could make that sauce in a wok. But I probably need to move towards one of the tools or pieces of equipment that’s more fit for purpose. And with the amount of new technology and variability we have in the data space these days, applying the technique that we knew worked for us in the past with new technology is sometimes inappropriate. So to take it back to that person: he had great skills, he had great technique for certain types of databases. But when we apply that technique to the new types of databases, like BigQuery, it isn’t optimized for them. It’ll run, but it won’t run well. It’ll be a sauce, but it won’t be a beautiful smooth sauce. I think what’s important is understanding the combination of your skills, your techniques, reusing those patterns, and reusing that equipment to get the best outcome.

Nigel Vining: I like that analogy, that’s very appropriate.

Shane Gibson: And so for me, what I think about now in terms of those techniques and patterns is how do we teach them? How do we coach them? So we have a level of maturity for those things. So if I can explain it to somebody else, and they can use that pattern and apply it using a technique the same way I do, then that’s a really hardened pattern. It’s mature, I can teach it. Before that, I’m probably requiring an expert who understands all the complexity and can take what I’m telling them and apply it. And below that, we’re starting to get down to not a novice, but somewhere in the middle there, where somebody could potentially follow that way. And then ideally, we’ve got a novice, if we can teach a novice that technique, or let them leverage that technique. So going back to the city example, if a novice could say “that column’s called ‘City’, apply this rule, tell me what’s wrong”, that’s really nirvana: being able to empower analysts to apply those techniques in a magical way, which is ideally what we’re after. One of the things I struggle with, though, is that if you have somebody that’s at an expert or coaching level, they often don’t want to adopt other people’s patterns or techniques, and that’s a really interesting experience. Would a hardened data engineer use our platform to change their data because we are faster than they will ever be? So far, in our experience, the answer’s no.

Nigel Vining: I think a lot of people fall back on what they know. And what they know is they have their own toolkit, and they grab their code out, like your floppy disk example. And they want to write their ETLs, ELTs, or their transformations, and they want to write them by hand, because it’s what they’ve always done. They know it works. They know it takes a long time. They know the testing is complicated, but it’s a comfortable, known pattern. I think even though you can remove all that complexity by using something robust under the covers like agiledata.io, people are nervous to deviate from what they know.

Shane Gibson: Also, because it changes the way they get paid.

Nigel Vining: Yes, you’re right, because you’re effectively saying: I can now produce all of your transformational logic or your patterns in five minutes using this template, or you can spend a week and hand-code them. So it is undermining how you get paid per hour.

Shane Gibson: And so when we look at where technology’s gone in the data marketplace, we see we’ve moved away from suites of products back to best of breed silos. So if we want to look at how we collect data out of those data factories, we need some patterns and techniques which require a certain technology stack. Where we want to land it or put it down somewhere so we can use it, we need some more techniques and patterns, so another technology stack. Where we want to load it into some form of data storage that can change the way that data looks, we need another one. And then when we want to write code that actually makes those changes, we need another one. When we want a catalog so users can understand what we hold and what it means, we need another one. My view is that over the next five to seven years, people will get upset or have had enough of cobbling all their best of breed back together. And we’ll move back to suites, because we always seem to go in a seven year cycle. So it’ll be interesting to see, when that happens, whether this concept of a pattern becomes more prominent, or whether it’s again bake your own, within suites or within best of breed.

Nigel Vining: I agree, the next five or six years will be interesting. There’s a certain level of dissatisfaction with the amount of integration that seems to be required for a number of the new best of breed platforms, because they are literally just a platform. And there’s a lot of integration required to make them do smart things.

Shane Gibson: And again, it’s just a pattern problem. So we know that for dropping a file into Google Storage and then moving it into BigQuery, which is what we do, we have a bunch of patterns that do that for us. So we don’t have to care. If we wanted to drop data into an S3 bucket in AWS and have patterns bring that into BigQuery, again, it’s a slightly different technique, but it’s the same pattern. So if we wanted to, we could apply a pattern model to allow us to integrate these best of breed products, but that’s not how the marketplace seems to think. It always seems to rely on consulting companies and technical people to spend days, weeks, years to do that integration, and to reinvent the same thing. So ideally, the market will move to more patterns. One of the things I still struggle with is how do we write patterns down in a way that somebody else can read them, understand how to apply that pattern and which techniques they need, and more importantly, understand when to apply them and when not to. So that’s something I think we probably need to work on a bit more: explaining those patterns, because the patterns are valuable. And we have no problem sharing those patterns. We just struggle with ways of being able to do it without me talking and waving my hands. So hopefully we’ll get rid of that in the future, and we’ll publish more of our patterns so we can add more value to other people.
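
The "same pattern, slightly different technique" point about loading from different cloud storage services can be illustrated with a small sketch. Everything here is a stand-in: the dictionaries play the role of storage buckets and the warehouse, and the function names are hypothetical. No real Google Cloud or AWS client calls are shown, since the conversation does not describe them.

```python
from typing import Callable, Dict, List

Rows = List[dict]


def make_load_pattern(
    read_source: Callable[[str], Rows],
    write_target: Callable[[str, Rows], None],
) -> Callable[[str, str], int]:
    """The pattern: read a dropped object, load it into a table, report the row count.

    The technique (how to read, how to write) is passed in as functions.
    """
    def run(object_name: str, table: str) -> int:
        rows = read_source(object_name)
        write_target(table, rows)
        return len(rows)
    return run


# Hypothetical in-memory stand-ins for two object stores and a warehouse.
STORE_A: Dict[str, Rows] = {"customers.csv": [{"id": 1}, {"id": 2}]}
STORE_B: Dict[str, Rows] = {"orders.json": [{"id": 10}]}
WAREHOUSE: Dict[str, Rows] = {}


def read_from_store_a(name: str) -> Rows:
    return list(STORE_A[name])


def read_from_store_b(name: str) -> Rows:
    return list(STORE_B[name])


def write_to_warehouse(table: str, rows: Rows) -> None:
    WAREHOUSE.setdefault(table, []).extend(rows)


# Same pattern, two techniques for the source side.
load_from_store_a = make_load_pattern(read_from_store_a, write_to_warehouse)
load_from_store_b = make_load_pattern(read_from_store_b, write_to_warehouse)
```

Swapping in a different reader changes the technique while the surrounding steps stay the same, which is one reading of what Shane means by applying a pattern model to best of breed products.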

Nigel Vining: I look forward to that.

Shane Gibson: Alright. Well, that’s Nigel and I talking about patterns. We’ll catch you next time.

Nigel Vining: Thanks Shane.

PODCAST OUTRO: And that, data magicians, was another AgileData podcast from Nigel and Shane. If you want to learn more about how you can apply Agile ways of working to your data, head over to agiledata.io.