Build Data Products Without A Data Team Using AgileData
TL;DR
Late in 2022 I was lucky enough to talk to Tobias Macey on the Data Engineering podcast about our AgileData SaaS product and our focus on enabling analysts to do the data work without having to rely on a team of data engineers. Listen to the episode or read the transcript.
Listen
Read
Summary
Building data products is an undertaking that has historically required substantial investments of time and talent. With the rise in cloud platforms and self-serve data technologies the barrier to entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.
PODCAST INTRO: Hello and welcome to the “Data Engineering” podcast, the show about modern data management.
Tobias Macey: Your host is Tobias Macey. And today I’m interviewing Shane Gibson about AgileData, a platform that lets you build data products without all of the overhead of managing a data team. So Shane, can you start by introducing yourself?
Shane Gibson: Hi, I’m Shane Gibson. I’m the Chief Product Officer and co-founder of agiledata.io. And I’m a longtime listener and second-time caller.
Tobias Macey: And for folks who didn’t listen to the previous episode you were on, I’ll add a link to that, where we discussed some of the ways that agile methodologies can be applied to data engineering and data management. And so today, we’re focused more on the product aspect of what you’re building. Before we get into that, for folks who didn’t listen to that past episode, if you could just give a refresher on how you first got started working in data?
Shane Gibson: So I started my career out working in financial systems. So in the world before enterprise resource planning, or ERP, when we had separate accounts payable and accounts receivable type modules, I was kind of in systems accounting land, looking after those financial systems. And from there, I jumped across to vendor land. So I started working for some of the big US software companies, but based out of New Zealand. And as part of that, I realized that my passion really wasn’t in those ERPs, but I really liked the idea of data and analytics. And back then, it was really called business intelligence, that’s how old I am. So I did that for probably about a decade, and then from there jumped into founding my own consulting company. So a typical data and analytics consulting company, 10 to 20 people on the team, and we’d go out and help customers implement platforms and strategies, and do the core data work. And from there I moved off into the Agile coaching side. So I found that I had a passion for working with data and analytics teams that were starting their agile journey, and coaching them on ways of working. So applying, or helping them apply, patterns that I’d seen work for other organizations or teams in a particular context, helping them apply those in their way of working, and seeing the success of that team really just start to rocket: enjoy the work, deliver stuff, give value, get feedback. So as part of that, about three and a half years ago, my co-founder, Nigel, and I started agiledata.io, taking a product focus to that kind of capability.
Tobias Macey: Can you give a bit of a high level on what it is that you’re building and some of the ways that that product is manifesting and some of the target audience that is driving the focus of how you’re building it?
Shane Gibson: So we’re combining a software-as-a-service product, is the way we think about it, that manages the whole data process from collection through to ready for consumption, with those agile ways of working. So we believe that any product should be bound by ways of working you can teach people, that make sense. So not a methodology as such, but a set of good practices that make sense when you’re doing a certain part of that data supply chain. So that’s our focus: building out both this way of working that’s repeatable and teachable, and the product that supports it and makes it easy. In terms of the audience, we kind of break it down now to two audiences. The first audience is the buyer, who would actually buy a product capability. And that really is anybody that’s got a data problem. There’s a bunch of people in organizations who know they have data, but they can’t get access to it. They just want their data to be turned into information so they can make a business decision and take action, and then from that action get some outcomes. And they struggle to get that done, and there are different reasons in different organizations for why that happens. So that’s the buyer. For the user, we are hyper focused on a data-savvy analyst. It might be a data analyst, might be a business analyst, or somebody who’s data savvy but actually gets frustrated with the fact that they have to hand off the data work to an engineering team and wait. And it’s not the engineering team’s fault. They’ve just got too much work to do. But the analyst actually wants to get their work done. And so if we can enable them to do that work themselves, giving them that self-service in the data management space with guardrails, so it’s not ad hoc, then we think that we can help those people get that information to the stakeholders quicker, better, and ideally, with a bit more fun.
Tobias Macey: In terms of the types of organizations that are likely to use the AgileData platform, I’m curious if you can give some characteristics of maybe the size or scale or some of the ways that their technical capabilities or engineering teams might be structured, or the skill sets that they might have, or maybe even more notably, the skill sets that they might lack on the team?
Shane Gibson: So currently, what we find is customers that don’t have a data team are the ones that we can help the most. So they are typically somewhere between 2 and 50 or 100 people. They have either started building out their own platform, a software-as-a-service kind of capability, so they’re a startup, or they are an organization that is using a bunch of different software-as-a-service products to run the business. And they’re starting that journey of: we’ve got all this data and now we need to put it together. When we look out in the market to do that, we now have to hire a bunch of data people, we have to hire a bunch of data engineers, we have to buy a bunch of software or software as a service, we have to cobble together this modern data stack, and that’s expensive, and it takes time. So they are the customers that we currently serve. Our focus in the future is really enabling organizations that have those analysts in place, and then have a constraint around the engineering practice. Therefore that whole idea of bringing self-service into that organization, I kind of liken it to the wave we had previously around self-service BI and visualization, the Tableau kind of wave, I call it, where there was a constraint on the ability to create reports. We used to use Microsoft Reporting Services or Business Objects. It was a technical product, and there were a bunch of gates to use it, which were there in place because you had to have competency around a whole lot of things to make that stuff work. And we then saw this wave of self-service capability come into the visualization space, and we enabled more people in the organization to do good work. We see that same wave starting to happen for data management. And we believe that the same self-service capability can be brought to that audience, and enable them to do that work, freeing up the engineers to do the work that they are really, really good at and should be focused on.
Tobias Macey: For teams that maybe have an existing data engineering group or a set of data professionals, what are some of the ways that they might interact with the AgileData platform and some of the maybe support burdens that are alleviated because other stakeholders and analysts in the business can just use AgileData to be able to perform the workflows that might otherwise require the intervention or support of those data engineers or data professionals?
Shane Gibson: Typically, what happens is data engineers like to code, they like to build things out themselves. That’s what they’re trained to do. That’s what they love doing. So we often find if there’s a big data engineering team in an organization, we probably aren’t a good fit for them, because culturally, they like to build rather than buy or lease. If we take that away and look at it, there’s a bunch of plumbing that everybody has to do for data. If you think of it as plumbing, as moving water, there’s a bunch of pipes that you always have to build. And data engineering teams in organizations are stuck with building that plumbing and spending that time day in and day out. So what we look at is, if we automate that plumbing for them, if we take just that pure movement of water from collection through to ready for consumption, then the engineers can focus on the more fun stuff. Now, one of the areas that they still need to focus on is data collection. Because when you talk about a big organization, especially one that has a bunch of data engineers, they typically have a myriad of source systems, or systems of capture, where data goes in. And some of them are easy to get to, software-as-a-service products like Shopify or Salesforce, where it’s fairly open to get the data. But 9 times out of 10, there’s also a bunch of proprietary, firewalled, either on-prem or private cloud capabilities that we can’t connect to. So they solve that data collection work, to be able to pick up the data and make it available, ready to be transformed and consumed. And then if we think about it, we take care of that plumbing, but once that water turns up at the end of that pipe, that’s where the real value is that a data engineer can add, or an engineer in general, because engineers are good at solving problems. And so think about it: that water comes out of the tap, and it’s nice and clean, as much as it can ever be.
Now we have to make that data useful. It’s consumable but it’s not useful. It’s still plain water. So how do the engineers then get involved in the business problems that the organization has around how to use that data? Is that helping the data scientists make better machine learning models, which give better recommendations to their customers? Is that focusing on things like Master Data Management, and the complexity of what the actual rules are that we need to combine that data, because the keys aren’t the same? There are still 101 problems to be solved. But the problems aren’t moving data from left to right. That’s the bit we take care of. So we still see the engineering skill set as having massive value in every organization. But we also see the engineering skill set as being one of the primary constraints for organizations right now in getting data work done. And that’s the bit that we want to automate for them.
Tobias Macey: So as far as the design aspects of how you think about building the AgileData platform, given the focus on these analytical workflows, and on organizations that don’t have a large footprint of data engineers, I’m curious how you think about the design and the interfaces that you want to build out to make these potentially very complex workflows, understandable and manageable by people who don’t necessarily want to make that their entire job, they just want to be able to get something done?
Shane Gibson: So we think about it in two ways. Nigel, my co-founder, is the techie, he’s the plumber, he’s the guy with the experience building these things out over generations. So we think about complete automation in the back end every time. We think about, if people are using this product and we’re not there to watch it happen, how do we make sure it’s bulletproof? And as part of that, we become highly opinionated. So for example, we have baked the pattern into our product where everything is historized. Every time a new record comes in, we check to see whether it’s a change. And if it is, we store that change. We don’t have a conversation about whether you want SCD type 1, or 2, or 5 or 6 behavior, we’re going to automate it. So every change that comes in is stored as a change, and that history is immutable. And by being opinionated about that, we take away some of that complexity from the user. Now there’s a hell of a lot of complexity on our side. Because if we’re getting fed snapshot data, we’ve got to do the delta differences. If we’re getting event streaming, we’ve got to figure out whether it was an event change or a new event record. There’s a lot of complexity for us that we need to plumb under the covers, but that’s our job. So that’s the first thing we do: whenever we see a pattern that had complexity when we were consultants delivering it for a customer, how do we automate that in the back end and make it as bulletproof as we can? The second lens we then take is around the product. So how do we build this interface, this app? And how do we make it available to an analyst, where they’re building out data stuff that is typically complex, and make it easy for them? So an example there is data design. So we’re really strong proponents, and highly opinionated, that data should always be designed, and it should be lightly designed.
So how do we take this data design process, this data modeling process, and help analysts do it in five minutes rather than six months? And how do we give them tools that allow them to do it, where they’re not having to do ERDs, entity relationship diagrams, and we don’t have to teach them about ducks’ feet, and many-to-many joins and all those kinds of things? And so what we did was we prototyped and tested this idea of a design canvas, where they can go in and say, I have some core concepts: customer, product, order. I can drag those tiles to say there’s an event, customer orders product, and that actually builds a conceptual design in our product. And then, okay, now I’ve got a conceptual design, how do I populate it with data? We’ve built a rules engine that uses natural language. So given this data in this history tile, say Shopify customer, where we have got these fields, name, date of birth, those kinds of things, they populate this detail tile about the customer. And then when new data comes in, we automatically trigger a load and trigger the change detection. And if somebody’s changed their last name, we bring in a new record that says the change on that date was the name change. So effectively, by working with analysts we look at some things that we used to do unconsciously as engineers and data specialists. And we try and figure out how we build interfaces that make it really simple for them to do that work without knowing the complexity that lives under the covers. And, to be fair, that’s hard. Every time we do one, we have to sit back and go, as humans, what do we do? What are we looking for? What are the things we look for that are triggers to make a decision that we need to do a piece of data work, and therefore, can we automate that so they never have to do it? Or do we need to prompt them with some decisions, and by their making a choice of yes or no, we do that work for them?
So that’s the focus: how do we bring that interface to an analyst, where they do the work without knowing they’re doing it?
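Editor’s note: the "store every change, keep history immutable" pattern described in this answer can be illustrated with a minimal Python sketch. It is not AgileData’s implementation. The function names, the record layout, and the use of a SHA-256 fingerprint to detect changes are all assumptions made for illustration, and it only covers the snapshot case (not event streams).

```python
import hashlib
import json
from datetime import datetime, timezone


def row_hash(row: dict) -> str:
    """Stable fingerprint of a record's values (hypothetical change test)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_snapshot(history: list, snapshot: list, key: str, loaded_at: datetime) -> int:
    """Append new or changed records to the history; never update or delete."""
    # Latest known fingerprint per business key (history is append-only,
    # so later entries overwrite earlier ones in this lookup).
    latest = {}
    for entry in history:
        latest[entry["key"]] = entry["hash"]

    appended = 0
    for row in snapshot:
        fingerprint = row_hash(row)
        if latest.get(row[key]) != fingerprint:
            history.append(
                {"key": row[key], "hash": fingerprint, "loaded_at": loaded_at, "row": dict(row)}
            )
            latest[row[key]] = fingerprint
            appended += 1
    return appended


history = []
day1 = [{"customer_id": 1, "last_name": "Smith"}, {"customer_id": 2, "last_name": "Jones"}]
day2 = [{"customer_id": 1, "last_name": "Brown"}, {"customer_id": 2, "last_name": "Jones"}]

first = load_snapshot(history, day1, "customer_id", datetime(2022, 1, 1, tzinfo=timezone.utc))
second = load_snapshot(history, day2, "customer_id", datetime(2022, 1, 2, tzinfo=timezone.utc))
print(first, second, len(history))  # 2 1 3
```

The second snapshot only adds one row, the last-name change for customer 1, and the original "Smith" record stays in the history untouched.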
Tobias Macey: To your point of not forcing the analyst to understand, or have to go through the process of educating themselves on, these various different types of joins and how to find the appropriate data sources to be able to answer their question, I’m wondering, what are some of the thorny engineering problems that you’ve had to work around to be able to make the end user experience simple?
Shane Gibson: Every one of them is thorny, every time we go and do something that’s simple. And we’re hyper focused on removing the effort, or removing the clicks. So I’ll give you an example. We had a customer that we were doing a data migration for, that was the use case. And we had built the ability to bring in some data from something like Shopify and land it into what we call our history layer. So that’s the immutable layer that holds all data over all time, all changes. We really care about that layer. And then we create the rules that we use to populate a concept. So that’s a customer, that’s a product. We built all that out and I got really good at it. I could go and do a rule for that, and normally, in about two to three minutes, it was in production. And the time was spent looking at the history data, understanding where the key was, creating the rule, running the rule, going back for a quick look at the data, making sure that the tests had passed and the data looked right. So I got it down to three to five minutes. And I was like, that’s great. At the time, we had to write a rule from the landed data to the history data. So I had to manually go and do those rules. And you bring in Shopify data, that’s at least 25 tables, three minutes each. So I’m like, I’m a machine. For this data migration, though, we had an on-premise SQL Server database we had to bring in, and because we were migrating the data, we had to bring it all in. So there were just over 700 tables. Now, therefore, I had to go and create 700 history rules, at two to three minutes a rule. That was a couple of days of my life that I did not enjoy. So we went back and we said, how do we automate that? How do we drop in 700 tables and get those history tiles created automatically without us? And so we went and automated that. And so now that’s what happens. So every time we do something and it takes us time, or we have to make some really complex decisions, we just go back and think about it. How do we refactor that?
How do we make that simply magical? But that is hard. Every time we’re doing something as an engineer in data, there is a hard problem to be solved, and we can’t underestimate that.
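Editor’s note: the arithmetic behind "a couple of days" and the idea of generating history rules from a template can be sketched in a few lines of Python. The rule fields (`source`, `target`, `mode`), the `landing.`/`history.` naming, and the table names are hypothetical; the transcript does not describe what an AgileData rule actually looks like.

```python
def manual_effort_hours(table_count: int, minutes_per_rule: float) -> float:
    """Hours needed to hand-write one rule per table."""
    return table_count * minutes_per_rule / 60


def generate_history_rules(tables: list) -> list:
    """One landing-to-history rule per table, built from a single template."""
    return [
        {"source": f"landing.{name}", "target": f"history.{name}", "mode": "append_changes"}
        for name in tables
    ]


# Roughly 700 tables at two to three minutes per rule, as quoted above.
low = manual_effort_hours(700, 2)
high = manual_effort_hours(700, 3)

tables = [f"table_{i:03d}" for i in range(1, 701)]  # placeholder table names
rules = generate_history_rules(tables)

print(f"{low:.1f} to {high:.1f} hours by hand; {len(rules)} rules generated")
# 23.3 to 35.0 hours by hand; 700 rules generated
```

At 23 to 35 hours of repetitive work, the manual approach is already several working days, which is the point at which templating the rule pays for itself.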
Tobias Macey: In terms of the actual implementation of AgileData, I’m curious if you can dig into some of the architectural elements and some of the technical aspects of how you’ve designed the platform to be able to slot into an organization’s existing stack without having to reengineer everything around AgileData?
Shane Gibson: So the first thing we did was we had to pick a cloud platform. And we had to make a decision whether we were going to be multi-cloud or single cloud. And we decided that as a software-as-a-service product, which was our goal, our vision, we wanted to pick a single cloud platform. And in theory, the customer shouldn’t care. Actually, they do. It’s amazing how often the customer cares. But in theory, we’re just a software-as-a-service platform. As consultants, we’ve worked with Microsoft management for a long time. We’ve done what I’d now call the legacy data warehouse cloud platforms, the Redshift, the Microsoft PDW, and we knew we didn’t want to use those. We knew that they were great at the time, but they had been cloud-washed, and there was a whole lot of problems that we encountered. Vacuuming out your Redshift tables was always a nightmare. And at the time, this is three and a half years ago, Snowflake really was the up-and-comer. So we had planned to build our entire platform, our entire product, on Snowflake. At the time, I was coaching a data and analytics team for an organization, and they were starting to build out their platform. And they did a shootout between BigQuery and one of the other vendors. And I’d never seen BigQuery before. So I was lucky enough to sit through that whole process of evaluation. And at the end of it, I said to Nigel, look, I know we were going to go with Snowflake, but we probably need to give this BigQuery thing a bit of a bash, because it looks like Snowflake, sounds like Snowflake, but there’s something weird about it, it just seems a bit different. So we did a proof of capability around it. And we found that actually, there were some things in there that we really liked. So we started building out on BigQuery. And we got a whole lot of unintended consequences out of that, which saved us stacks of engineering effort. And what it was, was that by going into BigQuery, we went into the Google Cloud infrastructure, into their ecosystem.
And there are a bunch of services available in there that we keep using. We keep adding new things we need to our platform in the back end, which we leverage Google Cloud services for. And the majority of their services are serverless, which means the way we pay for it is amazing, and it’s pre-integrated. So when we pick up a new service, it just seems to work with all the other Google services. And the last thing is, they’re highly engineered. The amount of engineering thinking they put into their products is amazing. I mean, the marketing sucks, as a partner of theirs. And in my experience running a consulting company, the other cloud vendors and the other product companies are so much better to work with. But in terms of Google Cloud, the engineering is amazing. The downside is you have to be an engineer to use it. You have to be a Nigel, not a Shane, to configure a platform on Google. But that’s what we do: we build the easy bit on top. And so from my point of view, that decision around using Google Cloud had an unintended consequence, and we got a massive amount of benefit out of that. If we then think about our platform, we made some really big choices up front. So serverless only: we would only ever use serverless capabilities unless we had no choice. So no containers, no Kubernetes. And that’s had a massive amount of benefit for us. API everything: so we have this idea of a config database that holds all the core logic for the data, that’s fronted by a set of APIs, and our app actually sits on top of the APIs. Those APIs are secure, but open. So we’re really intrigued about whether customers will actually start using the API and not the app at some stage. And that’s had some massive benefits for us as well, decoupling the data from the config, the config from the API, and the API from the app. That’s kind of standard software engineering for everybody else.
But Nigel and I aren’t software engineers. It’s not our background, we came from the data space. Then the next thing for us was volume: that tradeoff between over-engineering everything to take a volume of users or a volume of data, versus getting something to work and then dealing with the volume problem when it hit us. And it comes down to that idea of knowing where your data is, knowing that you’re going to have a problem. So for example, when we were bringing on Shopify data for one of our first customers, the numbers were tiny. When we started doing event data from a customer in a clean room use case, we were getting somewhere between 40 million and 3 billion row changes or row transformations a day. We had to refactor some of our stuff to (a) keep the cost down, and (b) just make that thing bulletproof. But we knew we had to do it. We just waited until it was a problem. And then we went into hyper engineering mode to solve it. We had already designed the patterns in our heads. We’d done what we call agile architecture. So we’d done some pictures and some Miro boards about how we would solve it, as a guess. And then once we had that problem, we ripped into saying, will that solve it? Do we need to iterate it? So that idea of incrementally building it out just in time, but still doing agile architecture, still figuring out that we’ve got this problem coming up in six months if we’re successful, and what we would do, and not backing ourselves into a corner where refactoring it was horrendous.
Tobias Macey: As far as the overall integration path for using AgileData, what are some of the pieces of the technology stack that an organization needs to have in place to be able to take best advantage of what AgileData provides? And what are the options for extending the set of integrations that AgileData will be able to work with?
Shane Gibson: So if we think about it as left and right: left being data collection and right being data consumption. So the first thing is we collect the data and we store it. So we’re not a virtualized warehouse. So that’s the first thing for our customers. If they’re uncomfortable with us holding the data securely for them, then we’re not the right solution. And then we looked on the left, around data collection. And I thought that was a solved problem when I started this. It’s a solved problem, we don’t need to deal with it. And I was wrong. So we have built out collection patterns, we call them, as we get new customers that have new problems. So there are a set of patterns around software-as-a-service apps. If you’ve got Shopify, Xero, QuickBooks, Salesforce, something that has an API, there’s a bunch of patterns for going and grabbing that API data, bringing it back and landing it into our history layer. The next one was we had a customer that wanted file drop. They had a bunch of ad hoc type data, but it was repeatable, if that makes sense. So there was no system behind it, but we needed to get their data. And so we built a file drop capability, where they can go and upload the data, CSV or Excel or JSON, and that’s manual, just a file upload, and then we take care of it from there. The next one was we had a customer that wouldn’t let us go in and actually touch their systems. So they wanted to push the data to us, and it was event data. So they said, we want effectively a demilitarized zone, and we’re going to connect to you, and we’re going to push the data to you when we feel like it. So we had to build that pattern. Now, how do we actually have automated file drops? How do we do it based on event data? How do we trigger on the fact that the files turned up? How do we deal with the fact that they’re always going to give us data they’ve already given us? Because no matter how much it’s automated, eventually you get that file again.
Or you get a file with overlap, where half the data we’ve seen before and half we haven’t. How do we engineer for that? Then we had one where the source was on premise. So how do we actually go into an on-premise database and pick that data up and bring it back? And then we had one which was a clean room. How do we actually have, in this case, 38 different companies, each one of them using a different collection mechanism to give us the data, and then make sure all of that comes in on a consistent basis every day so we can mash it up and provide that single view? So again, I thought data collection was a solved problem and I was wrong. So that’s the left, and then the right-hand side is data consumption. So we don’t do that last mile. We looked at it, and we said, if you look at that last mile space, visualization, dashboards, natural language queries, analytics, machine learning, the table stakes to play in that would be three to five years of our lives. And we’re bootstrapped, not venture funded. So we’re very focused on where we spend our time. So for us, one of the benefits BigQuery gave us was that 9 out of 10 data consumption tools, those last mile tools, talk to BigQuery. So as long as your tool can talk to BigQuery, you can use the data that we make consumable for you. We have an API layer. So in theory, the last mile tool can consume that data via API. But what intrigues me at the moment is I struggle to find a last mile tool that queries data via APIs that isn’t following a data warehouse pattern, that doesn’t take in that data and then store it again to make it consumable. And so I’m really intrigued to see whether that’s going to change, whether the market is going to move to have last mile tools that are truly API focused.
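Editor’s note: the two file drop problems raised in this answer, the same file arriving twice and a file that partly overlaps data already received, can be handled by making the load idempotent. The sketch below is one common way to do that and is not taken from AgileData’s product. The `FileDropLoader` class, the JSON file format, and the content-hash approach are all assumptions for illustration.

```python
import hashlib
import json


def record_id(record: dict) -> str:
    """Content fingerprint used to recognise a record we have seen before."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FileDropLoader:
    """Hypothetical loader that tolerates repeated and overlapping files."""

    def __init__(self):
        self.seen_files = set()
        self.seen_records = set()
        self.accepted = []

    def load(self, file_bytes: bytes) -> int:
        """Return how many records from this file were actually new."""
        file_id = hashlib.sha256(file_bytes).hexdigest()
        if file_id in self.seen_files:
            return 0  # exact same file delivered again: skip it entirely
        self.seen_files.add(file_id)

        new_rows = 0
        for record in json.loads(file_bytes):
            rid = record_id(record)
            if rid not in self.seen_records:  # drop the overlapping part
                self.seen_records.add(rid)
                self.accepted.append(record)
                new_rows += 1
        return new_rows


file_a = json.dumps([{"event_id": 1, "type": "order"}, {"event_id": 2, "type": "order"}]).encode()
file_b = json.dumps([{"event_id": 2, "type": "order"}, {"event_id": 3, "type": "refund"}]).encode()

loader = FileDropLoader()
print(loader.load(file_a), loader.load(file_a), loader.load(file_b))  # 2 0 1
```

A production version would persist the seen-file and seen-record sets rather than hold them in memory, but the shape of the check is the same.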
Tobias Macey: It’s definitely an interesting space. And there actually has been some growth there with things like the embedded analytics use case. Some of the companies operating there that come to mind are things like Cube.js, or I think they’ve been renamed to just Cube so that they’re not tied to the JavaScript aspect. I think maybe GoodData and Sisense, I know, are also focused on that embedded analytics case. And then there are projects like Tinybird that build on top of ClickHouse, where you can use their data as an API, so you can embed it into some product. So definitely an interesting aspect of it as well. And to your point of building on top of BigQuery, I’m curious if you’ve had to construct any sort of custom caching layers to be able to pre-compute or pre-aggregate requests that users are making frequently, so that you don’t have to go back to BigQuery every time and pay either the latency costs or the extra query costs all over again because somebody happened to query it two times, five minutes apart.
Shane Gibson: So that’s a really interesting space right now. So we’re waiting for the metric or semantic service from the Looker ML core at Google, because we need that at some stage soon. But till now, it hasn’t been a problem. So what do we do? So effectively, the data comes into our product, into the history layer. The history layer looks like the source system, but it’s time series: all changes over all time, and that’s available. We never lose that data. And then we have an event layer, which is modeled. It has the idea of concepts, details and events, and that’s in the middle. So it’s a typical three-tier architecture. And then our last layer is our consumed layer. And we denormalize it, we actually have big wide tables. Now, we can do star schemas. We can actually dimensionalize the data, we’ve just never had to. Now, the reason we do those big wide tables is that when you actually talk to an analyst, they just want to query a table. They can do joins, but they don’t want to, and why should they have to? So effectively, we give them big wide tables. And the good thing about BigQuery is that it just eats it. In terms of volume, we’ve got some consumable tables that are hitting 100 million, 200 million rows now. And BigQuery just eats it. The latency of response is seconds, it’s not sub-second. There’s a brilliant analyst out there in Twitter world called Murmur who’s doing lots of really cool research on his own time around latency and things like DuckDB and those cool things. And if anybody wants to see somebody doing some cool stuff and sharing, go follow him. But we’re not dealing with that sub-second response use case. So most people are willing to wait a second or two for that data to come back. And most last mile tools actually have a whole problem of getting the data back and visualizing it. There’s a delay there anyway. What we found is we actually introduced BI Engine into the architecture for us. So with BigQuery, there is this idea.
So if you query a table, the result of that query gets persisted for 24 hours, and if you run the same query again you don’t pay for it. But if you watch what a user does when they use a dashboard or report, they’re filtering. They’re constantly hitting filters, and therefore those cache hits don’t happen. So we’ve introduced BI Engine in the middle. And that now keeps our costs down, because it’s effectively doing an in-memory OLAP cube. It’s rudimentary at the moment in terms of the way it deals with it. But it’s good enough for us right now. Now, as soon as they bring in the Looker ML core, that semantic layer, that metric layer, we’re all over that. We’ve done all the agile architecture for it, we’ve done the UI design work for it, we just need that service right now. Because we want that extra layer to give us some benefits, and to bring the metric out of the last mile tool back into us. So that’s a core feature that we need to save some of the complexity. Right now BigQuery just eats it. And it really is kind of amazing how lazy you can be with some of the things that we used to worry about 10 years ago, when our databases couldn’t handle it or they were too expensive. That’s one of my learnings from the last couple of years.
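Editor’s note: the reason an exact-match result cache stops helping once users start clicking filters, while a pre-aggregated in-memory summary keeps helping, can be shown with a toy Python model. This is a conceptual illustration only. It does not use or describe the internals of BigQuery’s cached results or BI Engine, and the class names and sample rows are invented.

```python
from collections import defaultdict

ROWS = [
    {"country": "NZ", "month": "2022-01", "amount": 10},
    {"country": "NZ", "month": "2022-02", "amount": 20},
    {"country": "AU", "month": "2022-01", "amount": 5},
    {"country": "AU", "month": "2022-02", "amount": 15},
]


class ExactResultCache:
    """Reuses a result only when the identical query is repeated."""

    def __init__(self, rows):
        self.rows = rows
        self.cache = {}
        self.table_scans = 0

    def total(self, **filters):
        cache_key = tuple(sorted(filters.items()))
        if cache_key not in self.cache:
            self.table_scans += 1  # cache miss: scan the base table again
            self.cache[cache_key] = sum(
                r["amount"] for r in self.rows
                if all(r[col] == val for col, val in filters.items())
            )
        return self.cache[cache_key]


class PreAggregate:
    """Scans the base table once, then answers filters from a small summary."""

    def __init__(self, rows, dimensions):
        self.dimensions = dimensions
        self.table_scans = 1
        self.cells = defaultdict(int)
        for r in rows:
            self.cells[tuple(r[d] for d in dimensions)] += r["amount"]

    def total(self, **filters):
        return sum(
            amount for cell, amount in self.cells.items()
            if all(cell[self.dimensions.index(col)] == val for col, val in filters.items())
        )


exact = ExactResultCache(ROWS)
cube = PreAggregate(ROWS, ["country", "month"])

# A user clicking through dashboard filters; only the last click repeats a query.
clicks = [{"country": "NZ"}, {"country": "AU"}, {"month": "2022-01"}, {"country": "NZ"}]
for filters in clicks:
    assert exact.total(**filters) == cube.total(**filters)

print(exact.table_scans, cube.table_scans)  # 3 1
```

Every new filter combination is a miss for the exact-match cache, so three of the four clicks go back to the base table, whereas the summary is built once and serves all four.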
Tobias Macey: I brought up the question about the caching and pre-aggregates because a number of years ago, I think it was maybe six or seven years ago at this point, I was involved in a project where I was trying to generate user-facing analytics off of data that we were loading into BigQuery. And the multiple-second response times were a non-starter for that purpose. So I actually had to pre-aggregate the data out of BigQuery into Postgres, so that it could act as sort of the OLAP layer. At the time, I wasn’t yet experienced enough to really be able to think quite in those terms and design it effectively. But I got it to work.
Shane Gibson: Because we’ve been around the market for a long time, we remember the OLAP, MOLAP, ROLAP days, and the horrible disconnect we had between our relational data warehouse and our OLAP cube, and making sure that was synced and refreshed. And the other thing is, as I said, we always do an agile architecture for the things we worry about. We knew we had to worry about performance, we knew we had to worry about cost. And so we have a whole lot of what we call levers: a whole lot of patterns that we haven’t implemented yet that we know we will need to at some stage. So things like materialized views. At some stage, we know we’re going to need to bring those in and use them. Early in the Google product lifecycle, they’re fairly basic compared to the materialized views that we used to have in the Oracle days, or the Teradata equivalent. So we won’t invest in that layer until we really, really need it, because we get the benefit from Google of all the engineering without our having to build it. The number of times we’ve found a feature from Google turn up just in time, so that we didn’t have to build it ourselves, has been amazing. There’s been a couple of examples where some early stuff came out of Google that we really wanted and that didn’t make it into production. And we’re like, we really want that feature, it would save us money or save us time or be awesome for our audience, our users. And then it just didn’t make it, right? So we’ve got to be really careful what we invest in. Another area, in terms of education, that we’re keeping a really close eye on, and that we’re going to have to make a decision on at some stage, is DuckDB: this idea of taking a subset of that data out as WASM into the browser, and using that to provide the immediate response to our users in our app. We think that is an area we’ll probably invest in. Because of that feedback of being able to see: here’s a piece of data, here’s my rule for transforming it.
Here’s the impact of that change, without clicking a button and waiting for that query to run. We think that immediate feedback to an analyst, in terms of designing the rules that transform data, will be magical as well. But we’ve just got to decide when we’re going to build that out, or whether Google’s gonna give us a service that does it for us.
Ad 00:34:07 – 00:34:54
Tobias Macey: Another interesting aspect of what you’re discussing is the question of data modeling for that end user analyst, and also some of the ways that you think about the semantics and documentation of the attributes and the columns that are available in the modeled tables. I’m wondering if you can talk to how you think about that when you’re building this as a service and you just want to be able to say to the analyst: pull in the data that you want, write the queries that you want, we’ll worry about the rest.
Shane Gibson: So the first thing is, we think documentation is something that should happen, but it should happen at the right time, and it should be actionable. One of the first sets of features we built out was a data catalog, because we believe a data catalog should be embedded in every product. This idea of being able to see what the data looks like, what the profile is, have context in terms of notes against it, see the lineage and the rules. Every data product in the world should have that bundled, and it should be table stakes. You still need an enterprise one if you’ve got multiple platforms and multiple products and you want to combine those together. But every product should just do this as table stakes. Looking back, I probably over-engineered it. It was a personal one of mine; I thought it had massive value, but I probably added a whole lot of features that I don’t think will ever get used. And so again, with my product hat on, it’s gonna be really interesting to see whether I’m brave enough to remove them, and time will tell. But what we know is documentation should be done at the time we’re doing the work, because we never go back and redo it. So if I’m creating a rule, and I need some context about what that rule is doing, then we should create documentation. And what we did was, as we built out our rules engine and the interface for it, we did a whole lot of prototypes. I had an idea in my head of what we were going to build, and it wasn’t what we built. What we actually ended up building was almost notebooks, almost Python or Jupyter notebooks, in that there’s a line at the top that has the data coming in; we call that “given”. So given this data, and then there’s a bunch of “and” statements: and I do this, and that. So here’s the data coming in.
So let’s say I’ve got a table that’s got a party entity in it, a history table that’s got employees, suppliers and customers all in the same table with a type field from the source system, because the software engineers found that the most effective way to build out their system. So in the rules engine, you’ll effectively go in and say: okay, given that table, and the type equals customer, there’s this filter, and maybe the customer has had a transaction in the last 30 days, and the customer is not deceased, then create a flag called active customer as a piece of detail. And that idea of actually just adding rows, like a notebook, we found has been particularly successful. And then what we saw was that, while doing that, we need to put some notes against it. And why do we? Well, because what happened was I’d do a piece of work for a customer, I’d create a rule, and then I’d go do some work for another customer. And six months later, the first customer asks me to make a change. So I go back into that rule and I look at it. And it’s very simple to understand: given this, and then, so I know what I’ve done. But I don’t know why. Why am I filtering out that record? There’s got to be a reason that I filter that record out. I wouldn’t do it if I didn’t need to, because I’m lazy. But what happens if I remove it? So what we found is that putting notes at each line means I can write a really simple statement: filter this one out because there was a problem with the source system six months ago, therefore I’ve got to remove those records. Or: change this one because, for some reason, the type coding was out of sync for a little while, and the easiest way to clean that up is to do it inline. And so by putting in those little notes, I get the documentation I need at the right time to take the action that I want. So what we found is those in-place notes are really valuable. The data catalog, I thought it had high value; time will tell.
But every time we find an action where documentation would have been useful, we then go and figure out how we add that documentation. One that we’ve got in the back end, which was really interesting for me, is that we version our rules by default. So it’s effectively Git behavior. If you go in and change a rule, you have no choice: the previous version of that rule is stored for you. There’s no Git pushy-pully kind of stuff. It’s just in the background: every time you modify something, the previous version is stored, because that’s good practice. What I don’t have is a box to say why I’ve made that change. So we’re going to add that piece of documentation, because I’ll go back in three months and go, I’ve got two versions of the rule; why did I version that rule? Again, it’s not until you use it for a while that you go, that piece of documentation would be highly valuable. I’d type it in, as it’s effectively my Git note. And so we have to go back and add it. For me, that’s what documentation is about. How do we add it at the right time, so that it is actionable and is actually going to be used? Otherwise, we’re not going to fill those boxes out and they’ll be blank.
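Editor’s note: the “given ... and ... and ...” rule shape, the per-line notes and the automatic versioning described in this answer can be sketched roughly as follows. This is a minimal illustration, not AgileData’s rules engine; `Step`, `Rule`, `change_steps` and the sample rows are hypothetical names and data invented for this note.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable

TODAY = date(2023, 1, 31)  # fixed date so the example is repeatable


@dataclass
class Step:
    description: str                   # the "and ..." line an analyst reads
    predicate: Callable[[dict], bool]  # the check applied to each row
    note: str = ""                     # why this line exists


@dataclass
class Rule:
    given: str                         # name of the history table the rule reads
    steps: list
    output_flag: str
    versions: list = field(default_factory=list)

    def apply(self, rows):
        """Add the output flag to every row; True only if all steps pass."""
        return [
            {**row, self.output_flag: all(s.predicate(row) for s in self.steps)}
            for row in rows
        ]

    def change_steps(self, new_steps, reason=""):
        """Store the previous version automatically, with the reason for the change."""
        self.versions.append((list(self.steps), reason))
        self.steps = list(new_steps)


rule = Rule(
    given="history.party",
    output_flag="active_customer",
    steps=[
        Step("type equals customer", lambda r: r["type"] == "customer",
             note="Table also holds employees and suppliers."),
        Step("transaction in the last 30 days",
             lambda r: TODAY - r["last_transaction"] <= timedelta(days=30)),
        Step("customer is not deceased", lambda r: not r["deceased"]),
    ],
)

rows = [
    {"id": 1, "type": "customer", "last_transaction": date(2023, 1, 20), "deceased": False},
    {"id": 2, "type": "supplier", "last_transaction": date(2023, 1, 25), "deceased": False},
    {"id": 3, "type": "customer", "last_transaction": date(2022, 10, 1), "deceased": False},
    {"id": 4, "type": "customer", "last_transaction": date(2023, 1, 30), "deceased": True},
]
flags = {r["id"]: r["active_customer"] for r in rule.apply(rows)}
print(flags)

# Changing the rule keeps the old version plus the "why" that the answer says is missing today.
rule.change_steps(rule.steps[:2], reason="Deceased flag is unreliable in the source system.")
```

Keeping the note next to each step, and the reason next to each version, puts the documentation at the point where the later change will be made.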
Tobias Macey: And you talked a little bit about the kind of discovery aspect with the data catalog question. And I’m wondering what are some of the benefits that you’ve seen of biasing towards the wide table in terms of the data discovery question and some of the pieces of feedback or challenges that your customers have run into as far as being able to understand what pieces of information are available to be able to factor into their analyses, and then digging more into that semantic aspect of understanding this column looks like it does what I want it to do. How do I understand that it really does?
Shane Gibson: So let’s talk about the cool stuff, and then let’s talk about the hard stuff. One of the cool things, and it was a surprise to us how often we use it now, is that Google introduced in BigQuery the ability to search a set of data. So you can go into a table, you can write a query and say, here’s a word. Let’s say, in the example I use in our demo, I’ve got a table of every car in New Zealand from our Transport Agency. Every time a car is registered, they publish the details of that car. So there’s just 4 or 5 million rows in there, one for every car. I can go in, and I’m a big fan of Minis. And in the Mini world, there’s a race version called John Cooper Works, or JCW. So within BigQuery, I can go in and say, search this table, this history table of cars, for the word JCW, and it will search every column and every row and bring back a subset where it has that set of letters. So we put that into our app because it was a freebie. Well, it’s a cool demo: go and search. What we didn’t realize is how often we’d use it. An example would be when we first work with a customer, and we’re building out a business rule and that rule is reliant on a piece of data. So filter or query this, subset by this, or calculate based on this field. And they use a term which is the word in the app, but that’s not the word that’s in the data. Under the covers, engineers have given a completely different name to that field. And so we’ve got to be able to search to find that piece of data. What we used to do was say, give us a screenshot, and then we’d go and look for it, and then we’d talk to the engineers and say, you see that field there? What’s it called in the database? What we can do now is say, give us an example of that data. And then we go into the history table, we search for it, and it comes back with those rows, and we go, we think it’s that column. That saved us so much time. Another example is validation.
A good example is where a customer goes, I’m looking for this order number, blah, blah, blah, and I can’t find it. So we just go into the catalog, and we either start at the left or start at the right. If you go into the consume table and search for it and it’s not there, damn. Let’s go into history and search for it. If it’s there, it’s a rule problem: the data’s turned up, but my rules are excluding it. And then you go into the rule and find out why that one’s been excluded. So that ability to search actual data, we’ve used it time and time again, and in ways that we didn’t think about. So there’s some cool stuff. That has value. One of the problems we still have is the last mile tools. It’s still incredibly difficult to be in a last mile tool, a dashboard or report or analytical tool, and see the context of the data. And we haven’t solved that problem. At the moment, you have to go back into the catalog, you have to know which consume table you’re using, then you have to go search for it and say, what’s that field, and then look up the config and say, what’s the logic? And so bundling that back-and-forth process would be great. Now we’re seeing a lot of the last mile tools start to do that with add-ons to the browser, which is kind of interesting. It’s almost like the browser’s taking care of the linkage between your metadata app, your catalog, and your BI tool, your last mile tool. So we’re keeping an eye on that space to see if that’s where we should play, and whether we should open ourselves up to that kind of pattern. If you talk to an analyst, it’s what they want. I’m in this report, I can see a number. Just tell me how you gave it to me. Don’t make me go away somewhere else and find it with 600 other clicks. But I don’t think that’s a solved problem yet.
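Editor’s note: the “search every column and every row for a value” idea from this answer can be approximated in a few lines. This is a minimal sketch using a plain case-insensitive substring scan over in-memory rows; it is a stand-in for, not a copy of, BigQuery’s search capability, and `search_rows` and the sample car rows are hypothetical.

```python
def search_rows(rows, term):
    """Return (row, matching_columns) for every row where any column contains the term."""
    needle = term.casefold()
    matches = []
    for row in rows:
        columns = [
            col for col, value in row.items()
            if value is not None and needle in str(value).casefold()
        ]
        if columns:
            matches.append((row, columns))
    return matches


cars = [
    {"plate": "ABC123", "make": "MINI", "model": "Hatch", "submodel": "JCW"},
    {"plate": "XYZ789", "make": "MINI", "model": "Countryman", "submodel": None},
    {"plate": "QRS456", "make": "TOYOTA", "model": "Corolla", "submodel": "GX"},
]

for row, columns in search_rows(cars, "jcw"):
    # Reporting the matching column is what answers "what is that field called in the data?"
    print(row["plate"], columns)
```

Returning the column names alongside the rows mirrors the two uses described above: finding which physical field holds a business term, and checking whether a specific value has arrived in the history layer but not in the consume layer.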
Tobias Macey: For people who are interested in being able to adopt the AgileData platform, I’m curious what the onboarding experience looks like. And some of the ways that you have iterated on that process to make it easier for people to be able to adopt and adapt to AgileData as a way of being able to build their analytical use cases?
Shane Gibson: So, where we are on the journey: we gave ourselves seven years to be an overnight success. Our first three and a half years have been focused on that buyer, solving a business person’s problem of getting access to their data. And being a technologist, especially having worked for big software companies before, where I was doing what was called pre-sales, which was systems engineering, or customer experience I think it’s called now, where my whole job was to work with the salesperson to present the product to the customer so they said, yes, that looks like it’s going to solve our problem, can we buy it, please? I naturally want to go into product demos. And so that was one of the mistakes I made really early on: I was talking to somebody who had a business problem, and I wanted to show them the cool features we’d built. And they’d turn around to me and go, why should I care? That’s not what I’m buying. What I’m buying is you solving my data problem. I’ll give you some money. You’ll get my data, you’ll give me some information. I can make some business decisions. I don’t care how you do it, just make it go away. So that whole messaging has been really key to us: we’ve actually got a bundle of both our platform and our services, as a fixed monthly fee, to just make that problem go away. The second half of our seven years, though, is focused on the analysts, right, and that’s software as a service. And that’s what we’re working on at the moment: how do we actually onboard them and get them to be able to come on and use the software-as-a-service platform, or product, to do the work they need to do without having to be data experts? Now, they have to be data savvy. They have to understand how to troubleshoot and what data means, and what a concept for data is, what a customer versus a product is. They have to understand that, because all data is complex.
But how do we onboard them so that they can just do it without us? That’s what we’re working on right now, and that’s a hard problem.
Tobias Macey: Absolutely.
Shane Gibson: I look at some of the products I use every day. And I’m like, I don’t have to go on a course. I don’t have to watch YouTube videos. 9 times out of 10, I just go in and use it. So how do we do that for data management? We haven’t solved that problem yet, but we’re spending a hell of a lot of time trying.
Tobias Macey: Absolutely. And so in your experience of building this platform and working with your customers and helping to make that challenge of analytical use cases tractable without having to have an entire data team to be able to support those end users, what are some of the most interesting or innovative or unexpected ways that you’ve seen the AgileData platform used?
Shane Gibson: So for us right now, it’s been the use cases that we’ve delivered. I used to have a joke in my previous world that the two projects you never wanted to do were data migration and payroll. Because what would happen is, if you were successful, it just worked. The data was migrated and all turned up, or the payroll got turned on and everybody got paid. And so success was like, ah, and everything else was worse. Data didn’t turn up, we lost it, payroll didn’t pay you, and people really cared. One of our first projects was a data migration project. The customer was moving from their legacy on-prem system to a new cloudy thing. They had a vendor partner that was putting in the new cloud software as a service. The vendor knew how hard it was to do data migration, not because they didn’t understand their problem but because they didn’t understand the core bespoke system that every customer had. And they had to do all that hard work. They knew that it was a money sink, and that it also distracted the implementation people, who just wanted to focus on new ways of working and getting the benefit out of the new system. So what we did was we became the middle person. If you think about core business processes, who does what, customer orders product, customer pays for order, within an organization that never changes unless they fundamentally change and pivot their source of business. Their systems change. So what we did was, we took the source data, we brought it in, and we mapped out the core business processes: customer orders product, customer makes payment. We then said to the new vendor, are those processes fundamentally changing? No; lots of admin change, status change, workflow change, but those core processes exist. So here they are. How do you want the data? We’d like to consume from an API. Would it be good if we gave you APIs that match your input, your schema for importing migrated data, because then we don’t have to map it?
We exposed that data via APIs that matched what they had. Now, really interesting: we had a contract in place with them, which said, whenever you find a problem, you’ll tell us about it and we’ll iterate either our rules for the core business processes or the APIs. So you tell us that something’s not right, we’ll go and fix it really quickly, and you’ll go and hit it again and tell us we got it right. Naturally, what people do is they want to do it themselves. So they found some problems, and they built some cleanup scripts on their side and didn’t tell us. And then, as we were reconciling, things weren’t reconciling. Well, we’d reconciled from source to that consumer API; we knew the data matched. But when we went into the app, the new software as a service, the data was skewed and we’d lost records. And then we found out they were cleaning it up, and they were cleaning it up wrong. So we just said, we’re going to go and solve that problem of why you need to clean up. We’ll push it back to us; you just hit the API. And by doing that, we saved a whole lot of problems around reconciliation and conversations, because we knew where it was breaking and we could go and fix it. So: never going to do a data migration, and one of the first things we did was that. The other one is this idea of a data clean room. This idea of data being secured. So a customer had 38 different people that needed to give them data. Some of those organizations had data teams, some didn’t. We needed to combine it and make it safe, so that our customer could only see the subset of the information that they’re allowed to see. Some of the larger organizations would filter and clean the data before they sent it to us, so we only saw the data our customer wanted to see. For some of the other providers, though, it was easier for them just to give us all the data and for us to apply those rules. So that’s what we did. They give us every event and then we cut it down, and then our customer can only see the stuff they’re allowed to see.
And then you go, now what happens? Well, this is kind of an event viewing use case. Some of the providers are using Google Tag Manager, some are using Google Analytics, some are using Adobe. Then we had to go into the Insta, Facebook, Twitter events, then we had to go into podcast events, then we had to go into connected TV events. And if you want to see complexity, go and talk to people that actually have video on demand on TVs, and see the problems they have. Because what I didn’t realize was that every TV provider, Samsung, Sony, uses a different SDK for tracking the app usage on their TVs. And they often use different SDKs for different models of TVs. If you think about the complexity of just capturing who viewed what, when you’re reliant on the device provider and not your own app, you’re not in control. There’s a whole lot of complexity there. So that data clean room was going to be simple, but there was a massive amount of complexity every time we touched a new set of data, which comes back to my point that data collection is not a solved problem. I thought it was. I thought Fivetran and Stitch Data had rocked it and solved it for us. In my experience, every time you get a new data source, there’s a new problem that’s gonna turn up and bite you in the bum.
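Editor’s note: the source-to-API reconciliation described earlier in this answer (knowing the source matched the consumer API, then discovering records lost further downstream) comes down to comparing keys between two stages. This is a minimal sketch under the assumption that each record carries a stable key; `reconcile`, `order_id` and the sample records are hypothetical.

```python
def reconcile(upstream_rows, downstream_rows, key):
    """Compare two stages of a pipeline by key and report what went missing or appeared."""
    upstream_keys = {row[key] for row in upstream_rows}
    downstream_keys = {row[key] for row in downstream_rows}
    return {
        "upstream_count": len(upstream_keys),
        "downstream_count": len(downstream_keys),
        "missing_downstream": sorted(upstream_keys - downstream_keys),
        "unexpected_downstream": sorted(downstream_keys - upstream_keys),
    }


source = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
api_output = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
new_app = [{"order_id": 1}, {"order_id": 3}]  # a cleanup script dropped order 2

print(reconcile(source, api_output, "order_id"))  # source to API: clean
print(reconcile(api_output, new_app, "order_id"))  # API to new app: order 2 missing
```

Running the same check at each hop shows where a record disappears, which is how an undeclared cleanup step on the receiving side would show up.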
Tobias Macey: Absolutely. And in your experience of bootstrapping this platform, and building it out and exploring this overall space of building a data analytics back end as a service, what are some of the most interesting or unexpected or challenging lessons that you’ve learned in the process?
Shane Gibson: Often, we do things as humans without thinking about it, and it looks simple. And when we try and codify it into a platform, it becomes hard. In the previous podcast, I talked about that idea, the idea of tiered unification: understanding, when a table comes in or a piece of data comes in, where the unique key is, to say that’s a concept. There’s one we’re working on right now which is really interesting. It’s around this idea, again, of removing clicks. So the way I see it, when I do something for a customer in our product and I have to click a few times, or it takes me a while, we want to automate that. Where I have to think and make a conscious decision, we want the system to make a recommendation on how to deal with it. As I said, we’re a three-tier architecture: data comes into the history layer, we then model it lightly as concepts, details and events, and then we make it consumable. And so when you create a concept or detail, say you’ve got a concept of customer and some detail about them, we automatically create that consume table, that big denormalized table, on demand. You don’t have to do anything; it just brings back the detail and the concept, denormalizes that out and makes it available for query. When you create an event, customer orders product, we automatically go and pick up the customer, the product, the orders, all the details, and denormalize that out to a consumable event table, where you can see every order, every customer and every product that was ordered at that time, and that’s all automatic. I don’t have to do anything for that. But I still have to do the rules in the middle. I still have to say, for this historical data, that’s a customer, that’s the name. And that feedback loop was a little bit time consuming, because I get some data coming in and I can profile it in the history layer. Within our catalog, you can profile it and see the shape of the data and the nulls in there. But I naturally wanted to go and just query it.
I wanted to go and do some draggy-droppy stuff sometimes to get a little bit more context around it, and we didn’t want to build that in the product. So what we’re just finishing off now is that when you land a piece of data, you have a choice for auto rules. What it will do is go and grab that data, have a guess at what the key is, and ask you whether it is or not. And if it is, you go yes, and then it will build out all the rules in the design stage and create that consumable table for you. Now, we’re not a great fan of source-specific data models; we think you should always model your business processes, not the way your system works. And so the idea of conformed concepts, a single view of customer, a single view of product across multiple systems, we think is highly valuable. But doing that big design up front actually sometimes doesn’t have value. You want to do it later. So this idea of auto rules almost gives us a prototyping capability. We think you land the data, it’s still going to lightly model it for you, so it’s lightly designed; you then make it consumable, and then you’re going to use it for a while, and then you’re going to go back and refactor. And we think that prototype process, drop it, use it, is still lightly designed; there are still rules being created, so it’s not ad hoc. And then you’re gonna go and refactor your data design in the future. We think that lifecycle may be one that has value. But as with all things agile: build it quickly, get it out, see what happens. If we stop using it, then it didn’t have any value and we should burn it down, which is hard.
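Editor’s note: the “have a guess at what the key is and ask you” step of auto rules can be sketched with a simple heuristic. This is a minimal illustration, assuming a candidate key is any column that is never null and never repeats in the landed sample; it is not AgileData’s actual algorithm, and `guess_key_columns` and the sample rows are hypothetical.

```python
def guess_key_columns(rows):
    """Return the columns whose values are all non-null and unique across the rows."""
    if not rows:
        return []
    candidates = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        has_no_nulls = all(value is not None for value in values)
        is_unique = len(set(values)) == len(values)
        if has_no_nulls and is_unique:
            candidates.append(column)
    return candidates


landed = [
    {"customer_id": 101, "name": "Alex", "email": "alex@example.com"},
    {"customer_id": 102, "name": "Sam", "email": None},
    {"customer_id": 103, "name": "Alex", "email": "alex2@example.com"},
]

# The guess is only a recommendation; a person still confirms it before rules are generated.
print(guess_key_columns(landed))
```

A sample can make a non-key column look unique by accident, which is one reason to present the result as a question to the analyst rather than apply it silently.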
Tobias Macey: Keying off of that doing it agile statement. One of the hallmarks of Agile is that you do things quickly and iterate rapidly so that you can identify the mistakes early rather than waiting until they become more expensive. And I’m curious, what are some of the most interesting or informative mistakes that you’ve made in the process of building AgileData, and some of the useful lessons that you’ve learned in the process.
Shane Gibson: So there’s been some that we planned, where we knew we had some technical debt. Last time I talked about the idea of our config being in BigQuery. We knew that we had to iterate on that. We did the NoSQL move to Datastore, because we thought we wanted to be cool. And we ended up throwing all that work away and refactoring it into Spanner, which is now a foundational piece for us. And we won’t change that unless we really have to. So that was kind of an investment mistake, where we knew we had to make a change, we made a wrong bet, we lost that work, and that was okay. Then there are mistakes in the product: things we’ve built that I don’t think we’ll ever use. When we built out the data catalog, I really wanted a review: thumbs up, thumbs down. I thought it would be valuable. We profile the data, so those stats are available, and we have context that you can add against the data about the history of where the data came from, why you should care, any gotchas. But that review process, the Yelp-style voting of it’s good or it’s not: we built that out, it didn’t take us long, and I’m not sure it’s ever going to get used. I mean, we monitor it. So we’ll see; if that never gets used, we’ll probably take it off the screen. The next one is we built out a whole lot of what we call trust rules, so the data quality rules or data test rules. When you’re building out your change rules, make this into a concept or detail, a screen pops up and says, here’s some rules you can apply: it’s not null, it’s unique, it’s a phone number, those kinds of simple data quality tests. And that information is all displayed in the catalog. So you can go into a tile and you can see all the trust rules that ran and whether they passed or whether they failed. So that’s been successful from that point of view, but it’s not actionable. What we haven’t built, but actually needed, was what happens when it fails. So that’s what we’re going to do now. Now, whether we have to refactor the trust rules or not, we don’t know.
But we have to solve the next problem. Something failed. Do I really want to have to go and check it, or do I want to be notified of it? If I’m notified of it, and I get 600 failures for different things, is there a classification of: these ones are important, these ones aren’t? Should I be able to mute them? What’s the actual action I want to take? And so that’s one of the lessons we’ve learned, and it’s a product lesson as much as an agile or a data lesson: when this feature turns up, what action do we expect an analyst to take with that feature? What’s the value in it? And it took me a while to kind of flip the model to think about that first. But having said that, there’s a bunch of vanity features that I really want, and it’s really hard to beat yourself up and say, “No, we’re not going to build them next, because we probably don’t have value for them”. Or, what’s the least we can spend on building it, how do we prove that it had value or didn’t, and, if it doesn’t have value, remove it? That is a hard human process, I’ve found. Because I don’t come from a product background traditionally, it’s all been a good learning for me.
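Editor’s note: the simple trust rules mentioned in this answer (not null, unique, looks like a phone number) and the open question of which failures deserve a notification can be sketched as below. This is a minimal illustration; the severity labels, the `muted` flag, the loose phone pattern and all function names are assumptions made for this note, not AgileData’s design.

```python
import re
from collections import Counter

# Deliberately loose pattern (assumption): optional +, then 7 to 20 digits, spaces, brackets or dashes.
PHONE_PATTERN = re.compile(r"^\+?[0-9 ()-]{7,20}$")


def failed_not_null(values):
    return [v for v in values if v is None]


def failed_unique(values):
    counts = Counter(v for v in values if v is not None)
    return [v for v, n in counts.items() if n > 1]


def failed_phone(values):
    return [v for v in values if v is not None and not PHONE_PATTERN.match(str(v))]


def run_trust_rules(rows, rules):
    """Run each rule against its column and record pass/fail plus the failing values."""
    results = []
    for rule in rules:
        values = [row.get(rule["column"]) for row in rows]
        failures = rule["check"](values)
        results.append({**rule, "passed": not failures, "failures": failures})
    return results


def to_notify(results):
    """Only unmuted, important failures should interrupt a person."""
    return [
        r["name"] for r in results
        if not r["passed"] and not r["muted"] and r["severity"] == "important"
    ]


rows = [
    {"customer_id": 1, "phone": "+64 21 555 0100"},
    {"customer_id": 2, "phone": "n/a"},
    {"customer_id": 2, "phone": None},
]
rules = [
    {"name": "customer_id is unique", "column": "customer_id", "check": failed_unique,
     "severity": "important", "muted": False},
    {"name": "phone is not null", "column": "phone", "check": failed_not_null,
     "severity": "minor", "muted": False},
    {"name": "phone looks like a phone number", "column": "phone", "check": failed_phone,
     "severity": "important", "muted": True},
]

results = run_trust_rules(rows, rules)
print(to_notify(results))
```

All three rules fail here, but only one produces a notification, which is the kind of classification and muting the answer says is needed before failures become actionable.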
Ad 00:58:07 – 00:59:36
Tobias Macey: For people who are interested in being able to reduce the overhead involved in being able to just ask questions about the data for their business without having to invest in building out a whole team to support it. What are the cases where AgileData is the wrong choice?
Shane Gibson: So for us, it’s real time. We’re definitely not a platform for real-time reporting. We can collect data in real time, because effectively we’re running on Google Cloud. We can use the Pub/Sub queueing and topic stuff, where people can actually stream us records in real time and we can just land and store them in the history layer. We could in theory take our rules, which run sequentially at the moment. I think about it as the London Underground. Data comes into our history, and by default it turns on what we call auto sync. So as soon as a new record comes up in a history tile, we automatically trigger every dependent rule. And it uses a manifest process that dynamically builds the DAG every time, rather than having the DAG stored. So we query the config and we go: what’s dependent on that history tile? Run those rules. What’s dependent on the output of those rules? Run the next rules. And we daisy-chain through that manifest. So that runs pretty quick, because Google just scales like snort, but it’s still sequential. It still waits. Now, in theory, we could pick up the entire config and encapsulate it into a single piece of code that runs in memory, and every time a new record comes up in real time, effectively re-instantiate that code and give a real-time score or answer at the end. But we’re not going to invest in that. It’s not a big use case for us, and it’s a high level of complexity for engineering to get that right. And we find very few use cases where a customer actually wants a number to change in real time at the end of the last mile. I look at a number and it’s just changing, so that’s not a use case for us. We could do it, but we’re not going to. That’s probably the main one. Everything else, we are really good at it, or we’re going to get good at it, as long as it’s a data problem.
I suppose the other one is if you don’t want the platform to store the data. If that’s something that you can’t let happen, then we’re not a product for you, because we do collect the data and store it in history. It’s a foundational piece of our pattern. We’ve got Google Omni, which is really interesting, which in theory says you can have your data in S3 or Azure Blob Storage, and we can go and hit that. But we’d have to change a bunch of patterns to say that you hold that history layer, which we can do. But what happens if you change it? What happens if you delete it? We lose that immutable record, and we can’t replay all our rules and give you the same answer anymore. And for us, that’s an important feature: at any time, you can see what the number you reported was. We think that’s a core part of data for what we do.
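Editor’s note: the manifest idea from this answer, where the DAG is rebuilt from config each time a history tile changes rather than being stored, can be sketched as follows. This is a minimal illustration using Python’s standard-library `graphlib`; the config shape, tile names and function names are hypothetical and are not AgileData’s implementation.

```python
from graphlib import TopologicalSorter


def downstream_rules(config, changed_tile):
    """Find every rule that depends, directly or indirectly, on the changed tile."""
    affected_tiles = {changed_tile}
    selected = set()
    grew = True
    while grew:
        grew = False
        for name, rule in config.items():
            if name not in selected and affected_tiles & set(rule["inputs"]):
                selected.add(name)
                affected_tiles.add(rule["output"])
                grew = True
    return selected


def build_run_order(config, changed_tile):
    """Build the DAG on the fly and return the affected rules in dependency order."""
    selected = downstream_rules(config, changed_tile)
    producers = {config[name]["output"]: name for name in selected}
    # Map each rule to the selected rules that must run before it.
    graph = {
        name: {producers[tile] for tile in config[name]["inputs"] if tile in producers}
        for name in selected
    }
    return list(TopologicalSorter(graph).static_order())


config = {
    "customer_concept": {"inputs": ["history.crm_customers"], "output": "concept.customer"},
    "product_concept": {"inputs": ["history.products"], "output": "concept.product"},
    "order_event": {"inputs": ["history.orders", "concept.customer"],
                    "output": "event.customer_orders"},
    "consume_orders": {"inputs": ["event.customer_orders"], "output": "consume.orders"},
}

# A new record lands in the customer history tile: only its dependents run, in order.
print(build_run_order(config, "history.crm_customers"))
```

Because the order is derived from config at trigger time, adding or changing a rule needs no separate pipeline definition, at the cost of each step waiting for the one before it, as the answer notes.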
Tobias Macey: As you continue to build out and iterate on the product and the platform, what are some of the things you have planned for the near to medium term or any particular projects or problem areas that you’re excited to dig into?
Shane Gibson: So we kind of look at it in two ways. Every time I touch something and it takes too long, or it’s complex, we’re gonna go and make that puppy simple. And then there’s a bunch of core capabilities or features that we know we haven’t invested in yet. If I look at the whole idea of the semantic metrics layer, we’ve done the designs for that, and we’re waiting for the Looker LookML service to come out, so we can bind our API and our app with that service layer. So we’d have to build out the execution of those metrics ourselves, but we’ve got to build that out. At the moment, when I build a metric for a customer, 9 times out of 10 I’m building it in that last mile tool. And we want to move that metric definition back into our product. Almost that headless BI pattern, because we see that as high value. We also want the machine to do more work. So we want to get more into writing algorithms under the covers that provide recommendations. We have some simple ones at the moment, things like data drift, where we run algorithms to tell us that the data looks like it’s skewed, and it alerts us: that looks weird, go look at it, rather than waiting for the customer to say the data is kind of not what we thought it was. We want to build more of that. So more of those models, but we see them as recommendations. They’re not doing the work. They’re just the machine giving us a hint to reduce the work that we do as humans. And the last one that I’d really like to do, but it’s one of the ones that Google removed, was the idea of natural language Q&A. We do research spikes every now and again; we call them McSpikeys. And I was really keen on this idea a couple of years ago, because of why we’re not going to build a last mile tool in our product. 9 times out of 10, people just want to answer a data question: how many customers have I got? What products did a customer order? What was the total sales in this region?
And so that was the whole natural language idea: ask a question and get an answer. So we did a McSpikey to see what it would take for us to build that out. And we estimated that if we found five really, really smart engineers in that space and gave them a year, we’d have a really basic capability to ask a question and get an answer, and we’re not gonna invest in that. And then Google announced the Q&A service as a private beta. So we jumped on that one. And what that was, was the ability to ask a natural language question and get an answer. And we tried that out. And it was a thing of beauty. We were like, that’s actually better than a lot of the commercial products we’ve seen. And then we’re like, why is it so good? Well, I didn’t realize that in Google Analytics you can go and ask it a question, and it’s a natural language question. In Google Sheets, you can go and ask a natural language question. And in Google Slides, when it’s recommending some style changes, that’s natural language. So Google had taken the whole dictionary capability out of Google Search and been testing these dictionaries and these language algorithms in these other products for years. And then they embedded this in the Q&A service. So we’re like, that’s cool. In our product we have a thing, a little chatbot that comes out from the side. And when you want to do something, it says, you’ve done this, the next thing we normally would see is do this next. Or, you’re going to do this, you’ve got a choice. For example, if you’re gonna go delete a rule, and there’s a draft rule in play, do you just want to delete the draft, or do you want to delete the draft and all the previous versions, so effectively remove that rule completely so it’s no longer in production? So it pops up and says, which of these two would you like to do? So it’s really simple. We embed natural language into that: you ask her a question, and she’ll give you an answer.
Because the config holds all the metadata, we understand what each field means, we understand how it’s used, and we understand the rules. So that was cool. So we backlogged that until we had some time and could make it a priority. Come six months, 12 months later, I’m like, let’s go and build a cool feature, because we just need a bit of a break from the plumbing stuff. Go back to touch it and it’s gone. So we reach out to the product team: where’s it gone? We’ve turned the beta off. So when’s it gonna come out? It’s not. You might see it in the BI tool; the Looker, Google Data Studio stuff is where we think it will turn up, but it’s not going to be a service we can use. And that was like, that would have saved us years of engineering. I mean, we’re not going to engineer it ourselves. I really wanted that. I think it had value. So ups and downs, that’s what happens when you’re relying on other people’s technology to do stuff so you don’t have to.
Tobias Macey: Especially when that someone is Google, who has a long track record of releasing really cool things that people like and then saying, now we changed our mind.
Shane Gibson: Their definition of agile product is you go and invest in something and then you build it out. And if people don’t use it, then you kill it. That’s hard because somebody’s using it. Not enough, but somebody’s relying on it. That’s kind of how it goes. I mean, they are the epitome of agile product development. But as consumers, we get to love to hate their behavior sometimes.
Tobias Macey: Absolutely. Are there any other aspects of the work that you’re doing on the AgileData platform, or the overall space of building a back end service for analytics that we didn’t discuss yet that you’d like to cover before we close out the show?
Shane Gibson: I’ve focused a lot on our product, and how we have backend services and front-end apps, and it’s just beautiful to help the analysts do that work. But I’ve probably underplayed a lot around that way of working. Actually, you still have to be data savvy. And I use the word data savvy now on purpose, because I read an article or listened to a blog, Kestrel, and somebody had this really good comment, which is when you tell somebody they’re not literate, actually, you’re being quite derogatory. Everybody’s literate. It’s about the level of literacy. So as engineers, and as data modelers, there’s a big thing going around LinkedIn at the moment around data modelers, and the fact that we’ve lost the art of modeling. And for those of us that believe modeling has value, we see the loss of it as a big loss and we’re quite negative about it. We need to use words that are polite and true. So you don’t have to be an expert in data, you just have to be savvy. And so you need practices or ways of working that help you understand how to take the savviness that you’ve got and apply it in the data domain. So that idea of data modeling: the idea of saying there’s only three things you need to worry about. You have a concept of a thing: customer, order, product. You have some detail about it: a customer has a name; a product has a SKU, or a name, or a type; an order has a quantity and a value and a date. And they go together in an event, which is where we see a relationship: we see a customer order a product, and we want to record the fact that that happened because it’s important to us. If you take that way of working, and you combine that with a product so that they have that in a simple form, that’s where we think the value is. That’s where we think we can enable the analysts to do the work and engineer the plumbing work, and go back to their... The boring old pipes just move the water.
We still need smart people that are really good at problem solving, to do that hardcore engineering analysis problem solving at the end of that pipe. That’s where the value is. And that’s where those constrained resources, those people that don’t have enough time should be focused. We should just automate the plumbing. That’s what we do in our houses. Why don’t we do that with our data?
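The three things Shane lists (a concept, some detail about it, and an event that relates concepts) can be sketched as plain data structures. This is a minimal illustration, not AgileData’s schema: the class names `Concept`, `Detail` and `Event`, their fields, and the sample keys and values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Concept:
    """A thing the business cares about, identified by a business key."""
    kind: str  # e.g. "customer", "product", "order"
    key: str   # e.g. a customer number or SKU


@dataclass
class Detail:
    """Descriptive attributes about one concept."""
    concept: Concept
    attributes: dict


@dataclass
class Event:
    """A recorded fact that several concepts came together."""
    name: str
    concepts: tuple
    occurred_on: date


# Made-up sample data for illustration.
customer = Concept("customer", "C-001")
product = Concept("product", "SKU-42")
order = Concept("order", "O-1001")

customer_detail = Detail(customer, {"name": "Jane Doe"})
product_detail = Detail(product, {"name": "Widget", "type": "hardware"})
order_detail = Detail(
    order, {"quantity": 3, "value": 59.97, "date": date(2022, 11, 1)}
)

# "We see a customer order a product" recorded as an event.
customer_orders_product = Event(
    name="customer orders product",
    concepts=(customer, order, product),
    occurred_on=date(2022, 11, 1),
)
```

Keeping the model to these three shapes is the point of the passage: an analyst only has to decide whether something is a thing, a description of a thing, or a record that things came together.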
Tobias Macey: All right. Well, for anybody who wants to get in touch with you and follow along with the work that you’re doing, I’ll have you add your preferred contact information to the show notes. And as the final question, I’d like to get your perspective on what you see as being the biggest gap in the tooling or technology that’s available for data management today.
Shane Gibson: Same answer as the last podcast. I think there’s too much complexity in our platforms and the way we work. And that’s what we’re hyper focused on fixing. I still think there is a lack of machines doing the work for us where they can. The example we used last time was finding the unique key on a table. Actually, somebody on LinkedIn replied back and said the data quality tools of 20 years ago did it, they found the foreign key relationships, and that’s true. But I don’t see it in the modern world. So we kind of lost that art. So for me, it’s when the machines get smart enough to do the work for us and we don’t notice. We just take the recommendation and go, that made sense. And that’s where the data world needs to get to. So hopefully the next generation of data platforms are ones that make our toast for us, pour our coffee for us, that kind of thing.
Tobias Macey: And do it at the right time. So it’s not cold by the time you get to it.
Shane Gibson: Just think about an espresso. We still have to put the water in and we still have to put the pod in, but that’s it. We still pick the size of coffee we want, the strength, we still have a lot of choice, but the core plumbing, that’s taken care of. Well, it’s not really, is it, because you have to fill up the water thing, but you know what I mean.
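The unique key example Shane refers to in his answer above, where the machine finds the key on a table instead of a person, can be sketched in a few lines. This is an illustrative brute-force approach, not how any particular product does it: the function name `candidate_keys`, the `max_columns` limit, and the sample rows are hypothetical, and it assumes every row has the same columns with hashable values.

```python
from itertools import combinations


def candidate_keys(rows, max_columns=2):
    """Return the smallest column combinations whose values are unique
    across all rows, checking combinations of up to `max_columns` columns.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    found = []
    for size in range(1, max_columns + 1):
        for combo in combinations(columns, size):
            # Skip combinations that merely extend a key already found.
            if any(set(key) <= set(combo) for key in found):
                continue
            values = [tuple(row[column] for column in combo) for row in rows]
            if len(set(values)) == len(values):
                found.append(combo)
    return found


# Made-up order rows for illustration.
sample_rows = [
    {"order_id": 1, "customer": "A", "product": "X"},
    {"order_id": 2, "customer": "A", "product": "Y"},
    {"order_id": 3, "customer": "B", "product": "X"},
]

keys = candidate_keys(sample_rows)
```

A result like this is only a recommendation: a column can be unique in a sample by coincidence, so a person still confirms which candidate is the real business key.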
Tobias Macey: All right. Well, thank you very much for taking the time today to join me and share the work that you’re doing on the AgileData platform and product. It’s definitely a very interesting approach that you’ve taken and an interesting service you’re providing. So I appreciate all of the time and energy that you and your co-founder are putting into making analytics a set and forget operation as much as possible. So thank you again for that and I hope you enjoy the rest of your day.
Shane Gibson: Thanks for having me on. It’s been great as always.
PODCAST OUTRO: Thank you for listening. Don’t forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show. Sign up for the mailing list and read the show notes. And if you’ve learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.