AgileData >>> Modern Data Stack
AgileData’s mission is to reduce the complexity of managing data. A large part of modern data complexity is selecting, implementing and maintaining a raft of different technologies to provide your “Modern Data Stack”. At AgileData we have already done the hard work to engineer the AgileData Modern Data Stack, and you get this capability as part of our SaaS solution. This article describes what a Modern Data Stack is and what we use under the covers to deliver this.
What’s in a name
Like all “new” things in the data world, they often come with a new name. In the current wave of data evolution the new name is the “Modern Data Stack”. So what is the Modern Data Stack, I hear you ask?
To describe this concept, engineers would typically start using a complex language composed of three-letter acronyms (such as ELT, DBMS, MDM, TDD, GIT etc.). But before we dive into that complexity, let’s find a less complex description. (Don’t worry, if you are looking for the technical details from an AgileData perspective, they are coming later in this article.)
“Before we begin, however, it’s important to understand what exactly we mean by modern data stack:
- It’s cloud-based
- It’s modular and customizable
- It’s best-of-breed first (choosing the best tool for a specific job, versus an all-in-one solution)
- It’s metadata-driven
- It runs on SQL (at least for now)”
Bob Muglia, Barr Moses – What’s in Store for the Future of the Modern Data Stack?
Now while we agree with most of these points, we disagree with the third point. At AgileData we believe the next data wave will be a low-code platform that is an all-in-one solution and does not require you to cobble together a bunch of complex best of breed technologies. So that is what we are building.
But for the rest of the points, hell yes, we are on board with those; in fact, they are the basis of the Modern Data Stack that underpins AgileData. So let’s have a look at the Modern Data Stack in more detail.
Modern Data Stack Capabilities, with simplicity
Let’s start off with a simple view of what capabilities a Modern Data Stack needs.
Data Collection
Often called Data Integration or Extract and Load, this is how you collect data from where it is entered or captured and put it in a place where you can combine and use it. This data may be collected from your SaaS applications or bespoke systems.
Data Storage
Often called a Data Warehouse, Data Lake, or Data Storage, this is where you store all the data that you need to use, for all time.
Data Modelling
A visual way of displaying what data structures should look like, or what they do look like. It often includes the ability to define Conceptual and Logical models and turn them into Physical models in the Data Warehouse.
Data Transformation
The T (Transformation) in ELT. These are the tools you use to change the way data is stored or behaves. They are the tools you use to create code which augments your data, for example defining metrics or deriving new data, such as active customers, based on a set of business logic.
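A transformation rule like “active customers” can be sketched as a simple predicate over raw records. This is a hypothetical illustration of the idea (the customer/order fields and the 90-day window are assumptions, and AgileData expresses such rules through its low-code interface rather than hand-written code):

```python
from datetime import date, timedelta

def active_customers(orders: list[dict], as_of: date, window_days: int = 90) -> set[str]:
    """Derive 'active customers': anyone with an order inside the window.

    A hypothetical business rule for illustration; real definitions vary.
    """
    cutoff = as_of - timedelta(days=window_days)
    return {o["customer_id"] for o in orders if o["order_date"] >= cutoff}

orders = [
    {"customer_id": "c1", "order_date": date(2024, 5, 1)},
    {"customer_id": "c2", "order_date": date(2023, 1, 1)},
]
print(active_customers(orders, as_of=date(2024, 6, 1)))  # {'c1'}
```

The point is that “active customer” is not raw data — it is new data created by applying business logic to raw data, which is exactly what transformation tooling exists to do.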
Orchestration and Monitoring
Often called Scheduling or Observability, these are the tools which run your code in a certain order or at a certain time. They monitor whether code ran successfully or failed, notifying you of any problems. They will also notify you if the data that the code updated looks different or weird compared to what was expected.
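The two halves of this capability, running steps in order and flagging failures or weird-looking data, can be sketched in a few lines. This is a hypothetical illustration of what a scheduler/monitor does (the step names, row counts and 20% tolerance are assumptions, not how any particular product works):

```python
def run_pipeline(steps):
    """Run named steps in order; stop at the first failure and record why."""
    results = {}
    for name, fn in steps:
        try:
            results[name] = ("ok", fn())
        except Exception as exc:
            results[name] = ("failed", str(exc))
            break  # downstream steps depend on this one, so stop here
    return results

def check_volume(actual: int, expected: int, tolerance: float = 0.2) -> bool:
    """Flag a load as 'weird' if row counts deviate more than 20% from expected."""
    return abs(actual - expected) <= tolerance * expected

def publish():
    raise RuntimeError("target unavailable")

results = run_pipeline([
    ("collect", lambda: 100),   # returns rows loaded
    ("transform", lambda: 98),
    ("publish", publish),
])
print(results["publish"])                      # ('failed', 'target unavailable')
print(check_volume(actual=98, expected=100))   # True: within tolerance
```

Real orchestration tools add retries, alert routing and historical baselines, but the core job is the same: run things in order, notice when they break, and notice when the data itself looks off.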
Data Catalog
Often called a Data Catalog, Data Dictionary, Business Glossary or Knowledge Base, this allows your developers, analysts and potentially users to see what data is held, where it lives and what it means.
Data Consumption
Often called Visualisation, Business Intelligence, Dashboarding, Reporting, Reverse ETL, Analysis or Last Mile, these are the tools which consume the final data outputs and present them to users or systems in a way that they can use. One of the most common is still Dashboards which display KPIs.
Analytics
Often called AI or Machine Learning, these are tools that apply statistical techniques to data. They automate tasks a human would normally be required to do.
A complex picture, sheesh, just give me 100 words
That seems like a lot of capabilities we need for a Modern Data Stack, right?
Well, it gets more complex, because technologists and vendors like to make things very complex. And so the data vendors have taken each of those eight capabilities and broken them down into a truckload more categories, followed by a bunch of new companies creating new software solutions to try and become the “best of breed” leader in one or more of those categories.
Nothing articulates this complexity better than The Machine Learning, AI and Data (MAD) Landscape, which is produced by Matt Turck each year.
“For those who have remarked over the years how insanely busy the chart is, you’ll love our new acronym – Machine learning, Artificial intelligence and Data (MAD) – this is now officially the MAD landscape!”
A less complex path
As often happens when people are confronted with a mass of complexity, they strive to make things simpler. When others hit the same complexity, they often find the proven paths forged before them and adopt them. So it is with the Modern Data Stack: the industry has started to standardise on a set of “best of breed” technologies, often referred to as a “proven path”.
A less complex picture
And this allows us to draw a slightly less complex picture, which focuses on the small set of capabilities outlined earlier and a small subset of technologies/vendors that are often cobbled together to deliver these capabilities as the Modern Data Stack.
There are no logos in the Catalog, Data Consumption or Analytics capabilities, as I don’t believe a vendor or product has reached proven path status in those capabilities (yet).
But even though there are fewer logos in the Proven Path, you still have to cobble those technologies together yourself.
A simple path and picture
As outlined earlier, at AgileData we do not align with the “best of breed” view for the majority of the Modern Data Stack. We aim to remove the complexity of cobbling your own solution together by providing a single SaaS solution, that does what you really need a modern data platform to do.
AgileData Modern Data Stack, with a little more detail
To finish, let’s provide a little more detail on the technologies we use under the covers to provide our solution. We know that if you have a technical or engineering bent, you are hanging out for the details.
We leverage the Google Cloud Platform (GCP) as our underlying cloud infrastructure. We have focused on keeping infrastructure cost to a minimum so we can keep our subscriptions at a low price, and we leverage GCP in some magical ways to do this. We cover the cost of the cloud infrastructure so you don’t have to.
Data Collection
We leverage Meltano to automate the collection of your data into AgileData. We have modified the Meltano open source code to allow us to run it with a serverless pattern, using Google Cloud Build to save costs. We have integrated Meltano within AgileData so it is invisible to our users.
Data Storage
We use Google BigQuery as our primary repository to store your data. We use a number of other Google data services, including Google Cloud Storage, to provide automation and redundancy, so we don’t have to do things manually.
Data Transformation
We have built a unique set of capabilities to allow business and data analysts to transform data using a low-code, rules-based interface, via the AgileData browser-based application. To power these capabilities we use a combination of Svelte, Google App Engine, Swagger, Google Cloud Functions, Google Pub/Sub, and Google Spanner.
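One way a low-code, rules-based interface can work is to capture each rule as declarative configuration and compile it to SQL behind the scenes. The sketch below is purely illustrative, it is not AgileData’s actual rule format or compiler, and the table and column names are invented:

```python
def rule_to_sql(rule: dict) -> str:
    """Compile a declarative rule into a SQL statement (illustrative only)."""
    cols = ", ".join(rule["columns"])
    where = " AND ".join(rule.get("filters", [])) or "TRUE"
    return f"SELECT {cols} FROM {rule['source']} WHERE {where}"

rule = {
    "source": "landing.orders",
    "columns": ["customer_id", "order_total"],
    "filters": ["order_total > 0"],
}
print(rule_to_sql(rule))
# SELECT customer_id, order_total FROM landing.orders WHERE order_total > 0
```

Because the rule is data rather than code, a browser-based interface can build, validate and version it without the analyst ever writing SQL by hand.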
Data Modelling
We have built a unique set of capabilities to allow Business and Data Analysts to model data in a simply magical way. This is provided by the same AgileData application and uses the same technologies outlined for Data Transformation.
Orchestration and Monitoring
We have built a unique set of capabilities to orchestrate the execution of the transformation code/rules using a pub/sub pattern (rather than a DAG, or Directed Acyclic Graph, pattern). We have also built our own automated monitoring and logging capabilities as part of our DataOps practices. To power these capabilities we use a combination of Google Pub/Sub, Google Cloud Logging, BigQuery Machine Learning, Google Data Loss Prevention, and Google Cloud Armor.
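The difference between pub/sub orchestration and a DAG can be shown with a minimal in-process sketch: instead of a scheduler walking a predefined graph, each rule publishes a completion event and downstream rules subscribe to the events they care about. This is a toy illustration of the pattern (the topic names and broker are invented, and a real system would use Google Pub/Sub rather than an in-memory class):

```python
from collections import defaultdict

class Broker:
    """Minimal in-process pub/sub: handlers fire when their topic is published."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self.subscribers[topic]):
            handler(payload)

broker = Broker()
ran = []

# Each rule publishes a completion event when it finishes; downstream rules
# subscribe to that event, so execution order emerges from the events
# themselves rather than from a predefined DAG.
def transform_rule(payload):
    ran.append("transform")
    broker.publish("transformed.orders", payload)

def publish_rule(payload):
    ran.append("publish")

broker.subscribe("landed.orders", transform_rule)
broker.subscribe("transformed.orders", publish_rule)
broker.publish("landed.orders", {"rows": 100})
print(ran)  # ['transform', 'publish']
```

A nice property of this design is that adding a new rule means subscribing it to an existing event, with no central graph definition to edit and redeploy.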
Data Catalog
We have built a set of capabilities to allow any user to see what data is held, where it lives and what it means, via our AgileData application. This is the same application and uses the same technologies outlined for Data Transformation.
Data Consumption
We do not provide the capability to consume data within AgileData; instead we integrate with most of the best of breed solutions in this space, and we are continuously adding integrations to more. You can find the latest list at Data Consumption Tools. If you find the list overwhelming, we recommend you start with Google Data Studio.
Analytics
We do not provide analytics capability within AgileData; we suggest you use the Google Cloud Platform for this, including BigQuery Machine Learning or Google Vertex AI (those are the analytics tools we use under the covers to power AgileData, so we are confident they integrate well). If you are already using best of breed technologies in this space, reach out and we can confirm how we can provide access to AgileData from that technology.
Plumbing is complex, that’s why we do it for you
As part of building out the AgileData solution we have combined a number of individual technologies to deliver data in a simply magical way. That way your team can get on with the work of collecting, combining and presenting data, rather than spending months designing and building your own custom Modern Data Stack.
We are happy to share more details of what we have built and the agile DataOps way of working we use to build, deploy and maintain it. Just ask (but we do reserve the right to keep some of our magic tricks to ourselves).
Keep making data simply magical
AgileData provides both a Software as a Service product and a recommended AgileData Way of Working. We believe you need both to deliver data in a simply magical way.
A Modern Data Stack underpins the AgileData solution.