Some tasks seem really small and only take minutes, but multiply that effort by a hundred repetitions and you have found a task that should be automated.
Collect your data
In AgileData.io we automate the collection of your data from your data factories (the places you create new data, often called systems of record or source systems). Whether it's a SaaS data factory, an on-premises or private cloud data factory, or a series of Excel files, we want to remove the complexity of collecting that data for you.
Corporate Data Memory
Once we have collected that data we move it to your History area in AgileData.io. Each time we collect new data we rack and stack it in the History area for you, so you can always see what that data looked like last week, last month or last year. The History area stores all your data for all time, which we think of as the corporate memory of your data.
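The History area behaves like an append-only log: new versions are added, nothing is overwritten. Here is a minimal Python sketch of the idea (the in-memory list, the `_loaded_at` field and the `CustomerId` field are illustrative assumptions, not AgileData.io's actual implementation):

```python
from datetime import datetime, timezone

# In-memory stand-in for a History table: every collected version is
# appended with a load timestamp, never updated or deleted.
history = []

def rack_and_stack(records):
    """Append a newly collected batch to History with a load timestamp."""
    loaded_at = datetime.now(timezone.utc)
    for record in records:
        history.append({**record, "_loaded_at": loaded_at})

def as_at(customer_id, point_in_time):
    """Return the version of a customer that was current as at a given time."""
    current = None
    for r in history:  # history is in load order, so the last match wins
        if r["CustomerId"] == customer_id and r["_loaded_at"] <= point_in_time:
            current = r
    return current
```

Because nothing is deleted, "what did this customer look like last month?" is just a question of filtering History by timestamp.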
The Data Domain is Complex
In the data domain there are a number of complex techniques that we can apply to make this process more efficient, and in true technical behaviour we often give those techniques "three letter acronyms" (TLAs) to ensure only the most technical and data-literate of us can understand how they work. You will hear terms such as CDC, SCD Type 2, Snapshots, Deltas and Temporal.
To apply these techniques we need to identify the thing that will tell us every time the data has changed, so we know we should keep that change. Sounds easy, right? Just compare the records we collect against the records we have already collected?
But one of the complexities we need to deal with is that sometimes we will collect the same data more than once, i.e. we have collected it before and already stored it in History, and we would really prefer not to store it again.
The Key is finding the Key
So the technique we apply is to identify the key of the data we collect and then use that key to identify if the data we collected this time has been seen before.
For example, your data factory may have a field called CustomerId that is unique for every customer. When we collect some details about the customer, say Name and Address, we compare the Name and Address for that Customer ID to the data we already hold for that Customer ID in the History area. If it looks different we store the new version; if it looks the same we ignore it, as we already have it.
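That compare-by-key step can be sketched in a few lines of Python. This is an illustrative sketch only (the dict-based store, the hashing approach and the field names are assumptions for the example, not AgileData.io's internals): hash the non-key attributes of each collected record and keep the record only if its hash differs from the version already held for that key.

```python
import hashlib
import json

# Stand-in for the History area, keyed by CustomerId. In practice this
# would be a History table, not a Python dict.
held_hashes = {}

def row_hash(record, key_field="CustomerId"):
    """Hash the non-key attributes so two versions can be compared cheaply."""
    attrs = {k: v for k, v in sorted(record.items()) if k != key_field}
    return hashlib.sha256(json.dumps(attrs).encode()).hexdigest()

def load_if_changed(record, key_field="CustomerId"):
    """Keep the record only if it is new or differs from the held version."""
    key = record[key_field]
    new_hash = row_hash(record, key_field)
    if held_hashes.get(key) == new_hash:
        return False                # same as what we already hold: ignore it
    held_hashes[key] = new_hash     # new or changed: keep this version
    return True
```

Collecting the same record twice is now harmless: the second copy hashes to the same value and is simply ignored.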
Data Factories are designed to capture data, not combine or present it
Now some data factories are designed under the covers in a way that is very efficient for the capture of this data, but makes it a little difficult to find a unique Customer ID Key. Again we use technical data terms like foreign keys, role-playing tables and composite keys to describe some of the patterns that can be used in data factories to define a unique Customer ID.
So when we collect a new set of data, from a current data factory or a new one, we have some manual work to do to identify and confirm the Key for that data.
4 minutes is fast
We have got the process of identifying the Key and loading a new set of data into the History Area in AgileData.io down to approximately 4 minutes effort.
This includes profiling the collected data, determining the key, confirming the key is unique, creating the rule that will populate the History Table, executing the code that loads it and validating the History Table has all the required data.
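The "confirming the key is unique" step in that list is worth making concrete. A minimal sketch, assuming records arrive as Python dicts (the function name and shape are illustrative, not AgileData.io's actual tooling): a proposed key, single-column or composite, is only valid if no two records share the same key value.

```python
def confirm_key(records, key_columns):
    """Check that key_columns uniquely identify every collected record.

    Works for single-column keys (["CustomerId"]) and composite keys
    (["OrderId", "LineNumber"]) alike.
    """
    seen = set()
    for record in records:
        key = tuple(record[c] for c in key_columns)
        if key in seen:
            return False   # duplicate key value: not a valid key
        seen.add(key)
    return True
```

If the check fails, the candidate key is wrong, and loading on it would silently collapse distinct records together, which is exactly why this confirmation step is part of the 4 minutes.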
When we applied this process against our first customer's data factory, which was Shopify, there were the equivalent of 8 History Tables to configure rules for. So 4 minutes x 8 meant half an hour of elapsed effort and we were done. Great.
As an aside, the way Shopify provides access to their data is via an Application Programming Interface (API = another TLA), and the API returns data that is nested and is also based on events (order, payment etc.). So each Shopify History Table is the equivalent of multiple tables in other data factories. While the creation of History Tables for Shopify was quick, the complexity moved to configuring the rules that identify the Concepts, Details and Events we need to be able to use the data to make decisions, but that is a subject for another post.
4 minutes is fast, unless you do it a hundred times
Our second customer was delivering a Data Migration use case, where their legacy data factory had 380 data tables we needed to collect.
So this became 4 minutes x 380, which equates to around 25 hours of elapsed effort. Not so quick, and also not so much fun; I would characterise that work as "Dross".
Automate the Dross, in an agile way
So here is a perfect candidate for a repetitive task that should be automated.
As I mentioned earlier each data factory can use a different modelling technique to create and manage unique keys for a customer record, so there is complexity we need to manage when automating this process.
Our agile mindset, coupled with this known complexity, means we know we shouldn't try to build a solution that covers every data factory scenario in the first go.
So we will focus on automating an obvious data factory pattern, aiming for 60% of data that is collected being moved to the History Area with less than 30 seconds of elapsed effort.
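The "obvious pattern" that automation pass targets is a single column whose values are unique and non-null across the collected records, which can be detected by profiling. A hedged sketch of the idea (the function and record shape are illustrative assumptions, not the actual AgileData.io automation):

```python
def candidate_keys(records):
    """Profile collected records and return the columns whose values are
    unique and non-null across every record -- the obvious single-column
    key pattern a first automation pass can safely target."""
    if not records:
        return []
    candidates = []
    for col in records[0].keys():
        values = [r.get(col) for r in records]
        if None not in values and len(set(values)) == len(values):
            candidates.append(col)
    return candidates
```

Where the profiler finds exactly one clean candidate, the key can be confirmed and the History rule created with no human in the loop; the trickier composite-key and role-playing-table patterns stay manual for now, which is the agile trade-off.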
That will take a pile of Dross and make it Simply Magical.