#AgileDataDiscover weekly wrap No.1

09 Jun 2024 | AgileData Discover, Blog

TL;DR

We are working on something new at AgileData. Follow along as we build it in public.

Latest update

Shane Gibson - AgileData.io

The backstory that led us to the decision to do a 30-day bet on #AgileDataDiscover


In our first wrap, we set the scene with a few contextual examples that inspired our 30-day bet.

Legacy data platform migration to AgileData

One of the blockers our #AgileDataNetwork partners hit when talking to organisations about becoming their Fractional Data Team is the theory of sunk costs. The organisation may have already invested in a data platform, and maybe a permanent data team to build that platform and deliver data assets. Or they may have already paid a data consultancy to build a new modern data platform for them, and then continue to pay the consultancy for the ongoing data work.

Even when our partners offer to reduce the organisation’s annual data costs from $600k to $60k, the theory of sunk costs still gets in the way. One of the barriers is the cost and time to rebuild the current data assets and Information Products in the AgileData Platform. So we have always been keen to find a way to automate the migration of an organisation’s code onto our platform. Sounds simple, right? No. We want to do it in a way that doesn’t lose all the secret sauce we have built into AgileData, the things that allow our partners to reduce an organisation’s data cost from $600k to $60k.

We’re looking for a way to: 

  • Automate the process of discovering the core data patterns in the organisation’s current data platform, 
  • Map those to the core data patterns we have built into AgileData, then
  • Automagically generate the AgileData config to apply those patterns.

 

Automating data discovery work

From Data Governance to Data Enablement

Earlier this year we were working with an enterprise organisation’s Data Governance team. Well, they were called a team, but in reality it was a team of one. That one person was frickin awesome, but that is a different story for a different time. The organisation was migrating from a legacy data warehouse to a “modern” data platform.

The Data Governance lead was attempting to enable this to happen as quickly as possible, while putting in place the data principles, policies and patterns that were lacking when the legacy data warehouse was created.

(This piece of work was a fascinating one, so if you’d like to know more about what we did, jump over to #AgileDataDiscovery for the long version of this story.) TL;DR: being a team of one meant they could not actually do a lot of the work themselves, so they had to ask the other data teams to help with the work.

We decided to flip the model and move from being a Data Governance team of one trying to get the work done, to being a Data Enabler of one, helping the other data teams collaborate to share and leverage the data principles, policies and patterns they had each created independently.

One area that was a real blocker for the data teams was documenting the current state of the data warehouse, framed around a pattern of Information Products: what was already available, what was being used, what was adding business value, what could be decommissioned, and the effort and feasibility of rebuilding those that still had ongoing value.

There must be a better way to automate this data discovery work.

 

Greenfields Data Warehouse Rebuild

One of the things we have found our customers value is the speed and cost at which we can execute Data Research Spikes using AgileData. We fix the delivery timeframe at three months and fix the cost for the work.

Often customers already have a data team and a data platform, but the team is under the pump with higher-priority data work and cannot get to the research work anytime soon. The goal of the research work is to reduce uncertainty, or to show the art of the possible.

  • Often a business stakeholder has an idea and wants to know what is feasible before they decide to invest in it fully.  
  • The Head of Data wants to deliver the research work, but the data team has no spare capacity to do it.  
  • The Head of Data prefers the business stakeholder does not independently engage yet another data consulting company with whom the organisation has no existing relationship. External data consultancies often treat this work as a black-box delivery, providing the answer but not the workings. This is why organisations engage us: we deliver both.

We have just finished one of these Research Spikes.

The organisation’s data team is building a new greenfields data platform to replace their legacy data warehouse. The legacy data warehouse currently provides over 1,000 separate reports, all built in a legacy BI tool. The reports and the data warehouse have variable levels of documentation. The team know that a lot of the reports are slight variations on each other, where a report has been copied and a filter added, or a different time period hard-coded. But the team is stuck manually reviewing and comparing each report to work out what it does and whether it can be consolidated or decommissioned.

The research spike we did was to answer this question:

“Can we use an LLM to compare exported report definitions to reduce the manual comparison work required?”

We received a subset of the report definitions as XML. One cluster had already been reviewed by the data team and confirmed to be slight variations on each other. A second cluster was based on a separate set of core business processes.

We then used Google Gemini to compare the report definitions.
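We won’t reproduce the full spike here, but the shape of the comparison is roughly the sketch below. It is a minimal sketch assuming the google-generativeai Python SDK; the model name, file names and prompt are illustrative placeholders rather than what we actually ran.

```python
# Minimal sketch: asking Gemini to compare two exported report definitions.
# Assumptions: the google-generativeai SDK is installed, an API key is available,
# and report_a.xml / report_b.xml are hypothetical exported report definitions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model name


def load_report(path: str) -> str:
    """Read an exported report definition (XML) as plain text."""
    with open(path, encoding="utf-8") as f:
        return f.read()


report_a = load_report("report_a.xml")
report_b = load_report("report_b.xml")

prompt = f"""You are comparing two BI report definitions exported as XML.
Summarise what each report does, then list the differences in data source,
filters, hard-coded time periods and layout. Finish with a one-line verdict:
are these slight variations of the same report, or genuinely different reports?

Report A:
{report_a}

Report B:
{report_b}
"""

response = model.generate_content(prompt)
print(response.text)
```

Note that for a set of over 1,000 reports you would not compare every pair like this; in the spike we only worked with the small clusters the data team had already selected.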

The answer to the research question was yes: the LLM approach could reduce the manual work required, but it could not automate the work fully. When we asked the LLM to merge some of the XML report definitions, it politely responded with: “I can’t do that. The reports are complex XML documents that define the data source, layout, and other properties of a report. Simply combining three separate report XMLs will result in an invalid structure.”

There must be a better way to automate this data discovery work.

Keep making data simply magical

Follow along on our discovery journey

If you want to follow along as we build in public, check back here regularly. We will add each day’s update so you can watch as we build and learn.