The language of data is not so natural
TL;DR
The dream is we can just point the machine at our data, ask our question and get a useful answer.
With ChatGPT we are closer than we have ever been, but we are not there yet.
When Nigel and I first started development of AgileData we had a vision in our heads, one of removing the complexity of doing the data work in a simply magical way.
The nirvana for this would be what I call the magic sorting hat, a hat tip to Harry Potter. The concept being you could just drop your data into a tool, ask it a question and get a relevant answer that helps you achieve the action or outcome you were after.
There are a bunch of tools and technologies that provide a version of this capability, but as soon as they engage with complex data they tend to fail. And in our experience most organisations’ data is complex.
McSpikey’s
From day one we have focussed on developing AgileData using a number of small bets (some would call it an agile way of working ;-), and trying to reduce the uncertainty of those bets via research iterations, which we call McSpikey’s.
NLP Data Rules
One of the first McSpikey’s we did was to see if we could base the core of our product on a Natural Language Processing (NLP) rules engine.
We had already proven in our previous lives as consultants that the Gherkin language had massive value in helping teams write natural language data validation rules, aka data tests. We had also worked with teams that had used the Gherkin language as a way of defining the requirements or business rules required to transform data. For example, defining the rule for an “Active Customer”.
One of the first bets we wanted to place was whether we could build a tool that used natural language as the interface to transform the data.
The talented Nigel wrote a quick and dirty interface we could experiment with.
This interface allowed us to write a transformation rule using the Gherkin language. This rule got parsed and stored in “Config”, which can be thought of as a metadata repository. The Config was then converted to SQL on demand and the SQL executed to transform the data when needed.
What we found was that while the process worked, the user experience didn’t. The user had to have a lot of knowledge of the SQL language to write the natural language, and our target user was a person who didn’t have that knowledge.
As with all McSpikeys we learnt a lot, and some of the patterns we experimented with are still in play today.
For example the core pattern of rule > config > SQL execution is still one of the foundational patterns of the AgileData product.
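To make the rule > config > SQL pattern concrete, here is a minimal sketch in Python. It is not the AgileData implementation: the rule syntax, the config keys and the table and column names (customer, last_order_date, active_customer) are all assumptions invented for illustration.

```python
# Hypothetical Gherkin-style rule. The names here are invented for illustration.
RULE = """
Given customer
When last_order_date >= '2023-01-01'
Then active_customer
"""


def parse_rule(text):
    """Parse Given/When/And/Then steps into a plain config dict (the "Config")."""
    config = {"source": None, "conditions": [], "target": None}
    last_keyword = None
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        rest = rest.strip()
        if keyword == "and":
            # "And" repeats whichever step type came before it.
            keyword = last_keyword
        if keyword == "given":
            config["source"] = rest
        elif keyword == "when":
            config["conditions"].append(rest)
        elif keyword == "then":
            config["target"] = rest
        else:
            raise ValueError(f"Unrecognised step: {line!r}")
        last_keyword = keyword
    return config


def config_to_sql(config):
    """Render the stored config as SQL on demand."""
    where = " AND ".join(f"({c})" for c in config["conditions"]) or "TRUE"
    return (
        f"CREATE OR REPLACE VIEW {config['target']} AS "
        f"SELECT * FROM {config['source']} WHERE {where}"
    )


config = parse_rule(RULE)
print(config_to_sql(config))
```

Note that the When step in this sketch still has to contain a SQL-like expression, which is exactly the user experience problem we hit: the person writing the rule needs to know SQL to write the "natural" language.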
Google Data QnA
A little later on in our AgileData journey, Google Cloud released a private alpha of a Data QnA service.
Data QnA was a new feature inside BigQuery that allowed natural language queries to be written and executed on data stored within BigQuery. It effectively translated the natural language statement to a SQL query.
We signed up for the Alpha and did another McSpikey.
This one focussed not on using NLP to create the rules, but on using NLP so the consumer could ask a plain English question and get an answer from the data. It is a feature tools like Microsoft Power BI and ThoughtSpot have had for a while. And while we don’t plan to deliver the last mile tools as part of AgileData, we thought the ability to answer questions in the AgileData Web App via ADI might be valuable.
As part of the McSpikey, we did some initial UI prototypes as well.
The Google Cloud Data QnA was a beast!
Google had trained the models on years of data from Google Analytics, Google Sheets, Google Slides etc where the NLP Q&A capability was already available.
Because we already design data as part of the AgileData way of working, we already classify the business context of the data, i.e. it is about a Customer, a Product or an Order etc. This meant we could provide hints to the QnA service and get awesome responses that were aware of the data context.
We estimated that building a poor copy of the Google Cloud QnA engine would take at least 5 talented engineers at least 12 months, and that would be if everything went ok (which we all know it wouldn’t). And even then it would still be a “half fat” version, if that.
Unfortunately Google Cloud follow a similar bet process to us (well, to be fair, it is more like we follow a process similar to theirs), and the Data QnA service did not make the cut, so it went no further. Although we believe the QnA capability might appear in the Looker suite of services someday, so we are never saying never.
ChatGPT
And that leads us onto the OpenAI ChatGPT.
We did a quick McSpikey to see how that might help us in the use of NLP.
We have been working on the new features we need to allow Data Magicians to define metrics in AgileData, so we used the metrics concept as the input.
We asked it to convert a Gherkin statement for a metric into SQL.
We asked it to convert the same Gherkin statement for a metric into Python.
We asked it to convert the SQL Statement back to Gherkin.
In hindsight we should have given it a different SQL statement, not reused the one it gave us earlier.
And finally we asked it to convert SQL to a natural language.
The outcome of the McSpikey was that ChatGPT would allow us to create a “Babel Fish” and convert languages back and forth. That may be useful both for the creation of our natural language change rules by a Data Magician and for consumers asking natural language questions of the data.
We would need to do a lot more testing on the accuracy of the language conversion and how we could provide better context of the data, like we did with the Google Cloud Data QnA service.
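One way to think about "better context" is to put the business classification of each column into the prompt alongside the rule. The sketch below only builds the prompt text and does not call any service. The function name, the prompt wording and the column names are hypothetical, and we have not tested whether this particular wording improves accuracy.

```python
def build_prompt(gherkin_rule, context):
    """Combine a Gherkin rule with column-to-concept hints into one prompt string.

    `context` maps a column name to the business concept it belongs to.
    Both the mapping and the prompt wording are illustrative assumptions.
    """
    hints = "\n".join(
        f"- {column}: {concept}" for column, concept in sorted(context.items())
    )
    return (
        "Convert the following Gherkin rule into a SQL query.\n"
        "Business context for the columns:\n"
        f"{hints}\n"
        "Rule:\n"
        f"{gherkin_rule.strip()}\n"
    )


# Hypothetical example: the columns and concepts are invented for illustration.
prompt = build_prompt(
    "Given customer\nWhen last_order_date >= '2023-01-01'\nThen active_customer",
    {"customer_id": "Customer", "last_order_date": "Order"},
)
print(prompt)
```

The resulting text could then be sent to a language model, and the returned SQL checked against known-good answers as part of the accuracy testing.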
Also we would need to wait until OpenAI have disclosed their planned business model for the ChatGPT service.
One Step Closer
So one step closer to the nirvana of the magic data sorting hat, but no smoking cigar.
Keep making data simply magical
AgileData.io provides both a Software as a Service product and a recommended AgileData Way of Working. We believe you need both to deliver data in a simply magical way.
A Modern Data Stack underpins the AgileData.io solution.