Data observability is nothing new; it's a set of features every data platform should have to get the data jobs done.
Observability is crucial as you scale
Observability is very much on trend right now. It feels like every other influencer in the data space is talking about it on LinkedIn.
We don’t call it observability in the AgileData Product; it’s just something we intrinsically do to keep our customers’ data flowing without me (the Data Engineer) or my co-founder Shane (the Business Analyst) worrying about it.
As a bootstrapped startup, we don’t have the luxury of a large support team of Data Engineers to monitor the health of the 300+ data pipelines we currently run each day.
When we started out with a single customer, it was easy for me to monitor their daily ELTs manually … you know how it goes: run a few SQL scripts, check the logs occasionally, easy.
Then we got another customer, and another … pretty soon monitoring became a headache. It wasn’t a 5-minute daily check anymore; it became 30+ minutes pretty quickly.
And as soon as Shane started asking, “did x load this morning?” or “the volume of new orders for customer y doesn’t look right”, I fell into every Data Engineer’s nightmare … spending a large chunk of the day checking load pipelines, data volumes, data anomalies and data timeliness, instead of developing new features.
Observe what is important
Luckily, the things I worry about (or in this case, don’t want to worry about) nicely align with what end users generally expect:
- the data is refreshed on time (meets an SLA)
- the data volume is within expected boundaries (isn’t an anomaly)
- the attributes that are important to them are validated (e.g. every customer has a date of birth, and if they don’t, they get a notification)
We built the first two in as auto-magic features. For the third, users specify which attributes are important to them, and those attributes are monitored using a wrapper around the Dataplex Data Quality feature.
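To make the three expectations concrete, here is a minimal sketch of what each check looks like as code. The function names, record shapes and thresholds are my own illustrative assumptions, not the actual AgileData implementation.

```python
from datetime import datetime, time

# Illustrative sketch only -- names and shapes are assumptions,
# not the actual AgileData internals.

def within_sla(refreshed_at: datetime, window_start: time, window_end: time) -> bool:
    """Check 1: the table refreshed inside its expected window (meets the SLA)."""
    return window_start <= refreshed_at.time() <= window_end

def volume_in_bounds(row_count: int, lower: int, upper: int) -> bool:
    """Check 2: the load volume sits between the expected boundaries."""
    return lower <= row_count <= upper

def missing_required(records: list[dict], attribute: str) -> list[dict]:
    """Check 3: return records failing a user-specified attribute rule,
    e.g. every customer must have a date of birth."""
    return [r for r in records if not r.get(attribute)]
```

Anything returned by the third check is what would trigger a notification back to the user who cares about that attribute.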
Leveraging observability patterns
The magic. By running a simple BigQuery ML time series model over our event logs, we can work out the expected SLA and data volumes for every table: essentially, the 30-minute window each table should have refreshed by each day/week/month, and the lower and upper load volumes we expect.
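As a rough sketch of the idea, the same kind of expected window and volume bounds can be derived from recent load events with plain summary statistics. This is a stand-in for the BigQuery ML time series step, not the real model; the event shape and the choice of mean ± k·stddev are assumptions for illustration.

```python
import statistics

# Plain-Python stand-in for the BigQuery ML time-series step:
# derive an expected refresh window and load-volume bounds from
# recent load events. All names and thresholds are illustrative.

def expected_bounds(load_events: list[dict], k: float = 3.0) -> dict:
    """load_events: [{"finished_minute": int, "row_count": int}, ...]
    where finished_minute is minutes past midnight the load completed."""
    minutes = [e["finished_minute"] for e in load_events]
    rows = [e["row_count"] for e in load_events]
    mid = statistics.mean(minutes)
    row_mean, row_sd = statistics.mean(rows), statistics.pstdev(rows)
    return {
        # 30-minute window centred on the typical finish time
        "sla_window": (mid - 15, mid + 15),
        # lower and upper expected load volumes
        "volume_bounds": (row_mean - k * row_sd, row_mean + k * row_sd),
    }
```

A real time series model buys you seasonality (weekday vs weekend patterns, month-end spikes) that a flat mean-and-stddev approach misses, which is presumably why BigQuery ML does the heavy lifting here.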
If either of these two checks fails, we create a notification for further investigation.
The best thing about the above pattern is that I don’t have to do anything!
When a user onboards a new data source and creates their consume report, the first time data flows through the pipeline a default SLA and load boundary is created. Every subsequent load refines the SLA checkpoint and the load boundary further.
Visualise what’s important
Keeping it simple. All I want to know is how late a refresh is, and by what magnitude a load volume is under or over. As Shane likes to say, “just show me a happy or sad face”. That’s why, if all tables refresh within their SLA and no load anomalies are detected, we show a happy face, and if not …
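The happy-face rollup is about as simple as an aggregation gets: green only when every table met its SLA and no volume anomaly fired. A toy version, with field names that are my own invention:

```python
# Toy rollup of the "happy or sad face" dashboard idea.
# Field names are illustrative assumptions.

def dashboard_face(table_checks: list[dict]) -> str:
    ok = all(t["met_sla"] and not t["volume_anomaly"] for t in table_checks)
    return "🙂" if ok else "🙁"
```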
Drilling into a flagged anomaly (below), we can see the load history, timings and volumes for a table — for this example (GA4 sessions), two anomalies were flagged over the holiday period.
We can see at a glance that users weren’t online over the New Year’s period, hence sessions dropped below our lower expected boundary; then, when users returned to work a few days later, traffic spiked past the upper boundary as they caught up on browsing.
Next time I’ll delve into our data quality patterns (what we call trust rules).
Get more magical DataOps plumbing tips
Keep making data simply magical