ELT without persisted watermarks? Not a problem
TL;DR
Going back a few years, whenever I was building an ETL pipeline, whether hand-coding it in PL/SQL or using a drag-and-drop tool like [insert name of any large legacy database product here] Warehouse Builder, using watermarks was standard practice.
Plenty has been written on the subject, and it is still standard practice today for most tools and practitioners, so I won’t go into the how and why here. Have a read of the articles below if you want the detail:
- https://documentation.matillion.com/docs/2506598
- https://docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview
AgileData uses a serverless architecture for its ELT pipelines. This was a conscious choice to reduce cost and simplify every aspect of loading and transforming data … we like to say, Simply Magical Data.
Traditionally, every data warehouse would (and often still does) have something like a ‘control’ table or control schema in the database, containing job and control tables that kept track of what loaded when: job IDs, watermarks, success/failure flags, start and finish timestamps, etc.
This is a robust, proven pattern that has been used forever.
Full disclosure: I originally architected our ELT pattern using a traditional control table 😆 !!! But after a few months I realised it wasn’t going to deliver on our Simply Magical Data promise.
Why don’t we use this control table pattern for AgileData? Simply put, we don’t need to. More importantly, we’ve found that coming up with a new pattern actually allows analysts to be more agile and deliver new pipelines quicker and more safely. We’ve replaced the overhead of a control table, and of tracking state and watermarks, with a combination of the Cloud Logging and BigQuery APIs to look that information up in real time.
We no longer need to manually track the state of a table: when it was created, when it was updated, which data pipeline last touched it … all of these data points are available with a simple call to the Logging and BigQuery APIs. Under the covers, Google Cloud Platform is already tracking everything we need … every insert, update, delete, create, load, drop and alter is being captured.
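As a rough illustration (not our exact implementation), those table-level data points can be pulled with a couple of calls to the BigQuery and Cloud Logging client libraries. The project, dataset and table names below are placeholders, and the audit-log filter is an approximation of the legacy BigQuery audit log format:

```python
from google.cloud import bigquery
from google.cloud import logging as cloud_logging

bq = bigquery.Client()
log_client = cloud_logging.Client()

# Creation and last-modified timestamps straight from the BigQuery API
table = bq.get_table("my-project.my_dataset.my_target_table")  # placeholder table
print(table.created, table.modified)

# Most recent completed BigQuery job that wrote to the table, from the audit logs
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration'
    '.query.destinationTable.tableId="my_target_table"'
)
entries = log_client.list_entries(
    filter_=log_filter, order_by=cloud_logging.DESCENDING, max_results=1
)
for entry in entries:
    print(entry.timestamp, entry.log_name)
```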
Coming back to watermarks. Since ‘big data’, ‘data lakes’ and [insert your favourite industry buzzword here] appeared, the messaging has become …
dump all your data in the swamp, the magical platform will take care of it
That’s all very well, and yes, you can dump and query data without doing any due diligence or transformation, but this lazy pattern isn’t going to deliver a robust, transformed, consistent, performant and cost-effective data model for your users. For that we still need to apply some modelling and transformation, which generally means keeping track of what has been loaded when, and keeping cost and performance in check by only scanning and touching data that has actually changed.
The premise of AgileData is that users (analysts) don’t need a data engineer to safely onboard new data sources and set up simple natural language rules to transform and deliver consistent outputs. To make that possible, under the covers we needed to handle incremental updates, table partitioning, daily load windows and full reloads … all without the user knowing it was happening, while ALSO keeping costs down.
How do we do this? We use this simple logic to determine the window timestamp that is applied:
If ‘[source table] last loaded [target table]’ > ‘target created’
then use ‘last modified - 1 day’
else use ‘full extract low date’
If ‘pipeline rule last executed’ < ‘target created’
then use ‘full extract low date’
else use ‘last modified - 1 day’
Note — the data points for the source and target tables are pulled from the Logging and BigQuery APIs at runtime by the Cloud Function that starts the job.
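A minimal sketch of that window logic in Python, combining the two checks above into one function. The parameter names, the low-date sentinel value, and the assumption that ‘last modified’ means the target table’s last-modified timestamp are all ours, not lifted from the actual AgileData code:

```python
from datetime import datetime, timedelta

FULL_EXTRACT_LOW_DATE = datetime(1900, 1, 1)  # "load everything" sentinel


def window_timestamp(source_last_loaded_target: datetime,
                     target_created: datetime,
                     rule_last_executed: datetime,
                     target_last_modified: datetime) -> datetime:
    """Work out the watermark timestamp for a load/transform job."""
    # Target was dropped/recreated since the source last loaded it, or the
    # pipeline rule has never run against this version of the target:
    # fall back to a full reload.
    if (source_last_loaded_target <= target_created
            or rule_last_executed < target_created):
        return FULL_EXTRACT_LOW_DATE
    # Otherwise trail the last update by a day so new data slightly overlaps.
    return target_last_modified - timedelta(days=1)
```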
The timestamp this logic calculates determines the window of source data used in a load/transform job. The timestamp is included as a simple variable in the BigQuery job SQL …
WHERE effective_timestamp > {watermark_timestamp} ...
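For illustration, that substitution can be as simple as formatting the calculated timestamp into the job SQL before handing it to BigQuery. The project, dataset, table and column names below are placeholders:

```python
from datetime import datetime
from google.cloud import bigquery

bq = bigquery.Client()
watermark_timestamp = datetime(2024, 1, 1)  # example value from the window logic

sql = f"""
SELECT *
FROM `my-project.my_dataset.my_source_table`
WHERE effective_timestamp > TIMESTAMP('{watermark_timestamp.isoformat()}')
"""
rows = bq.query(sql).result()
```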
It is also used for selecting partitions to be replaced (we use partition replace a lot, as it’s a cost-effective and performant BigQuery strategy for inserting/updating large tables).
The best part about this logic is that if a table is dropped (accidentally or intentionally), or a changed pipeline rule requires a full reload of a target, the logic automatically sets the window timestamp back to a low date. Otherwise it works out a trailing timestamp (-1 day from last update) to slightly overlap the new data being processed.
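As a sketch of what a partition replace can look like with the BigQuery client library: write the query results to a partition decorator (`table$YYYYMMDD`) with WRITE_TRUNCATE, so only that partition is rewritten. The table names and dates are placeholders, and this is one way to do it rather than the exact AgileData job configuration:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Replace a single day's partition by targeting its partition decorator.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.my_target_table$20240101",  # placeholder
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
SELECT *
FROM `my-project.my_dataset.my_source_table`
WHERE DATE(effective_timestamp) = '2024-01-01'
"""
bq.query(sql, job_config=job_config).result()
```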
Next time, Google App Engine … and how we deploy Node.js and Python to on-demand servers that power the ‘pretty stuff’.
Keep making data simply magical