Using a manifest concept to run data pipelines
TL;DR
… you don’t always need to use DAGs to orchestrate
Previously we talked about how we use an ephemeral serverless architecture based on Google Cloud Functions and Google Pub/Sub messaging to run our customer data pipelines at agiledata.io.
In this post I’m going to dive into how we use the concept of a manifest to manage orchestration without an actual orchestration service.
I associate manifests with airlines: think of a list of passengers who are supposed to be on a particular flight to a destination, then pivot slightly to a list of jobs that need to run in a particular sequence to complete a data pipeline.
In our example, a manifest is just a list of BigQuery job IDs that we generate automatically when source data turns up, then pass between our cloud functions to keep track of what’s running, the individual job states, and late arriving passengers. Spoiler: we let them board 👍
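To make that a bit more concrete, here’s a minimal sketch of what one of our manifests might look like (the field names are illustrative assumptions, not our actual schema):

```python
# A minimal sketch of a manifest. Field names are illustrative assumptions.
manifest = {
    "manifest_id": "icealicious-orders-20230401-0932",  # hypothetical identifier
    "triggered_by": "icealicious_orders",                # the source object that kicked things off
    "jobs": [
        # one entry per BigQuery job in the run
        {"job_id": "bq_job_001", "target": "consume_orders",    "state": "RUNNING"},
        {"job_id": "bq_job_002", "target": "consume_customers", "state": "PENDING"},
    ],
    "manifest_completed": False,  # flips to True once every job reaches DONE
}
```

Because it’s just data, the manifest can be serialised and passed along in a Pub/Sub message from one cloud function to the next.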
Without a dedicated orchestration layer like Airflow, it can be quite tricky to build your own pipeline. There are a LOT of moving parts to consider!
Once you have multiple sources refreshing at different frequencies, multiple transformations for each source, multiple targets, and the need to keep track of job state and dependencies … it’s generally easier to just use an existing product. And … a complex DAG is a thing of beauty (at my last gig, we made the graph look like a rocket ship 🚀)
At its core, agiledata.io is a rules (config) driven platform: we use simple natural language rules so that non-technical users can define inputs, outputs and transformations.
We can visualise those rules by using a simple left to right lineage map that traces sources through to their targets. Selecting a source on the left, we can see the data pipelines that move data to the right (highlighted)…
We maintain all this config in a Cloud Spanner database. It’s the only service we pay for 24/7, but we think it’s totally worth it: Spanner is a great product, and the config is the heart of the platform. The rationale behind having one persistent product underpinning everything else is that when our App Engine services or cloud functions start up on demand, they need quick and easy access to config to be able to do stuff, hence Spanner. (I’ll come back in a future article and talk about our journey with Cloud Datastore.)
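As a rough sketch of what that config lookup could look like from a cloud function (the instance, database, table and column names are assumptions for illustration):

```python
# Sketch: reading pipeline config from Spanner on demand.
# Instance, database, table and column names are assumptions for illustration.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("agiledata-config").database("pipeline-config")

def load_rules_for_source(source_object: str):
    """Return the rules that use the given source object as an input."""
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT rule_id, target_object, rule_text "
            "FROM rules WHERE source_object = @source",
            params={"source": source_object},
            param_types={"source": spanner.param_types.STRING},
        )
        return list(rows)
```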
Data pipelines are triggered by data turning up in a bucket (bucket trigger), by a cloud schedule (time trigger), or by a user running a pipeline on demand from the GUI (pub-sub trigger).
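In Cloud Functions terms that boils down to a handful of small entry points. A hedged sketch (the function names, message shapes and file path convention are assumptions):

```python
# Sketch: the trigger entry points as background Cloud Functions (Python runtime).
# Function names, message shapes and the path convention are assumptions.
import base64
import json

def on_bucket_file(event, context):
    """Bucket trigger: fires when a source file lands in the landing bucket."""
    source_object = event["name"].split("/")[0]  # assume the path encodes the source object
    start_pipeline(source_object)

def on_pubsub_message(event, context):
    """Time trigger and GUI trigger: both arrive as Pub/Sub messages."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    start_pipeline(payload["source_object"])

def start_pipeline(source_object):
    ...  # build the manifest and start the BigQuery jobs (see the next sketch)
```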
The config is queried for that source object and all current upstream objects are returned as a run manifest — for example (from above):
If the Icealicious Orders object is triggered, a manifest with six entries is created, corresponding to six BigQuery jobs that are started. Each job has the complete manifest associated with it as an extra parameter.
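A sketch of that step, assuming two hypothetical helpers: upstream_objects_for() for the config lookup and sql_for() to render a rule into SQL:

```python
# Sketch: turn the config lookup into a manifest and start the BigQuery jobs.
# upstream_objects_for() and sql_for() are hypothetical helpers for illustration.
import uuid
from google.cloud import bigquery

bq = bigquery.Client()

def start_pipeline(source_object):
    manifest = {
        "manifest_id": str(uuid.uuid4()),
        "triggered_by": source_object,
        "jobs": [],
        "manifest_completed": False,
    }
    for target in upstream_objects_for(source_object):   # query the config
        job = bq.query(sql_for(source_object, target))   # async; returns a QueryJob
        manifest["jobs"].append(
            {"job_id": job.job_id, "target": target, "state": "RUNNING"}
        )
    return manifest
```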
As individual jobs complete, a ‘check job status’ function is called which re-checks the state of every job in the manifest.
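A sketch of that check, using the BigQuery client to refresh each job’s state (the manifest shape is the assumed one from the earlier sketch):

```python
# Sketch: the 'check job status' step, re-polling every job in the manifest.
from google.cloud import bigquery

bq = bigquery.Client()

def check_job_status(manifest):
    for entry in manifest["jobs"]:
        job = bq.get_job(entry["job_id"])  # refreshes state from the BigQuery API
        entry["state"] = job.state         # PENDING, RUNNING or DONE
    manifest["manifest_completed"] = all(
        entry["state"] == "DONE" for entry in manifest["jobs"]
    )
    return manifest
```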
Once all the jobs have completed for a particular manifest (Manifest Completed = True), the upstream object is triggered in exactly the same way: a new manifest is created and a list of jobs started.
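A sketch of that hand-off, publishing one Pub/Sub message per upstream object so the next layer starts exactly like the first (the project id, topic name and next_objects_for() lookup are assumptions):

```python
# Sketch: once a manifest completes, trigger the next layer via Pub/Sub.
# Project id, topic name and next_objects_for() are assumptions for illustration.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "pipeline-trigger")

def on_manifest_completed(manifest):
    for next_object in next_objects_for(manifest["triggered_by"]):  # config lookup
        message = json.dumps({"source_object": next_object}).encode("utf-8")
        publisher.publish(topic_path, message)
```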
There are two useful features we built in. One is optimisation: e.g. if a table is used in multiple upstream pipelines, we only need to process it once.
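In manifest terms that optimisation is just de-duplicating the job list before anything is started, along these lines:

```python
# Sketch: de-duplicate targets so a shared table only gets one BigQuery job per run.
def dedupe_targets(targets):
    seen, unique = set(), []
    for target in targets:
        if target not in seen:
            seen.add(target)
            unique.append(target)
    return unique
```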
The other is late arriving passengers. As I mentioned earlier, the manifest may change during execution: a target table (or rule) that didn’t exist when the pipeline started may turn up while a manifest is running because another process has created it. So … we built a ‘check for late arriving passengers’ 😉 into the pipeline, so that each time it stops to check run state it also has a quick check to see if anything has changed in the config and, if so, it can include the new object in the run.
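A sketch of that check, reusing the same hypothetical helpers as before:

```python
# Sketch: the 'check for late arriving passengers' step. Each time the pipeline
# pauses to check run state, it re-queries the config and boards any new targets.
# upstream_objects_for() and sql_for() are the same hypothetical helpers as above.
from google.cloud import bigquery

bq = bigquery.Client()

def check_for_late_arrivals(manifest):
    current_targets = {entry["target"] for entry in manifest["jobs"]}
    for target in upstream_objects_for(manifest["triggered_by"]):
        if target not in current_targets:  # a rule or table created mid-run
            job = bq.query(sql_for(manifest["triggered_by"], target))
            manifest["jobs"].append(
                {"job_id": job.job_id, "target": target, "state": "RUNNING"}
            )
    return manifest
```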
The previous behaviour was that new objects would only be picked up on the next execution of a pipeline.
Does it work at scale?
Yes! It actually does. We now have pipelines moving terabytes of data every day using this pattern, and they run flawlessly. We have pipelines with a mixture of large and tiny volumes which happily run together: the small ones finish quickly and just keep checking for the long-running ones to finish before moving on. We have manifests with two tables in them and manifests with 30 tables in them; it makes no difference.
It’s just a pattern (we love patterns!): data turns up, we check for upstream dependencies, we create a manifest, we start running jobs …
Next time, Google App Engine … and how we deploy NodeJS and Python to on-demand servers that power the ‘pretty stuff’
AgileData reduces complexity in a simply magical way.
We do this by applying automated patterns under the covers that remove the need for you to do that work manually.
Sometimes our patterns are sheer magic. Our dynamic manifests are one example of that magic.
Keep making data simply magical