“Serverless” Data Processing
Is serverless even a thing, and how do apples on a conveyor belt relate to data?
Serverless is actually a bit of a misnomer: there are still servers, but they're abstracted away so you don't lose any sleep worrying about them. They just work, and you only pay for the compute seconds you use.
This pattern suited us perfectly. We're a bootstrapped startup where monthly costs matter, and paying for resources we weren't using wasn't an option.
Now, the apples ….
Apples are a reasonable analogy for processing data. They're seasonal, they turn up at the processing factory one at a time or by the truckload, they need to be graded for anomalies, and they follow different processing pathways before they get to the end consumer.
Our data processing conveyor belt is a series of Cloud Functions: small Python functions, short runtime, and stateless. Here's a screenshot showing some of the worker functions. You'll notice they're all message triggered, and I'll explain why below.
As data (one record or many) lands at an ingestion point, the first cloud function picks it up, moves it along, then sends a message saying "I'm done, I've moved a batch of data … here's some information you will need to move this data further".
This message triggers the next cloud function. It takes the message payload, which contains some metadata about the data currently being processed, starts up, runs, sends a message when it's finished, and shuts down.
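The hand-off can be sketched locally like this. It's a minimal sketch, not our actual code: the `bus` list stands in for a Pub/Sub topic, the messages mimic the base64-encoded JSON that Pub/Sub-triggered Cloud Functions receive, and the topic and field names (`ingest`, `load-history`, `batch_id`) are illustrative only.

```python
import base64
import json

def publish(bus, topic, payload):
    """Stand-in for a Pub/Sub publish: wrap the metadata as a message."""
    bus.append((topic, {"data": base64.b64encode(json.dumps(payload).encode())}))

def ingest_handler(event, bus):
    """First function on the conveyor: pick up the batch, move it,
    then announce "I'm done" with the metadata the next step needs."""
    batch = json.loads(base64.b64decode(event["data"]))
    # ... move the raw data to the next storage layer here ...
    publish(bus, "load-history", {"batch_id": batch["batch_id"],
                                  "table": batch["table"]})

bus = []
publish(bus, "ingest", {"batch_id": "b-001", "table": "orders"})
topic, event = bus.pop(0)
ingest_handler(event, bus)
print(bus[0][0])  # → load-history
```

Because each handler only reads its message and writes one onward, any function can be redeployed or scaled independently without the others knowing.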
This hand-off from function to function continues until the data has moved all the way along the conveyor to the consume layer in the customer's reporting tool… back to the apples analogy: delivered to the customer, ready to eat!
What!? That's too simplistic! Cloud functions only run for a few minutes and have limited memory, so how can you process terabytes of data and handle long-running data processing steps!?
Remember, the cloud functions are just the conveyor; we can use other serverless components to actually process the data. In this case we hand off all the hard work to BigQuery, a petabyte-scale database with a HUGE data processing capability.
The cloud function calls the BigQuery API and passes it the information to run a job (usually SQL). The function doesn't need to wait for the job to finish, because that could take an hour. Instead we start the job, send a message with the job id to our 'check job status' function, and shut down the original cloud function.
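A fire-and-forget job start might look something like this. The fake client here is a stand-in for `google.cloud.bigquery.Client`, whose `query()` call likewise returns a job handle immediately rather than waiting for the result; the topic name and SQL are illustrative, not our production code.

```python
class FakeJob:
    """Stand-in for a BigQuery job handle: we only need its id."""
    def __init__(self, job_id):
        self.job_id = job_id

class FakeBigQueryClient:
    """Stand-in for google.cloud.bigquery.Client."""
    def query(self, sql):
        # submits the job and returns straight away
        return FakeJob(job_id="job-abc123")

def run_transform(client, sql, send_message):
    job = client.query(sql)  # the job starts running in BigQuery
    # don't wait for it; pass the job id along and let this function shut down
    send_message("check-job-status", {"job_id": job.job_id})

sent = []
run_transform(FakeBigQueryClient(),
              "SELECT order_id, SUM(qty) FROM orders GROUP BY order_id",
              lambda topic, payload: sent.append((topic, payload)))
print(sent)  # → [('check-job-status', {'job_id': 'job-abc123'})]
```

The key design choice is that the function's lifetime is decoupled from the job's: the function bills for seconds, while the job can churn for an hour inside BigQuery.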
The check job status function keeps checking every minute (by sending itself a message) until the job completes, then instructs the next function to pick up the processed data and move it further along the conveyor.
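The polling hand-off can be sketched as below. BigQuery jobs really do report `PENDING`/`RUNNING`/`DONE` states and an `error_result` on failure, but `get_job` here is a stand-in for the jobs API lookup, the re-queued message would be delayed in the real system, and the topic names are illustrative.

```python
def check_job_status(payload, get_job, send_message):
    """Look up the job's state, then either re-queue ourselves or
    wake the next worker on the conveyor."""
    job = get_job(payload["job_id"])
    if job["state"] != "DONE":
        send_message("check-job-status", payload)   # try again later
    elif job.get("error_result"):
        send_message("job-failed", payload)         # stop the conveyor
    else:
        send_message("next-step", payload)          # move the data along

# simulate two polls: first the job is still running, then it's done
states = iter([{"state": "RUNNING"}, {"state": "DONE"}])
sent = []
msg = {"job_id": "job-abc123"}
check_job_status(msg, lambda _id: next(states), lambda t, p: sent.append(t))
check_job_status(msg, lambda _id: next(states), lambda t, p: sent.append(t))
print(sent)  # → ['check-job-status', 'next-step']
```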
Using this pattern we can start any number of conveyors simultaneously: some run fast and only process a few records, some run slow, processing terabytes of data; some are started by batched data arriving, some by streaming, and some on a schedule.
As an engineer I don't have to worry about when, or how much, data turns up, because a) the functions autoscale as required (i.e. start multiple instances), and b) they talk to each other over a guaranteed messaging infrastructure, so I know messages won't get lost mid-process.
…. next time I'll talk about "manifests", a pattern which lets us automagically determine the order of data processing, and how it can change on the fly.
One of the AgileData architecture principles is Serverless by default.
This allows us to deliver our platform and our product at a price that will amaze you.
Keep making data simply magical