“Serverless” Data Processing



When we dreamed up AgileData and started white-boarding ideas around architecture, one of the patterns we were adamant we would leverage was Serverless.

This post explains why we were adamant and what that has to do with apples.

Nigel Vining - AgileData.io

When we were laying the groundwork for AgileData, brainstorming architectural designs, we were dead set on one thing – Serverless had to be a key component.

Yes, Serverless may sound like a fantasy, but it’s not. There are indeed servers, but they are cleverly abstracted away so you won’t lose a wink of sleep over them. The beauty is they just work, and you only pay for the computing time you consume.

This model was a perfect fit for us as a bootstrapped startup, where every penny counts and squandering resources was never an option.

Imagine data processing like an assembly line for apples. They’re harvested seasonally, arriving at the processing factory individually or by the truckload. They’re inspected for irregularities, and each follows a unique route before reaching the end consumer.

Our data processing assembly line consists of a series of Cloud Functions – petite Python functions with short runtimes and no state. As you can see from the screenshot below, all these worker functions are message-triggered. 


When data (a single item or a batch) lands at an ingestion point, the first cloud function grabs it, processes it, and sends a message saying, “Job’s done. I’ve processed a batch of data. Here’s some info you’ll need to continue the journey.”

This message sets off the next cloud function. It absorbs the message payload containing metadata about the currently processed data, gets to work, and upon completion, sends a message and powers down.
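A minimal sketch of what one of these message-triggered workers might look like. The topic name, payload fields, and `publish` helper here are illustrative stand-ins, not our actual code; in production the publish call would go through the Pub/Sub client, and here it simply records messages so the sketch is self-contained:

```python
import base64
import json

# Stand-in for a Pub/Sub publisher: records (topic, message) pairs
# so the sketch runs anywhere without cloud credentials.
published = []

def publish(topic, payload):
    published.append((topic, json.dumps(payload)))

def worker(event, context=None):
    """A message-triggered Cloud Function: decode the incoming message,
    do a unit of work, signal the next function, then power down."""
    # Pub/Sub delivers the payload base64-encoded under event["data"]
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # ... process the batch described by the payload's metadata ...
    result = {"batch_id": payload["batch_id"], "status": "processed"}

    # "Job's done" -- hand the baton to the next function in the line
    publish("next-step-topic", result)
```

The function holds no state of its own: everything the next step needs travels inside the message payload, which is what lets instances scale out and shut down freely.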

This relay continues, function passing the baton to function, until the data traverses the entire assembly line and lands in the customer’s reporting tool. It’s like delivering the apple, ready to be enjoyed!

Now, you might be thinking, “Wait a minute! Cloud functions have short runtimes and limited memory. How on earth can they process terabytes of data or handle lengthy data processing tasks?”

Well, remember, cloud functions are only the conveyor belt. We can employ other serverless components to do the heavy lifting. In our case, we delegate the hard work to BigQuery, a petabyte-scale database with immense data processing capacity.

A cloud function contacts the BigQuery API, hands it the instructions to run a job (usually SQL), and doesn’t stick around for the result because that could take a while. Instead, we kick off the job, send a message with the job ID to our ‘check job status’ function, and shut down the original cloud function.
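The fire-and-forget handoff might be sketched like this. The client and publisher are injected as stand-ins so the sketch stays self-contained; in production the client would be a `google.cloud.bigquery.Client`, whose `query()` call returns a job handle without waiting for the result:

```python
import json

def start_job(bq_client, publish, sql):
    """Fire-and-forget: start a BigQuery job, message its ID to the
    'check job status' function, then return so the function instance
    can shut down without waiting for the result."""
    job = bq_client.query(sql)  # returns immediately with a job handle
    publish("check-job-status", json.dumps({"job_id": job.job_id}))
    # deliberately no job.result() here -- the poller takes it from this point
    return job.job_id
```

Because the function never blocks on the query, its runtime stays well inside Cloud Functions limits no matter how long the SQL takes.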

The ‘check job status’ function keeps checking every minute (by messaging itself) until the job’s done, then it signals the next function to pick up the processed data and continue its journey along the assembly line.
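The self-messaging poller reduces to a simple branch: if the job is finished, pass the baton; otherwise, message yourself and try again on the next delivery. Again, the client and publisher here are illustrative stand-ins (in production the state check would go through the BigQuery client's `get_job()`, and the roughly one-minute cadence comes from the messaging infrastructure, not the code):

```python
def check_job_status(bq_client, publish, payload):
    """Poll a BigQuery job by re-publishing to our own topic until it
    finishes, then signal the next function on the assembly line."""
    job = bq_client.get_job(payload["job_id"])
    if job.state == "DONE":
        publish("next-step-topic", payload)   # baton passes on
    else:
        publish("check-job-status", payload)  # message ourselves again
```

Each invocation is short-lived and stateless; the "loop" lives in the message queue rather than in any single function instance.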

With this pattern, we can launch any number of assembly lines simultaneously. Some run fast, processing only a few records; some move slowly, handling terabytes of data. Some are triggered by batch data arrivals, others by streaming, and some run on a schedule.

As an engineer, I don’t have to fret about when and how much data shows up because a) the functions autoscale as needed, starting multiple instances, and b) they communicate using a reliable messaging infrastructure, ensuring nothing gets lost mid-process.

Next time, we’ll dive into “manifests”, a nifty pattern that allows us to automatically determine the order of data processing and adjust it on the fly. Stay tuned!

Get more magical DataOps plumbing tips

Keep making data simply magical