Myth: using the cloud for your data warehouse is expensive

30 Jan 2023 | AgileData DataOps, Blog, Google Cloud


Cloud Data Platforms promise the magic of storing your data and running unlimited elastic compute for cents. Is it too good to be true? Yes AND No.

You can run a cloud platform for a very low cost, but it will take engineering effort to get there.

Nigel Vining


True AND False, it depends …..

You don’t have to look very far to find stories from people who have experienced an unexpectedly large bill at the end of the month after trying cloud services for the first time. All those tiny $0.001 charges for services look amazing…

“…look at how much this service costs! We currently pay $10k/month to store our data on our legacy platform; we can do it for $0.001/GB in the cloud…”

… but all those per-second / per-minute / per-month / per-gigabyte / per-service charges all add up!

If it's running, it's costing you moolah

The small problem when you start clicking around and spinning up some of these services is that they don't shut down unless you remember to shut them down, or you get your data engineers to set up processes to start and stop them as required. That's how, at the end of the month, you get a $1,000 charge for that super-fast virtual machine you started up for an hour to crunch some data, then forgot to shut down.
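The arithmetic behind that surprise bill is simple. A minimal sketch, using an illustrative hourly rate (an assumed round number, not an actual quoted cloud price):

```python
# Sketch: how a "started for an hour, forgotten for a month" VM adds up.
# The hourly rate is illustrative only, not an actual cloud provider price.
HOURLY_RATE = 1.35          # $/hour for a large virtual machine (assumed)
HOURS_PLANNED = 1           # what you meant to run it for
HOURS_IN_MONTH = 30 * 24    # what it actually ran

planned_cost = HOURLY_RATE * HOURS_PLANNED
actual_cost = HOURLY_RATE * HOURS_IN_MONTH

print(f"Planned: ${planned_cost:.2f}")   # $1.35
print(f"Actual:  ${actual_cost:.2f}")    # $972.00
```

One hour becomes 720 hours, and a coffee-money experiment becomes most of a $1,000 line item.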

We focus on the pennies, every penny counts

We often get asked: how is your monthly subscription pricing so low?

If you’ve been following our AgileData journey you will know we leverage the Google Cloud platform for our DAaaS (Data Architecture as a Service), and in particular Google's serverless components, to dramatically reduce the cost of running a full-featured modern data stack for our customers.

By using serverless components that we only pay for, per second, when they are in use, instead of running dedicated compute resources that are charged 24/7, we can bring our pipeline extract/load/transform costs way down.

Couple that with BigQuery, which separates compute and storage costs — a game changer! Data storage is cheap; massively parallel processing compute is expensive … but not if you're only paying for it when you use it.

Patterns that are optimised to reduce costs

By customising our ELT patterns and individual table structures based on good practice patterns and source data characteristics, then carefully modelling the consume layer data (what the users query), we can make good use of BigQuery's columnar read and write architecture to minimise costs and improve query performance.
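The columnar point matters because on-demand pricing is driven by bytes scanned, and a columnar engine only reads the columns a query actually references. A toy estimate, with a hypothetical table and assumed column widths:

```python
# Toy estimate of bytes scanned by a columnar engine.
# Hypothetical table: column name -> average bytes per value (assumed).
columns = {"customer_id": 8, "event_ts": 8, "payload_json": 500, "status": 10}
rows = 100_000_000  # 100M rows

def scanned_gb(selected):
    """Bytes scanned = rows x combined width of only the referenced columns."""
    width = sum(columns[c] for c in selected)
    return rows * width / 1024**3

select_star = scanned_gb(columns)               # SELECT *  reads every column
narrow = scanned_gb(["customer_id", "status"])  # reads just the two it needs

print(f"SELECT *    scans ~{select_star:.1f} GB")
print(f"Two columns scan  ~{narrow:.1f} GB")
```

Keeping wide, rarely-used columns out of the consume layer (and out of queries) cuts the scanned bytes, and therefore the bill, by an order of magnitude in this sketch.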

Focus on the exceptions

There are only two services we pay for 24/7: a Google Cloud Spanner instance for our configuration database, and a single Google App Engine flex instance that we use as our WebSocket notification server.

Everything else is shut down (scaled to zero instances) when not in use, so it incurs no cost while our customers are not collecting, changing or querying data.

It just makes sense = good practice

That makes sense, doesn't it? Why would you run a dedicated Airflow server 24/7 when you only run your jobs for a few hours each day? Or, why would you pay 24/7 for a database when you only load and query it for a fraction of the day?
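The arithmetic behind that rhetorical question, with an assumed round-number rate (illustrative, not a quoted price):

```python
# Assumed round-number rate for a small dedicated orchestration server.
HOURLY_RATE = 0.10            # $/hour (illustrative only)
hours_in_month = 30 * 24

always_on = HOURLY_RATE * hours_in_month    # server runs 24/7
pay_per_use = HOURLY_RATE * (2 * 30)        # jobs only run 2 hours/day

print(f"Always-on:   ${always_on:.2f}/month")    # $72.00/month
print(f"Pay-per-use: ${pay_per_use:.2f}/month")  # $6.00/month
```

Same workload, same rate: paying only for the two hours a day the jobs actually run is roughly a twelfth of the always-on cost, and the gap widens as the idle hours grow.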

We think cloud services, serverless ones in particular, are awesome, and our customers definitely appreciate the lower monthly subscription costs.

AgileData reduces complexity in a simply magical way.

We do this by applying automated patterns under the covers that removes the need for you to do that work manually.

Sometimes our patterns are sheer magic. Running a data platform for cents, not thousands of dollars, is one of those magic tricks.

Keep making data simply magical