Buy, Build or Lease

22 May 2020 | AgileData Journey, Blog, Product Management

One of the (many) things we needed to decide when we started to build out the AgileData Minimal Magical Product (MMP) was which capabilities we would build vs which capabilities we would lease or buy.

As part of our AgileData Way of working we have defined a principle of lease first.

Our definition of lease means we are using a serverless solution, where we do not need to maintain the infrastructure or software, and we only pay for the capacity we actually use.

Our definition of buy is where money is required upfront, or there is a fixed cost to use the solution that is charged whether we use it or not.

Our definition of build is where we put in the majority of the effort to create the capability, rather than spending our money on something that has already been created and spending our time integrating it.

Containers

We made a decision that any technology that required us to deploy and manage it using a “container” would be classed as buy rather than lease. We would either have to run the containers full time or build the DataOps processes to stop and start them. With serverless we get this stop / start, pay as you use pattern as part of the service, with no effort required on our side.

Open Source

Open source technologies provided an interesting challenge when trying to decide if we should categorise them as a buy or a build. We made a decision that any open source technology that required us to create and manage the “container” templates ourselves would be classed as build rather than buy.

Infrastructure

The infrastructure decision was a no brainer: we were never going to build and maintain our own hardware, so choosing one of the large cloud providers was a given. We have already written about this decision in Why we chose Google Cloud as the infrastructure platform for AgileData.

We have a serverless first principle in our architecture, so our infrastructure approach is lease.

Data Storage

Again a no brainer for us: we use a number of Google Cloud serverless storage options (Google Cloud Storage, Spanner, BigQuery), so this one is lease.

Combining and changing data

We had a vision of “magic happens here” for combining data, so we knew we would need to build those capabilities ourselves rather than buying an orchestration, data pipeline, ETL tool or service. So this is definitely build.

Profiling data

We are leveraging a number of open source patterns that we have found to profile data, so this is a build for us. We are running these under a serverless pattern, leveraging Google Cloud Functions rather than containers.
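To make the pattern concrete, here is a rough sketch of what a small, HTTP-triggered Cloud Function that profiles a BigQuery table could look like. This is an illustration only, not our production profiling code, and the project, dataset and table names are hypothetical.

```python
# A minimal sketch of an HTTP-triggered Cloud Function that profiles a
# BigQuery table. The default table reference below is hypothetical.
import functions_framework
from google.cloud import bigquery


@functions_framework.http
def profile_table(request):
    """Return simple row, null and distinct counts for each column."""
    params = request.get_json(silent=True) or {}
    table_ref = params.get("table", "my-project.my_dataset.my_table")

    client = bigquery.Client()
    table = client.get_table(table_ref)

    metrics = {}
    for field in table.schema:
        # One query per column keeps the sketch simple; a real profiler
        # would batch these checks into a single statement.
        query = f"""
            SELECT
              COUNT(*) AS row_count,
              COUNTIF({field.name} IS NULL) AS null_count,
              COUNT(DISTINCT {field.name}) AS distinct_count
            FROM `{table_ref}`
        """
        row = list(client.query(query).result())[0]
        metrics[field.name] = dict(row.items())

    return metrics
```

Because the function only runs (and only costs us anything) when it is invoked, it fits the serverless, pay as you use pattern described above.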

Trust Rules

We are leveraging a number of open source patterns to define our trust rules, but we use Google Cloud Dataplex to execute them, so this is a lease for us.
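To show what we mean by a trust rule, here is a hedged example of the kind of check involved, written as a plain SQL test run via the BigQuery client rather than through Dataplex itself; the project, table and column names are made up.

```python
# A hedged sketch of a simple trust rule: every customer record must have
# a customer_key. The table and column names are hypothetical; in practice
# we hand rules like this to Dataplex to execute.
from google.cloud import bigquery

client = bigquery.Client()

rule_sql = """
    SELECT COUNT(*) AS failing_rows
    FROM `my-project.my_dataset.customer`
    WHERE customer_key IS NULL
"""

failing_rows = list(client.query(rule_sql).result())[0]["failing_rows"]
if failing_rows > 0:
    print(f"Trust rule failed: {failing_rows} rows with a null customer_key")
else:
    print("Trust rule passed")
```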

Catalog

We looked at leveraging the Google Cloud Data Catalog service (which would have been a lease), but we found it functionally immature for what we need, and we believe the Catalog user experience is a key thing for us to nail, so this one is a build.

Documentation

We leverage a combo of rst, Sphinx, Google Cloud Source Repositories and Google Cloud Storage for this, so given our open source definition this is a build. We attempted to deploy it using a serverless pattern but got stuck at the last hurdle, so we use a deploy and destroy container pattern.
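For context, the publish step is conceptually just “build the Sphinx HTML, then copy it to a bucket”. A hedged sketch of that copy step, with a made-up bucket name and build directory, looks something like this:

```python
# A minimal sketch of publishing Sphinx-built HTML to Cloud Storage.
# The bucket name and build directory are hypothetical.
from pathlib import Path
from google.cloud import storage

BUILD_DIR = Path("docs/_build/html")
BUCKET_NAME = "my-docs-bucket"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for path in BUILD_DIR.rglob("*"):
    if path.is_file():
        # Preserve the relative path so links between pages keep working.
        blob = bucket.blob(str(path.relative_to(BUILD_DIR)))
        blob.upload_from_filename(str(path))
        print(f"Uploaded {path}")
```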

Managing things

We use a combination of Google Cloud services where they fit our needs (so lease) and build the rest of what we need around those services.

For things like orchestration we looked at a number of open source tools, such as Airflow, and the Google Cloud flavours of these, such as Cloud Composer. But we believe the “flow” based orchestration patterns are a legacy pattern and we wanted to implement more of a “messaging” based pattern. So we have gone with a build, leveraging as many Google Cloud serverless (lease) capabilities as we can. One of the unintended benefits of this pattern has been the cost savings compared to the container based patterns.
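To illustrate the difference, in a messaging pattern each step is a small Cloud Function that does its work when an event arrives and then publishes an “I’m done” event, rather than a central flow engine calling each step in turn. The sketch below shows the shape of one such step; the project, topic and step names are hypothetical, not our production code.

```python
# A hedged sketch of a messaging-based orchestration step: a Cloud Function
# triggered by a Pub/Sub event that does its work, then publishes a
# completion event for whichever step wants to react next.
# The project, topic and step names are hypothetical.
import base64
import json

import functions_framework
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
DONE_TOPIC = publisher.topic_path("my-project", "step-completed")


@functions_framework.cloud_event
def handle_step(cloud_event):
    """React to an incoming event, do the work, announce completion."""
    payload = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))

    # ... do the actual data work for this step here ...

    done_event = {"step": "combine-customer-data", "triggered_by": payload}
    future = publisher.publish(DONE_TOPIC, json.dumps(done_event).encode("utf-8"))
    future.result()  # Block until the event is actually published.
```

Because the functions and Pub/Sub topics are serverless, nothing sits idle between steps, which is where the cost benefit over an always-on container comes from.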

Collecting Data

Deciding on our approach for collecting data was a little more difficult.

We knew from our years of experience that, given the wide range of source systems out there, we could spend all our time building out the different collection patterns needed: the patterns for on-prem databases vs flat files vs SaaS applications, and the patterns to deal with full extracts vs deltas and change data. Then add in the unique variation of API each SaaS application seems to have, plus the outlier data (embedded unicode in note fields, anyone?), and we could spend years dealing with those.

We looked at the big ugly buy options such as Attunity Replicate, Oracle GoldenGate, etc. We have worked with those products before and knew buying one of them and then “cloud washing” it was not a viable option.

We looked at other services that provided this capability as a service, for example StitchData, Fivetran and Boomi. They provided a raft of data collection adapters to deal with the infinite variations in SaaS applications and also provided their capabilities “as a service”, providing a lease option.

While they would quickly solve our technical challenge, we were worried about some of the other implications that would come with the decision to lease one of these options:

  • We would have to ask our customers to sign the provider’s legal terms, which introduced some complexity for the customer;
  • We had no real visibility of where our customers’ data would be located and potentially persisted, increasing the threat vector for this data;
  • When we scaled our customers and their data, our costs would scale linearly, and we wouldn’t be able to iterate and adapt our data collection pattern to optimise these costs;
  • There was always a risk that the provider would sell out and a core capability in our solution would be at risk.

So for these reasons we decided not to lease.

We were lucky enough to find some open source patterns we could leverage that reduced the effort required to build out the data collection capability. Like our documentation capability, we ended up with a deploy and destroy pattern for these, so given our buy vs build vs lease and open source definitions we are currently using a build pattern for data collection.
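As a rough illustration of the shape of one of these collectors (the API endpoint, token, bucket and object names below are all hypothetical, not one of our real adapters): pull records from a SaaS API, land them in Cloud Storage, and then let the container be destroyed.

```python
# A hedged sketch of a minimal data collection step: pull records from a
# SaaS API and land the raw payload in Cloud Storage. The API URL, token,
# bucket and object names are all hypothetical.
import datetime
import json

import requests
from google.cloud import storage

API_URL = "https://api.example-saas.com/v1/customers"
API_TOKEN = "replace-me"
LANDING_BUCKET = "my-landing-bucket"

response = requests.get(API_URL, headers={"Authorization": f"Bearer {API_TOKEN}"})
response.raise_for_status()
records = response.json()

# Land the raw payload, timestamped, so downstream steps can pick it up.
object_name = f"customers/{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json"
storage.Client().bucket(LANDING_BUCKET).blob(object_name).upload_from_string(
    json.dumps(records), content_type="application/json"
)
print(f"Landed {len(records)} records to gs://{LANDING_BUCKET}/{object_name}")
```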

Leveraging our agile approach, we are only adding adapters for new source applications as a customer needs them. In the future we are hoping Google’s acquisitions, such as Alooma and Looker, or a future acquisition, will provide data collection as a Google Cloud service, and then we can refactor our capability to move to a lease pattern.