One of the (many) things we needed to decide when we started to build out the AgileData.io MVP, was which capabilities we would build vs which capabilities we would buy or lease.
We have a defined principle of lease first. Our definition of lease means we are using a serverless solution, where we do not need to maintain the infrastructure or software, and we only pay for the capacity we actually use.
Our definition of buy is where money is required upfront, or there is a fixed cost to use the solution, and this is charged whether we use it or not.
Open source technologies presented an interesting challenge when trying to decide whether we should categorise them as a buy or a build. We decided that any technology that required us to deploy and manage it using a “container” would be classed as buying rather than building.
The infrastructure decision was a no-brainer; we were never going to build and maintain our own hardware, so choosing one of the large cloud providers was a given. As I have already written about in Why we chose Google Cloud as the infrastructure platform for AgileData.io, we chose GCP.
We have a serverless-first principle in our GCP architecture, so I would categorise our infrastructure approach as lease.
Again a no-brainer for us; we use a number of GCP serverless storage options, so this one is a lease.
Combining and changing data
We had a vision of “magic happens here” for combining data, so we knew we would need to build those capabilities ourselves rather than buying an orchestration, data pipeline or ETL tool or service. So this is definitely a build.
We are leveraging a number of open source patterns that we have found, but we are running these under a serverless architecture, rather than in a container, so this is a build for us.
We looked at leveraging the GCP Data Catalog service (so would have been lease), but we found it functionally immature for what we need, and we believe the Catalog user experience is a key thing for us to nail, so this one is a build.
We leverage a combo of rst, Sphinx, GCP Repo and GCS for this. We attempted to deploy using a serverless pattern but got stuck at the last hurdle, so we use a deploy and destroy container pattern for this; given our definition it is a buy (but really bordering on a build).
We use a combination of GCP services where they fit our need (so lease) and build the rest of what we need around those services.
For things like orchestration we looked at a number of open source tools such as Airflow and the GCP flavours of these, such as Cloud Composer. But we believe the “flow” based orchestration patterns are a legacy pattern, and we wanted to implement more of a “messaging” based pattern. So we have gone with a build, leveraging as many GCP serverless (lease) capabilities as we can. One of the unintended benefits of this pattern has been the cost savings compared to the container-based patterns.
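To make the “flow” vs “messaging” distinction concrete, here is a minimal sketch of the messaging idea. This is not our actual implementation: it uses an in-memory topic standing in for a serverless message bus such as Pub/Sub, and all event names and payloads are illustrative.

```python
from collections import defaultdict

class Topic:
    """In-memory stand-in for a serverless message bus (e.g. Pub/Sub)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event, handler):
        self.subscribers[event].append(handler)

    def publish(self, event, payload):
        # Each subscriber reacts independently; there is no central
        # "flow" graph wiring the steps together.
        for handler in self.subscribers[event]:
            handler(payload)

bus = Topic()
results = []

# Steps react to events rather than being scheduled in a DAG:
# when a file lands, validation runs; when data is validated, it loads.
bus.subscribe("file.landed",
              lambda p: bus.publish("data.validated", {**p, "valid": True}))
bus.subscribe("data.validated", lambda p: results.append(p))

bus.publish("file.landed", {"table": "customers"})
```

The design benefit is that new steps subscribe to events without touching any existing step, whereas a flow-based DAG has to be re-wired every time a step is added.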
Deciding on our approach for collecting data was a little more difficult.
We knew from our years of experience that, given the wide range of source systems out there, we could spend all our time building out the different collection patterns needed: the patterns for on-prem databases vs flat files vs SaaS applications, and the patterns to deal with full extracts vs deltas vs change data. Add in the unique variation of API each SaaS application seems to have, and the outlier data (embedded unicode in note fields, anyone?), and we could spend years dealing with those.
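To illustrate just one of those pattern differences, here is a hedged sketch of full extract vs watermark-based delta collection. The table, column names and in-memory rows are all illustrative stand-ins for a real source system.

```python
from datetime import datetime

# Illustrative stand-in for a source table; real sources vary wildly
# (on-prem databases, flat files, SaaS APIs).
source_rows = [
    {"id": 1, "updated_at": datetime(2020, 1, 1)},
    {"id": 2, "updated_at": datetime(2020, 2, 1)},
    {"id": 3, "updated_at": datetime(2020, 3, 1)},
]

def full_extract(rows):
    """Full extract: collect every row on every run."""
    return list(rows)

def delta_extract(rows, watermark):
    """Delta: collect only rows changed since the last run's watermark."""
    return [r for r in rows if r["updated_at"] > watermark]

last_run = datetime(2020, 1, 15)
changed = delta_extract(source_rows, last_run)
```

Multiply this one choice by change-data capture, API pagination quirks and per-source authentication, and the combinatorial effort becomes clear.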
We looked at the big ugly buy options such as Attunity Replicate, Oracle Golden Gate etc. We have worked with those products before and knew buying one of them and then “cloud washing” it was not a viable option.
We looked at other services that provided this capability as a service, for example StitchData, Fivetran and Boomi. They provided a raft of data collection adapters to deal with the infinite variations in SaaS applications and also provided their capabilities “as a service”, providing a lease option.
While they would quickly solve our technical challenge, we were worried about some of the other implications that would come with the decision to lease one of these options:
- We would have to ask our customers to sign the provider’s legal terms, which would introduce complexity for the customer;
- We had no real visibility of where our customers’ data would be located and potentially persisted, increasing the threat vector for this data;
- When we scaled our customers and their data, our costs would scale linearly, and we wouldn’t be able to iterate and adapt our data collection pattern to optimise these costs;
- There was always a risk that the provider would sell out and a core capability in our solution would be at risk.
So for these reasons we decided not to lease.
We were lucky enough to find some open source patterns we could leverage that reduced the effort required to build out the data collection capability. Again, like our documentation capability, we ended up with a deploy and destroy pattern for these, so given our buy vs build vs lease definitions we are currently using a buy pattern for data collection.
Leveraging our agile approach, we are only adding adapters for new source applications as a customer needs them. In the future we are hoping that Google’s acquisitions, such as Alooma and Looker, or a future acquisition, will provide data collection as a GCP service, and then we can refactor our capability to move to a lease pattern.