The challenge of parsing files from the wild
In this post we discuss how we handle the challenges of parsing files from the wild.
We liken our data processing to a water system where each file is a new source of water.
To ensure clean and well-structured data, each file goes through several checks and processes, similar to a water treatment plant.
These steps include checking for previously seen files, looking for matching schema files, queuing the file, and parsing it.
If a file fails to load, we have procedures in place to either retry the load or raise an error notification for later resolution.
This rigorous data processing ensures smooth and efficient data flow, akin to well-managed water in a plumbing system.
Setting the Pipes: Why We Need to Tame the Files from the Wild
Imagine, if you will, a world of plumbing where the water rushing through the pipes isn’t just H2O, but data.
We have our own set of pipes, valves, and filters designed to handle the flow of data instead of water. And just like the plumbing in your home, our data plumbing system faces similar challenges: handling the ebb and flow of data, managing the pressure, and ensuring the right data gets to the right places without causing any leaks or floods.
One of the many challenges is parsing files from the wild, akin to dealing with an unpredictable water source.
Navigating the Pipeline: Our Approach to Handling Wild Files
Just as a plumber checks the water source and adjusts the plumbing system accordingly, we conduct several checks and processes when a new file lands in AgileData.
First off, we check if we have seen this file before. It’s like testing the water to see if it’s drinkable. If we have seen the file and it has been successfully loaded in the last 5 minutes, we reject it. This is a safety valve, mainly to prevent a stuck filedrop loop, where a customer might keep dropping the same file continuously. A loop like this generates lots of notifications for us, but rejecting the repeats ensures it causes no upstream issues.
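A minimal sketch of this safety valve in Python might look like the following. It assumes a file is identified by its name and that the time of each file's last successful load is kept in a simple dictionary; both the `should_reject` function and that lookup structure are hypothetical, chosen only to illustrate the 5 minute rule.

```python
from datetime import datetime, timedelta, timezone

# The 5 minute window described above.
REJECT_WINDOW = timedelta(minutes=5)


def should_reject(file_name, last_success, now):
    """Return True if file_name was successfully loaded within the window.

    last_success is a hypothetical dict mapping a file name to the
    datetime of its most recent successful load.
    """
    loaded_at = last_success.get(file_name)
    if loaded_at is None:
        # Never loaded successfully before, so let it through.
        return False
    return now - loaded_at < REJECT_WINDOW


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
history = {"sales.csv": now - timedelta(minutes=2)}
print(should_reject("sales.csv", history, now))   # loaded 2 minutes ago
print(should_reject("orders.csv", history, now))  # never seen before
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the rule easy to test.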
Next up, we look for a matching schema file, much like checking the pH level of the water. If the file has been loaded successfully before, we will have a persisted schema in the filedrop bucket, which we use to load the new data.
Then, we queue the file, creating a queue entry for when it arrived and where it’s going to be loaded to, and move it into a separate bucket until we are ready to de-queue it and load it. It’s like storing the water in a tank until it’s needed.
We then peek into the queue after 5 seconds to see if the user has dropped multiple files. If no more files have been dropped, or the new files are for different targets, we automatically de-queue the first file and start processing it.
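The queue-and-peek step can be sketched roughly as below. The `QueueEntry` fields mirror the two things a queue entry records (when the file arrived and where it is going), but the class, the function names, and the in-memory list standing in for the queue are all assumptions for illustration. What happens when later files for the same target are waiting is not covered here, so the sketch only reports that case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueueEntry:
    file_name: str
    target: str           # where the file will be loaded to
    arrived_at: datetime  # when the file was dropped


def peek(queue):
    """Return the oldest entry and any later entries for the same target."""
    if not queue:
        return None, []
    ordered = sorted(queue, key=lambda entry: entry.arrived_at)
    first = ordered[0]
    same_target = [e for e in ordered[1:] if e.target == first.target]
    return first, same_target


def can_auto_dequeue(queue):
    """True when the first file can be de-queued straight away:
    no more files were dropped, or the new ones are for other targets."""
    first, same_target = peek(queue)
    return first is not None and not same_target


t0 = datetime(2024, 1, 1, 12, 0, 0)
queue = [
    QueueEntry("sales_1.csv", "sales", t0),
    QueueEntry("customers.csv", "customers", t0 + timedelta(seconds=2)),
]
print(can_auto_dequeue(queue))  # the second file is for a different target
```

In a real system the peek would run about 5 seconds after the first file is queued; the sketch leaves the timer out and only shows the decision.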
Next, we parse the first 5 rows of the de-queued file to identify the file type, header row, and delimiter. This is our quality check to ensure the data is what we expect it to be.
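For delimited text files, a check like this can be approximated with Python's standard `csv.Sniffer`. This is only a sketch under assumptions: the `inspect_sample` function is hypothetical, the candidate delimiters are a guess at common ones, and file type detection is left out entirely. It is not a description of how AgileData performs the check.

```python
import csv

SAMPLE_ROWS = 5  # only the first 5 rows are inspected


def inspect_sample(text, candidate_delimiters=",;\t|"):
    """Guess the delimiter and whether the first row is a header."""
    # Keep just the first few lines so large files stay cheap to check.
    sample = "\n".join(text.splitlines()[:SAMPLE_ROWS])
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample, delimiters=candidate_delimiters)
    return {
        "delimiter": dialect.delimiter,
        "has_header": sniffer.has_header(sample),
        "rows_inspected": len(sample.splitlines()),
    }


text = (
    "id,name,amount\n"
    "1,alice,10.5\n"
    "2,bob,20.0\n"
    "3,carol,7.25\n"
    "4,dave,3.0\n"
    "5,erin,9.9\n"
)
print(inspect_sample(text))
```

`csv.Sniffer` relies on heuristics and raises `csv.Error` when it cannot decide, so a production version would need a fallback for ambiguous samples.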
If this is the first time we’ve encountered the file, we use AUTODETECT, which automatically parses the file and determines the data types and column names, and loads the file into a new table in BigQuery.
If we’ve seen it before, we use the existing schema file and append the data into the existing target table.
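The choice between the two paths could be expressed as a small function that builds the load settings. The sketch below returns a plain dictionary whose keys follow the field names of BigQuery's load job configuration as we understand it (`autodetect`, `schema`, `writeDisposition`, `sourceFormat`, `fieldDelimiter`, `skipLeadingRows`). The `build_load_config` function and the shape of the saved schema are assumptions, and how the job is actually submitted is not shown.

```python
def build_load_config(delimiter, has_header, saved_schema=None):
    """Build load settings for a first-time or a repeat file.

    saved_schema is a hypothetical list of {"name": ..., "type": ...}
    dicts read from the persisted schema file, or None on first sight.
    """
    config = {
        "sourceFormat": "CSV",
        "fieldDelimiter": delimiter,
        "skipLeadingRows": 1 if has_header else 0,
    }
    if saved_schema is None:
        # First encounter: let BigQuery work out column names and types.
        config["autodetect"] = True
    else:
        # Seen before: reuse the schema and append to the existing table.
        config["schema"] = {"fields": saved_schema}
        config["writeDisposition"] = "WRITE_APPEND"
    return config


first_time = build_load_config(",", True)
repeat = build_load_config(",", True, [{"name": "id", "type": "INTEGER"}])
print(first_time)
print(repeat)
```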
Ensuring a Steady Flow: The Results of Our Plumbing Strategy
Just like a seasoned plumber, we always keep an eye on our system for any potential hiccups.
On completion of a successful load, we once again check the queue to see if more files for this target have arrived. If so, we de-queue the next one. Otherwise, if this was the only file or the last in the queue, we trigger the next upstream process.
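As a rough sketch, the decision after a successful load might look like this. The queue is modelled as a list of `(file_name, target)` tuples in arrival order, and the `after_successful_load` function and its return values are hypothetical names used only for illustration.

```python
def after_successful_load(queue, target):
    """Decide what to do once a file for `target` has loaded.

    queue is a list of (file_name, target) tuples in arrival order.
    Returns ("load", file_name) if another file for this target is
    waiting, otherwise ("trigger_next", target).
    """
    for index, (file_name, entry_target) in enumerate(queue):
        if entry_target == target:
            # De-queue the next file waiting for the same target.
            del queue[index]
            return ("load", file_name)
    # This was the only file, or the last one, for this target.
    return ("trigger_next", target)


queue = [("customers.csv", "customers"), ("sales_2.csv", "sales")]
print(after_successful_load(queue, "sales"))
print(after_successful_load(queue, "sales"))
```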
Should a file load fail, we have our own set of wrenches and pliers to tackle the problem.
If it was a prototype load (no config exists, the first time we have seen this file), we retry the file load after creating a new schema for it using all string data types.
If it’s a contract load (configuration exists), we create an error notification and move the file into the error bucket for later resolution.
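The two failure paths can be sketched as follows. The `handle_failed_load` function, the action names, and the schema format are assumptions made for the example; the only parts taken from the description above are the all-string retry for prototype loads and the error bucket plus notification for contract loads.

```python
def handle_failed_load(file_name, column_names, has_config):
    """Pick the recovery step for a failed load.

    has_config is True for a contract load (configuration exists) and
    False for a prototype load (first time the file has been seen).
    """
    if has_config:
        # Contract load: raise a notification and park the file.
        return {
            "action": "move_to_error_bucket",
            "notify": True,
            "file": file_name,
        }
    # Prototype load: retry with every column typed as a string.
    string_schema = [{"name": name, "type": "STRING"} for name in column_names]
    return {"action": "retry", "file": file_name, "schema": string_schema}


print(handle_failed_load("sales.csv", ["id", "amount"], has_config=False))
print(handle_failed_load("sales.csv", ["id", "amount"], has_config=True))
```

Typing every column as a string is a forgiving fallback: values that failed type detection can still be loaded and cleaned up later.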
Dealing with data is like dealing with water: it requires understanding, patience, and the right set of tools. In our world of data plumbing, we ensure that data, like water, flows smoothly and reaches its destination without causing any disruptions.
Keep making data simply magical