Data Lineage, mapping your way to magic

Nov 19, 2020 | AgileData Podcast, Podcast

Join Shane and Nigel as they discuss data lineage, what it is, why people want it and what it can actually be used for.

Recommended Books

Podcast Transcript

Read along you will

Welcome to the AgileData Podcast, where Shane and Nigel discuss the techniques they use to bring an agile way of working into the data world in a simply magical way.

Welcome to the AgileData Podcast. I’m Shane Gibson. And I’m Nigel Vining. And so today we thought we’d have a little chat about a thing called lineage. Um, so lineage has been around for a long, long time in the world of data. Um, I remember some of the early ETL tools that we played with, or used in anger, 20-odd years ago.

It must’ve been. Um, yeah, one of the things they were always pumping out was their ability to do lineage. So, you know, it’s been around for a while. It kind of goes out of favor, it comes back in, new tools come out, and they then have to start building it. And if we talk about what lineage is, um, the way I explain it is, uh, I use a visualization, a picture in my head, which is the London Underground, the Tube map of the London Underground.

And it gives you the ability to see that if you start at one station, uh, there’s a journey you can take through a number of other stations to get to your destination. And lineage for data, for me, is the same thing: I have some data in a source system, a data factory, uh, and that data passes through all these bits of code to end up in my report.

And I want to be able to understand each of those steps in the journey the data takes, for some specific reason. There’s some action I want to take by understanding how that data moves. Um, so for me, that’s what lineage is. Uh, yeah, I guess mine’s similar. And it’s funny that you mention tools and 20 years, but, um, I guess I go to something like the Oracle suite at the time, Oracle Data Integrator, all those tools, and lineage was effectively your, uh, left-to-right diagrams.

That was how you built up ETL and data pipelines. You’d start from the left with a source object, you’d, uh, grab a target object on the right, and you’d chuck some widgets between them. And it was effectively your journey. Uh, the data integrator would say: take the data from the left, go through the widgets in the middle, and spit out something on the right.

And that was your lineage, and it was what we’d call left to right. And it’s funny, ’cause that’s still actually what we call it today: left to right, how we get there. And then you can add the complexity, because then your right becomes another left, and it keeps going and going and going, like your Tube map analogy.
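The left-to-right idea, where each right becomes the next left, can be sketched as a chain of steps. This is only an illustration: the `Step` class, the table names and the `journey` function are hypothetical, and the sketch assumes each object feeds exactly one step and that there are no loops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    left: str    # source object
    widget: str  # transformation applied in the middle
    right: str   # target object


# Hypothetical pipeline: each step's right is the next step's left.
STEPS = [
    Step("crm.customers", "land", "history.customers"),
    Step("history.customers", "dedupe", "staging.customers"),
    Step("staging.customers", "aggregate", "report.customer_counts"),
]


def journey(steps, start):
    """Follow left-to-right steps from a starting object, like riding one Tube line."""
    by_left = {step.left: step for step in steps}
    path = [start]
    while path[-1] in by_left:
        path.append(by_left[path[-1]].right)
    return path


print(" -> ".join(journey(STEPS, "crm.customers")))
```

A real lineage map branches and merges, so a single dictionary keyed by the left object would not be enough; the point here is only that chained left/right pairs already describe the journey.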

Yeah. I think, um, Oracle Data Integrator, I mean, you know, that’s only 10 years ago. I mean, you’ve been around long enough that it was Warehouse Builder, or actually, it was hand-coded PL/SQL in your day, wasn’t it? Yeah, you’re right, it was Warehouse Builder.

So, um, yeah. One of the challenges with all those tools in the old days was, um, uh, black-box code. Um, so even when we used those tools, uh, we still had nodes sitting in the tool that were a bunch of code, uh, where, you know, the developer, or what I’d call a data engineer now, couldn’t use the pointy-clicky features of the tool for some reason.

And so they ended up writing a blob of code to do some magic that was relatively complex. And typically, as soon as that happened, it broke the lineage. You know, it was as if, when you got to the station, you know, between that station and the next station there was, like, a tunnel, and when you got through it, you could never tell which way the train went.

Um, and I’m not sure, with the latest versions of the lineage tools I’ve seen, whether they solve that problem. Uh, because, you know, with those tools you’re still writing a bunch of black-box code. Um, so for us, I think one of the things that makes it slightly different with AgileData.io, when we talk about lineage, is that you can’t actually write code.

Um, you have to create a rule using a rule type. Uh, that’s a known piece of config. We understand the technique or the pattern around that, and therefore we can always map the lineage. Um, there are some things that you can’t do right now, um, because they’re too complex for something that we’ve designed a technique for.
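As a rough illustration of why config-only rules make lineage easy, here is a minimal sketch. It is not AgileData’s actual rule format: the rule dictionaries, the `KNOWN_RULE_TYPES` set and the `lineage_edges` function are all assumptions made for the example.

```python
# Hypothetical rule config: every rule declares its type, inputs and output.
RULES = [
    {"name": "load_orders", "type": "load",
     "inputs": ["src.orders"], "output": "history.orders"},
    {"name": "join_orders_customers", "type": "join",
     "inputs": ["history.orders", "history.customers"],
     "output": "consume.orders_enriched"},
]

# Only rule types with a designed pattern are allowed (assumed list).
KNOWN_RULE_TYPES = {"load", "join", "filter", "aggregate"}


def lineage_edges(rules):
    """Return (input, rule name, output) edges; reject rule types with no known pattern."""
    edges = []
    for rule in rules:
        if rule["type"] not in KNOWN_RULE_TYPES:
            raise ValueError(f"no pattern designed for rule type {rule['type']!r}")
        for source in rule["inputs"]:
            edges.append((source, rule["name"], rule["output"]))
    return edges


for edge in lineage_edges(RULES):
    print(edge)
```

Because nothing can run unless it is expressed as a known rule type, every input and output is declared up front, which is the trade-off being described: less flexibility in exchange for lineage that never has a blind spot.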

Um, but that’s the trade-off, right? In providing that simplicity, you lose some of the complexity until we figure it out. Yeah. The, um, yeah, the black box. Yeah, you’re right about the black box. It’s always the sticking point. And just recently I was helping a customer, and they were moving between products, I guess you’d call them ETL products.

And one of the must-have features for the new platform product was, um, lineage. And the first thing, I think pretty early on, they realized was that all of the legacy code, or at least a good 95% of it, was embedded in black boxes. So there is no lineage when you have a black box, because it’s not a left to a right, uh, through a transformation. It’s a left to a right, but the code in the middle could actually be bringing in five more tables, producing some temp tables, a whole lot of transformation.

So the relationship between left and right, um, you know, is nothing like it would look on a lineage graph. You’ve lost all of the complexity, the detail of the transformations, and what’s actually happening in that black box.

And Shane, you’re right. Effectively, the only, well, I guess the organic way to do lineage is you need to keep your building blocks, um, small and transparent. Everything is a left and a right with a transformation, um, because it’s, you know, the only way you can really get a very strong lineage map out the other side of it, because all of the known touchpoints, um, are kept very visible.

Yeah, no. So I suppose if we look at, um, you know, the black-box code, again, it depends on the technologies that people are using. So, you know, if the data pipelines or the transformations that you’re writing are purely SQL based, um, in theory, right, there’s a structure there. There’s, you know, uh, some data going in, some joining stuff, you might be creating some temp tables, but they’re all based on SQL.
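To make the "there’s a structure there" point concrete, here is a deliberately naive sketch that pulls sources and a target out of one flat SQL statement with regular expressions. The function name and the example tables are hypothetical, and this is not how a production lineage tool would parse SQL: it only copes with a single un-nested statement, and subqueries, CTEs or inline views would give misleading results.

```python
import re


def naive_sql_lineage(sql):
    """Extract (sources, target) from one flat INSERT/CREATE ... SELECT statement."""
    target = re.search(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.I)
    return sorted(set(sources)), target.group(1) if target else None


sql = """
INSERT INTO z
SELECT x.value, y.label
FROM x
JOIN y ON x.id = y.id
"""

print(naive_sql_lineage(sql))
```

The simple case works because the statement has exactly one layer; once tables are nested several levels deep, a regex like this cannot tell real sources from intermediate names.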

So that’s a pattern that you probably can create lineage from. But I remember when I was, uh, doing projects where there were a lot of SAS people involved. They had this great habit of using SAS macros, and they were really powerful. But what it meant was, in the middle of your code, you could go call something else.

I think these days that kind of sounds like an API, really. You’d go and call this thing, it would do a whole lot of stuff and give you back the answer. And so the benefit of that was you could write those bits of code once and reuse them time and time again. But within the lineage, you know, the code that you’re looking at, um, is really calling something that’s outside of its control, something that it has no visibility of.

So those techniques, again, make it even more difficult to get lineage from those types of tools, right?

Yeah. Yeah, I agree. And on your comment around, I guess, parsing SQL inside a black box, or any type of code: um, you can do it with some success, um, to a certain level. But given the power of SQL and, um, its flexibility, you can effectively have a SQL statement which is a whole lot of, um, inline table statements. Um, there could be functions in there. Uh, but once you have nested a certain number of layers of SQL down, it becomes quite hard to tease out programmatically what the sources and targets are anymore, because you’re effectively going down quite a deep hole. A simple statement like, um, select a value from table X, join table Y, to produce output Z: yep, you could probably infer the lineage quite cleanly. But you can pretty quickly run into complexities once you’ve got multiple layers of nested tables and computations down.

And what about, you know, we’re seeing a lot of data engineering teams now using Python code to do a lot of their transformations, Airflow for scheduling and orchestration of it. Um, my impression is it makes it worse, right? We move away from, uh, the standardization we get with SQL, the known patterns of code, and replace them. You know, you can import anything you want. You can write it in so many different ways to achieve the same task. So blobs of Python code being scheduled are probably going to make that lineage even harder, right?

Yeah. Yeah. I think the whole, um, as more and more people, I guess, um, generate the code or generate the SQL on demand, effectively then at build time there is no lineage, because it hasn’t been created yet.

It’s often created on the fly and it can change constantly. So what you see run and what I see run can be two different things, depending on how they code it. So auto-generated code is slightly problematic again, unless it’s been set up to also generate lineage on the fly as it’s built. But yeah, I guess it’s another complexity.

And what about these code templating engines, things like dbt, you know, these things that are really hot at the moment? Do they solve the problem? Do they give us automated lineage, because they’re a templating technique? Um, it’s actually a very topical one. I’m going to say yes and no. Um, if your inputs into the template, uh, follow a loose, um, left-and-right pattern, then you do effectively have your lineage.

So say you’re generating a SQL transformation pipeline using a template: uh, your left and right, so your inputs and outputs, um, that you’re feeding into your template are known. So you sort of have, I guess, your lineage up front, then you push it through your template and you get your nicely constructed piece of SQL out the other side that you execute.

I think, um, they help, as long as you go into the design with the intention that you need to be able to track that lineage. I think it’s probably more of a mindset thing around: I need lineage, so I need to, at a minimum, know the left, the right and the transformation for every piece of templated code that I’m going to run.
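A minimal sketch of that mindset, using Python’s standard `string.Template` rather than any particular templating product: the same inputs that render the SQL also produce the lineage record. The template text, table names and `render_with_lineage` function are hypothetical.

```python
from string import Template

# Hypothetical template for a simple filter transformation.
SQL_TEMPLATE = Template(
    "CREATE OR REPLACE TABLE $right AS\n"
    "SELECT $columns\n"
    "FROM $left\n"
    "WHERE $condition"
)


def render_with_lineage(left, right, columns, condition):
    """Render the SQL and, from the very same inputs, the lineage record."""
    sql = SQL_TEMPLATE.substitute(
        left=left, right=right, columns=", ".join(columns), condition=condition
    )
    lineage = {"left": left, "right": right, "transformation": f"filter: {condition}"}
    return sql, lineage


sql, lineage = render_with_lineage(
    "staging.orders", "consume.open_orders", ["order_id", "amount"], "status = 'open'"
)
print(sql)
print(lineage)
```

The lineage here is only as good as the discipline around the template: if someone hand-edits the generated SQL or slips extra tables into `condition`, the record and the code drift apart.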

So, you know, if we look at it, we would always say, uh, create a config-based tool that, uh, generates lineage for you as well as your code. But, you know, there are some other techniques people can apply when they’re not writing stuff as cool as we do. So, um, what I’ve seen is a code parser, right? So if you are writing in a language, and your devs or data engineers have some form of standards in terms of the way they write it, in theory you can write a parser that takes each of those blobby black boxes of code and determines the lineage of what went into that code and what went out, right?

Uh, yeah, absolutely. Yep. Your config would, I guess, provide that story. Yep, it goes into your code generator. So yes, if your devs are, uh, disciplined and they work to some sort of markdown pattern or config pattern, then they’re capturing the lefts and rights as they go along, and then that’s, uh, I guess, generating their executable code or their config.

Right. But what I’m saying is, if you’re not going for a config-based, uh, way of working. So let’s look at other ways that, uh, data engineering teams could get lineage without taking the same approach we’ve taken.

So, um, in theory, you know, as they write the Python code or the Scala code or the SQL code, as long as they’re using a similar way of writing it, uh, in theory you could custom-write a parser that scans all that code and gives you the lineage. Oh, sorry. Yes. I was sort of inferring that, uh, when I said it, um, they might use a markdown language.

For example, as they write the code, they’re putting markdown tags in the code itself. So they’re basically leaving a little trail of breadcrumbs that are wrapped in a markdown syntax. When they run, uh, something like maybe Sphinx or a documenter over the top, it goes: cool, lots of tags, grab, grab, grab, grab, grab, and it produces your lineage map.

’Cause it’s basically read through all the code and picked out all the sources and targets and transformations, these breadcrumbs, I guess, and produced your lineage.
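The breadcrumb idea can be sketched with a small scanner. The tag convention below (`# lineage: source=... target=... rule=...`) is invented for this example, as are the function and table names; a team would pick its own syntax, and a documentation generator would normally do the collecting.

```python
import re

# Hypothetical tag convention: "# lineage: source=<a>[,<b>] target=<c> rule=<text>"
TAG = re.compile(
    r"#\s*lineage:\s*source=(?P<source>\S+)\s+target=(?P<target>\S+)\s+rule=(?P<rule>.+)"
)


def collect_breadcrumbs(code):
    """Scan source text line by line and return one lineage record per tag found."""
    crumbs = []
    for line in code.splitlines():
        match = TAG.search(line)
        if match:
            crumbs.append({
                "sources": match["source"].split(","),
                "target": match["target"],
                "rule": match["rule"].strip(),
            })
    return crumbs


# Example script text; only the tagged comments are read, the code is never executed.
SCRIPT = '''
# lineage: source=raw.orders,raw.customers target=staging.orders rule=join on customer_id
df = orders.merge(customers, on="customer_id")
# lineage: source=staging.orders target=report.daily_sales rule=sum amount by day
daily = df.groupby("day")["amount"].sum()
'''

for crumb in collect_breadcrumbs(SCRIPT):
    print(crumb)
```

Note that nothing checks the tags against what the code really does, so the lineage is only as accurate as the breadcrumbs the developers leave.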

What I mean is, there’s some magic that actually reads my code with no hints and figures that out for me, because I always write my code in the same way. Uh, yeah, I guess so. Yes, if you’ve followed a strict structure, yep, your parser would, yeah, read it. Yep. I was thinking, assuming you were going to give it hints, but you could do it straight from the structure.

That was my second thing: you know, the idea that you could use markup, inline with your code or at the top or the bottom of your code, to provide hints to the lineage engine to say, this is what we’ve done, and this is where we are now, and ideally these are the rules we applied, in a way that somebody who was looking at it could understand.

I personally quite like a little bit of markdown embedded in the code. Um, just recently I did a piece of work where all the devs added, it was literally only an extra four or five lines of markdown, uh, into the code, and we had, um, a document generator that published all that into Confluence for them, using their markdown to basically create a fully fleshed-out version of their code, along with additional links and hints and some snippets of information around it. I’d definitely do it.

So, so another technique people could use is basically writing a logging framework. So every time a piece of code ran, rather than the hints being embedded in the code, that code is actually broadcasting, uh, some of those things, via some form of API, to an event engine where you store them. So it’s saying: hey, creating this table; hey, updating this table; hey, dropping this table.

Um, and those events could then be turned into lineage, right? Because we’re starting to broadcast the things we’re doing with the data, which could be put back together to tell us our lineage story of what was moving where. Oh, I love that one. That’s actually really cool. Yeah. No, that’s a great idea. Yep.
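A minimal sketch of the broadcast-and-reassemble idea, with an in-memory list standing in for the event engine. The `broadcast` function, the event fields and the table names are assumptions for illustration; in practice the events would go through an API to a store.

```python
from datetime import datetime, timezone

EVENT_LOG = []  # stand-in for an event engine sitting behind an API


def broadcast(action, table, read_from=()):
    """Record what a piece of code just did to a table, and what it read to do it."""
    EVENT_LOG.append({
        "at": datetime.now(timezone.utc),
        "action": action,
        "table": table,
        "read_from": list(read_from),
    })


def lineage_from_events(events):
    """Put the broadcast events back together as (source, target) edges."""
    edges = set()
    for event in events:
        if event["action"] in {"create", "update"}:
            for source in event["read_from"]:
                edges.add((source, event["table"]))
    return sorted(edges)


# Each piece of pipeline code announces what it is doing as it runs.
broadcast("create", "staging.orders", read_from=["raw.orders"])
broadcast("update", "report.daily_sales", read_from=["staging.orders"])
broadcast("drop", "tmp.scratch")

print(lineage_from_events(EVENT_LOG))
```

Because every event carries a timestamp, replaying the log over different time windows would also show how the lineage changes as the code changes.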

It would definitely work. Um, well, and that’s an evolving lineage story as well, because as elements of config change, uh, those logged events are effectively, uh, picking that up, so you could see an evolving lineage story over time. Nice.

The other thing you could do is, uh, you’re using some kind of technology to create, uh, add, update or delete that data. So most of those, you know, even outside of the traditional database world, most of those execution or storage tools, um, have some form of logging as well, right? They keep track of what was created and what was changed. So you could actually go through those logs and effectively create a story of lineage as well.

Couldn’t you? You could say, um, you know, the data storage platform has told me these things were being touched at this time, and therefore you could infer that, you know, there was some lineage. Now, the problem would be, if you’re running a hundred of those things, um, it’s a little bit hard to know which ones went in which order, right?

Because, uh, yeah, they’re all going to be updated at various different times. Yeah. Yeah. As soon as you said that, I suddenly thought, yeah, exactly, for the reason you said: uh, potentially with, uh, a serverless, on-demand architecture, um, you’re suddenly tracking a whole lot of simultaneous, overlapping and disconnected events happening. But, um, we do elements of that.

We use a pub/sub model for our events talking to each other, and that way they’re only loosely coupled, because they talk to each other by one of the events saying, hey, I’m done, and the next one reads that message and says, cool, and starts. So they loosely hand off. But to look at the log when all that’s happening, it’s chaos, because these things are firing simultaneously all over the place. To infer lineage would be tricky. Potentially doable, but a lot of the stuff is simultaneous, so it’s not a linear flow of: a file arrived, a table was loaded, the table was dumped at the end.

And the other assumption we’ve made with all these techniques is that we control the entire end-to-end supply chain and, programmatically, we can get that information.

So that’s all great until we bolt Tableau on top of that, because then we start losing the visibility. Right, well, that data, we can see it’s been, you know, used by Tableau, but what’s happening in the dashboard? What transformations are happening in there, what calculations, what metrics are being calculated?

So, you know, to get true end-to-end lineage, you actually need every moving part in your data infrastructure, your data architecture, to be visible to, or talking to, whatever’s doing the lineage.

Yeah. Yep. Um, as soon as we start to cross, uh, I guess, multi-tool, multi-platform type things, then we start to run into those, uh, I guess, integration disconnects, where your BI tool users can create their own transformations and workflows. So, uh, your ETL layer is, uh, nicely tracked, and then effectively, once your data is exposed, you’ve got another whole layer of potential transformation going on. Um, you’ve got to be pretty disciplined to round up that metadata and pull it back to, you know, keep all your lineage, uh, metadata centralized. Not impossible, but you’re starting to get into those, um, cross-platform tool integration challenges that everyone has.

And I remember back in the, I think it was probably the Warehouse Builder days, not the ODI days, there was an OEM, open metadata, uh, something. And it was, uh, back in the days when Business Objects was actually a company, not part of a German behemoth, um, and Cognos was around, and, uh, it was even maybe Oracle Discoverer.

It was an old one, um, and the theory was, uh, each one of those vendors would comply with the standard for the semantic layer. Um, and so in theory you could create a Business Objects universe on top of your data and Cognos could read it, right? With no need to do any extra work, because they were sharing that model, uh, storage and metadata layer.

Um, and you know, what happened was none of the vendors really supported it. So, um, you know, you never really got cross-platform across all competitors. And, you know, probably what we’re seeing again at the moment is, you know, everybody’s pretending they can play together, but, um, they don’t want to play a hundred percent, because then they lose their slightly unique, proprietary benefit.

Um, so we’ll see whether that changes or not. But, you know, as well as the technology, you know, we’ve got to come back to the core of why the hell do we want lineage. Um, so I think the nirvana is this idea that a data engineer’s going to see that there’s some data mutation in one of the data factories, and they want to understand all the code that’s going to be impacted when they add this new column, and all the reports.

So they’re going to go into some pretty tool and they’re going to right-click on it and say, show me the lineage, show me the money. And it’s going to give them the nodes where they need to go, uh, and save time figuring out what they need to change. But yeah, in 20 years I’ve never seen that happen. I’ve seen lineage being demoed a lot.
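The "right-click, show me the lineage" idea is, underneath, a walk over everything downstream of the object being changed. Here is a minimal sketch assuming lineage is already available as (left, right) edges; the edge list and the `impacted` function are hypothetical.

```python
from collections import deque

# Hypothetical table-level lineage edges: (left, right).
EDGES = [
    ("src.orders", "history.orders"),
    ("history.orders", "staging.orders"),
    ("staging.orders", "report.daily_sales"),
    ("staging.orders", "report.customer_value"),
    ("src.customers", "history.customers"),
    ("history.customers", "report.customer_value"),
]


def impacted(edges, changed):
    """Breadth-first walk of everything downstream of the changed object."""
    children = {}
    for left, right in edges:
        children.setdefault(left, []).append(right)
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(impacted(EDGES, "src.orders"))
```

This only answers "which objects might be affected"; knowing whether a new column actually matters to each of them would need column-level lineage and the rules applied along the way.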

And it’s been one of the key criteria for buying a product, but rarely have I seen a data engineer or a developer actually use the bloody thing.

Yeah. Yeah. Um, funny you say that, because just literally this week, at a place I’m doing some work, there’s been a lot of discussion, a lot of people asking, you know, where has this source object ended up in the target, or vice versa, where was this metric derived from? Where’s it come from? And it becomes quite tricky, because the metric they’re looking at, you know, it’s in a fact table. The fact table’s come from a staging table, that staging table was built on top of three other staging tables, and under that there are multiple source tables. Um, short of reading through pages of SQL code across multiple pipelines, uh, it actually proves quite tricky, quite hard, because you can’t right-click on that metric and say, you know, where did you come from? Because ideally you would do that, and it would say: cool, I came from here, I was transformed here, and I ultimately came from this column in the source system.
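Asking a metric "where did you come from?" can be sketched as a recursive walk back through column-level lineage that also reports the rule applied at each hop. The `COLUMN_LINEAGE` mapping, the column names and the rules are all made up for the example.

```python
# Hypothetical column-level lineage: target column -> (rule applied, source columns).
COLUMN_LINEAGE = {
    "fact_sales.net_amount": (
        "gross_amount - discount",
        ["stg_sales.gross_amount", "stg_sales.discount"],
    ),
    "stg_sales.gross_amount": ("cast to decimal", ["src_orders.amount"]),
    "stg_sales.discount": ("default null to 0", ["src_orders.discount"]),
}


def where_from(column, depth=0):
    """Return indented lines describing how a column was derived, back to source."""
    if column not in COLUMN_LINEAGE:
        return ["  " * depth + f"{column} (source system column)"]
    rule, parents = COLUMN_LINEAGE[column]
    lines = ["  " * depth + f"{column} <- {rule}"]
    for parent in parents:
        lines.extend(where_from(parent, depth + 1))
    return lines


print("\n".join(where_from("fact_sales.net_amount")))
```

The hard part in practice is populating a mapping like `COLUMN_LINEAGE` in the first place, which is exactly why people fall back to searching for column names by hand.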

Uh, you know, and that’s quite nice, but yeah, because the lineage is so hard to implement, well, it just doesn’t exist. People go off and they search for a column name in the source system and try and work it out themselves.

Yeah. And so I think the key there is, it’s not the fact that the data moved left to right that’s as important as what we did when we were moving it. Um, so it’s a way of publishing the rules that we’ve applied by code to change the data that actually has the value. And so most of the lineage tools don’t do that. They don’t actually expose what we’re doing to it. Um, they just tell us we’ve moved it from here to here.

So it has some value, right? Because it helps us very quickly know where to look: which bits of code, out of the 1,000 Python scripts that we’ve plumbed into Airflow, uh, which of those we should start reading. Um, but it doesn’t give us the, yeah. Ideally we’ll move to a world where, when we say lineage, it’ll actually tell us a story, right?

It’ll tell us the beginning, the middle and the end, uh, or tell us when the Big Bad Wolf came and did some bad things to our data. Uh, and it tells us, you know, when the data quality, uh, came along and, uh, made our data right for us, right? So we understand, um, the story behind that data in a way we can understand even if we’re not technical enough to read Scala. Um, I think the other thing that was interesting when we started with this was, uh, you know, the idea that we built, uh, a London Underground Tube map, um, and very quickly we kind of stumbled across something that’s been infinitely valuable for me.

And that’s the ability to view the lineage of the data as it moves through the system and actually trigger, uh, that data to execute, uh, from a point. Um, so what it means is, you know, for us, I can go in and say, well, look, there’s this table that’s landed in history. Um, and we’ve got that set as a manual execution at the moment.

So, um, typically what would happen is, when code’s been running as production-like code for us, um, as soon as new data turns up, uh, all the dependent rules execute and the data moves through that map to the end user. But as we’re doing, uh, the initial development of rules, we leave that as manual so that we can create some rules and then execute them and see what happened.

Um, and just iterate through that until we get the rules right. Um, and to do that, you know, what we do is we go into that lineage map and we find that history table, um, that we know we’re basing our rules on, and we say run. And then it goes through the lineage and figures out all the dependent rules that are upstream and runs those. And I’ve found that, uh, infinitely valuable for, um, uh, actually creating and iterating on rules.
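Triggering execution from a point on the map can be sketched as a cascade: run every rule that reads the chosen table, then every rule that reads their outputs, and so on. This is an illustration under assumptions, not the AgileData implementation: the `RULES` mapping, the rule names and the `run_from` function are hypothetical, and "dependent" is taken here to mean rules that consume the table.

```python
# Hypothetical rules: name -> (input tables, output table)
RULES = {
    "clean_orders": (["history.orders"], "staging.orders"),
    "daily_sales": (["staging.orders"], "report.daily_sales"),
    "customer_value": (["staging.orders", "history.customers"], "report.customer_value"),
}


def run_from(table, rules, execute):
    """Execute each rule that reads `table`, then cascade to rules reading their outputs."""
    ran, pending = [], [table]
    while pending:
        current = pending.pop(0)
        for name, (inputs, output) in rules.items():
            if current in inputs and name not in ran:
                execute(name)
                ran.append(name)
                pending.append(output)
    return ran


# Passing print as the executor just shows the order the rules would run in.
run_from("history.orders", RULES, execute=print)
```

The sketch runs a rule as soon as any one of its inputs has been refreshed; a real scheduler would also need to wait for a rule’s other inputs and guard against cycles.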

Um, you know, one of the things we probably want to move towards is, uh, being able to see any form of anomalies that happen with the data or those rules, uh, on that map. So, um, being able to see where things are starting to be detected as being a little bit off, um, and using the lineage map as a way of just focusing on where in that flow it is, without having to find it in some other way.

So for me, it’s the idea of taking something that is really complex, like the London Underground, uh, and giving you a visual clue of where you might want to look next. You know, there’s a blockage at that station, um, so either go and fix it or bypass it. Um, that’s the value for me of lineage in the new world.

Yeah, I, um, yeah, I think, yeah, the lineage map’s quite, um, quite impressive, actually. It’s only, uh, one of the early drafts, but it very quickly visualized, uh, all the dependencies. It automatically showed where lines were disconnected, because there’d be two stations with no line between them, or there’d be a station that, you know, had a line coming out one side but not the other side.
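The disconnected-station check can be sketched without any drawing at all. Assuming the map is a set of tables plus (from, to) rule lines, the hypothetical `check_map` function below lists stations with no lines and stations with a line on only one side; whether a one-sided station is a problem depends on whether it is meant to be a source or an end point.

```python
# Hypothetical map: stations are tables, lines are rules moving data between them.
TABLES = {"src.orders", "history.orders", "report.daily_sales", "tmp.orphan"}
LINES = [("src.orders", "history.orders")]  # (from table, to table)


def check_map(tables, lines):
    """List stations with no lines, and stations with a line on only one side."""
    has_out = {left for left, _ in lines}
    has_in = {right for _, right in lines}
    return {
        "isolated": sorted(tables - has_out - has_in),
        "only_incoming": sorted((has_in - has_out) & tables),
        "only_outgoing": sorted((has_out - has_in) & tables),
    }


print(check_map(TABLES, LINES))
```

In this example `report.daily_sales` shows up as isolated, which is the kind of missing rule the picture makes obvious at a glance.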

The lineage map was quite a nice way of visualizing a whole lot of complexity down to a simple picture, which has got circles, which represent tables or objects, and the lines between them, which show that there is a rule that moves data on that line. And yes, letting Shane right-click on anything and make it run was also useful as well, because, um, it simplified a whole lot of his interactions.

He didn’t need to know what was happening under the covers. He just knew that he wanted this line to run, to push something down to the end of the line where he’s expecting to see it turn up. Yeah, no, that was, uh, that was a nice little side project there, one which will be useful going forward. Yeah. And that’s one of the cool features that I’ve used in anger every day since we created it.

So lots more to add to that one, to make it even more magical. But right now it’s a simple way for me to do something that used to take me half an hour. So, yeah, massive value in that. Um, and it looks sexy too, so that’s important. All right. Well, I think that’s pretty much us done with lineage. I think what we’re saying is it’s hard.

There is no magic out-of-the-box, uh, lineage thing at the moment, especially if you’ve got multiple products and multiple developers. Um, so figure out what action you want to take off the lineage, uh, what you’re going to do that’s going to save you time or make things safer, uh, and then focus on delivering that.

So then, uh, your team can use lineage to make their life easier. Uh, and then probably monitor it to see if they actually use it, or whether you just thought it was a good idea and, like 10 years ago or 20 years ago, demoed it but nobody ever used it. It’s true. Thanks, Shane.

And that, data magicians, was another AgileData podcast from Nigel and Shane. If you want to learn more about how you can apply agile ways of working to your data, head over to agiledata.io.