Data Lineage, mapping your way to magic

Nov 19, 2020 | AgileData Podcast, Podcast

Join Shane and Nigel as they discuss data lineage, what it is, why people want it and what it can actually be used for.

Recommended Books

Podcast Transcript

Read along you will

PODCAST INTRO: Welcome to the “AgileData” podcast where Shane and Nigel discuss the techniques they use to bring an Agile way of working into the data world in a simply magical way.

Shane Gibson: Welcome to the AgileData podcast. I’m Shane Gibson.

Nigel Vining: And I’m Nigel Vining.

Shane Gibson: And today, we thought we’d have a little chat about a thing called “Lineage”. So lineage has been around for a long, long time in the world of data. I remember some of the early ETL tools that we played with or used years ago; one of the things they were always pumping out was their ability to do lineage. So it’s been around for a while; it kind of goes out of favour, comes back, and new tools come out, they don’t have it, they start building it. And if we talk about what lineage is, the way I explain it is with a visualisation I use in my head, which is the tube map of the London Underground. It gives you the ability to see that if you start at one station, there’s a journey you can take through a number of other stations to get to your destination. And data lineage for me is the same thing. It says that I have some data in a source system, in a data factory, and that data passes through all these bits of code to end up in my report. And I want to be able to understand each of those steps in the journey the data takes, for some specific reason; there’s some action I want to take by understanding how that data moved. So for me, that’s what lineage is all about.

Nigel Vining: Yeah, my mind’s similar here. It’s funny that you mentioned tools and 20 years, because I always picture something like the Oracle suite at the time, Oracle Data Integrator or those tools, and lineage was effectively your left-to-right diagrams, how you built up ETL and data pipelines. You’d start from the left with a source object, you’d have a target object on the right, and huge chains of widgets between them; that was effectively your journey. The data integrator would say take the data from the left, go through the widgets in the middle and spit out something on the right. And that was your lineage; it was what we’d call the left to right. And it’s funny, because that’s still actually what we call it today, left to right, how we get there. And then you can add the complexity, because your right becomes another left, and it keeps going and going and going, like your tube analogy.

Shane Gibson: Yeah, I think that’s only 10 years ago. You’ve been around long enough that it was Oracle Warehouse Builder, or actually it was hand coded PL/SQL in your day, wasn’t it?

Nigel Vining: You’re right, it was Warehouse Builder.

Shane Gibson: So one of the challenges with all those tools in the old days was black box code. Even when we used those tools, we still had nodes sitting in the tool that had a bunch of code where the developer, or the data engineer now, couldn’t use the pointy-clicky features of the tool for some reason. And so we ended up writing a blob of code to do something that was relatively complex. And typically, as soon as that happened, it broke the lineage. It was as if, between one station and the next, there was a tunnel, and when you got to it, you could never tell which way the train went, because the tools didn’t really handle that. And I’m not sure whether the latest versions of the lineage tools I’ve seen have solved that problem, because with those tools you’re still writing a bunch of black box code. So for us, that’s one of the things that makes it slightly different. When we talk about lineage, you can’t actually write black box code; you have to create a rule using a rule type. That’s a known piece of config. We understand the technique or the patterns around it, and therefore we can always make lineage, which means there are some things you can’t do right now, because they’re too complex for something we’ve designed a technique for. But that’s the trade-off: in providing that simplicity, you lose some of the complexity until we figure it out.

Nigel Vining: Yeah, you’re right about the black box, that’s always the sticking point. And just recently I was helping a customer and they were moving between products, what you’d call ETL products. And one of the headline features of the new platform product was lineage. And pretty early on, the first thing they realised was that all of the legacy code, as far as I’ve seen of it, was embedded in black boxes. So there is no lineage when you have a black box, because it’s not a left to right with a transformation, it’s just a left and a right. But the code in the middle could actually be reading five more tables, producing some temp steps, doing a whole lot of transformation. So the relationship between left and right is nothing like it would look on a lineage graph; you’ve lost all of the complexity, all the transformations, and what’s actually happening in that black box. And you’re right, effectively the only way to do lineage is to keep your building blocks small and transparent. Everything is a left and a right with a transformation, because that’s the only way you can really get a very strong lineage map out the other side of it, where all of the in-between touch points are kept very visible.

Shane Gibson: I suppose if we look at the black box code, it depends on the technologies that people were using. So if the data pipelines or the transformations they’re writing are purely SQL based, in theory there’s a structure there: there’s some data going in, some join stuff, you might be creating some tables, but they’re all based on SQL. So that’s a pattern you probably can create lineage from. I remember when I was doing projects where there were a lot of SAS people involved, they had this great habit of using SAS macros, and they were really powerful. But what it meant was, in the middle of your code, you could go and call something else; these days that would sound like an API call, really. You’d go and call this thing, it would do a whole lot of stuff and give you back the answer. And the benefit of that was you could write those bits of code once and reuse them time and time again. But within the lineage, the code that you’re looking at is really calling something that’s outside its control, something it has no visibility of. So those techniques again make it even more difficult to get lineage from those types of tools.

Nigel Vining: Yeah, I agree. And coming around to parsing SQL inside a black box, or any type of code, you can do it with some success up to a certain level. But given the power of SQL, and that some people like to use it to its best ability, you can effectively have a SQL statement which is a whole lot of inline table statements; there could be functions in there. Once you’ve nested a certain number of layers of SQL down, it becomes quite hard to tease out programmatically what the sources and targets are any more, because you’re effectively going down quite a deep hole. For a simple statement, like select a value from table ‘X’, join table ‘Y’, to produce output ‘Z’, you could probably infer the lineage quite cleanly. But you can pretty quickly run into complexities once you’ve got multiple layers of nested tables and computations down.
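For a simple statement like that, the inference Nigel describes can be sketched with a naive parser. This is a toy illustration only (the regex, table names, and SQL are made up for the example), and it falls over exactly where he says it does, on nested inline views and functions:

```python
import re

def infer_sources(sql):
    """Naively pull source tables out of FROM and JOIN clauses."""
    pattern = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)
    return sorted({m.group(1).lower() for m in pattern.finditer(sql)})

# Select a value from table X, join table Y, to produce output Z.
sql = """
create table z as
select x.value, y.label
from x
join y on y.id = x.id
"""
print(infer_sources(sql))  # → ['x', 'y']
```

A real parser would need to walk the full SQL grammar to cope with subqueries and nesting; at some depth the sources and targets stop being recoverable this way, which is the deep hole described above.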

Shane Gibson: And we’re seeing a lot of data engineering teams now using Python code to do a lot of their transformations, and Airflow for scheduling and orchestration, and my impression is that makes it worse. We move away from the standardisation we get with SQL, the known patterns of code. To replace them, you can import anything you want, and you can write it in so many different ways to achieve the same task. So blobs of Python code being scheduled is probably going to make that lineage even harder.

Nigel Vining: Yeah, and I think on the whole, more and more people generate the code, or generate the SQL, on demand, effectively at build time. There is no lineage because it hasn’t been created; it’s often created on the fly, and it can change constantly. So what you see and what actually runs can be two different things, depending on how their code is generated. Auto-generated code is slightly problematic unless it’s been set up to also generate lineage on the fly as it’s built, but that’s another complexity.

Shane Gibson: And what about these cool templating engines, things like dbt? These things are really hot at the moment. Do they solve the problem? Do they give us automated lineage, because they’re a templating technique?

Nigel Vining: It’s actually a very topical one. I’m going to say yes and no. If your inputs into the template follow a loose left-and-right pattern, then you do effectively have your lineage. So if you’re generating a SQL transformation pipeline using a template, your left and right are the inputs and outputs that you’re feeding into your template. You have your lineage up front; then you push it through your template, and you get your nicely constructed piece of SQL out the other side that you execute. So I think they help, as long as you go into the design with the intention that you need to be able to track that lineage. It’s probably more of a mindset thing: I need lineage, so at a minimum I need to know the left, the right, and the transformation for every piece of templated code that I’m going to produce.
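As a sketch of that mindset, assuming a hypothetical rule config (the table and column names are invented, and this is plain Python string templating rather than any particular tool like dbt), the lineage edge exists before the SQL does:

```python
# Hypothetical rule config: the left, the right, and the transformation.
rule = {
    "source": "stg_orders",
    "target": "fct_orders",
    "transform": "sum(amount) as total_amount",
}

TEMPLATE = (
    "create or replace table {target} as\n"
    "select customer_id, {transform}\n"
    "from {source}\n"
    "group by customer_id"
)

sql = TEMPLATE.format(**rule)

# Lineage is known up front, from the config, not parsed out afterwards.
lineage_edge = (rule["source"], rule["target"])
print(lineage_edge)  # → ('stg_orders', 'fct_orders')
```

The design point is that the config is the single source of truth: the same left and right feed both the generated SQL and the lineage map, so they can never drift apart.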

Shane Gibson: So if we look at it, we would always say, create a config-based tool that generates your lineage for you as well as your code. But there are some other techniques people can apply when they’re not writing stuff as cool as we do. So one I’ve seen is a “code parser”. If you’re writing in a language, and your data engineers have some form of standards in terms of the way they write, in theory you can write a parser that takes each of those blobby black boxes of code and determines the lineage of what went into that code and what came out.

Nigel Vining: Yeah, absolutely. It goes back to your code generator. So if your devs are disciplined and work to some sort of markdown pattern or config pattern, then they’re capturing their lefts and rights as they go along, and then they’re generating their executable code off that config.

Shane Gibson: Right. But if you’re not going for a config-based way of working, let’s look at other ways that data engineering teams could get lineage without taking the same approach we’ve taken. So in theory, as they write their Python code, or their Scala code, or their SQL code, as long as they’re using a similar way of writing it, you could custom write a parser that scans all that code and gives you the lineage.

Nigel Vining: Yes, that’s what I was inferring when I said they might use a markup language, for example: as they write their code, they’re putting markup tags in the code itself. So they’re basically leaving a little trail of breadcrumbs wrapped in markup syntax. When they run something like Sphinx, or a document generator, over the top, it goes through, grabs all the tags, and produces the lineage map, because it’s basically read through all the code and picked out all the sources, targets, and transformations from those breadcrumbs.
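A minimal sketch of that breadcrumb approach, assuming a made-up comment convention and a scanner written for it (nothing here is Sphinx itself, just the grab-the-tags idea):

```python
import re

# Assumed breadcrumb convention: "# lineage: <source> -> <target>" comments.
TAG = re.compile(r"#\s*lineage:\s*(\S+)\s*->\s*(\S+)")

def harvest(code):
    """Read through code and pick out the source -> target breadcrumbs."""
    return [m.groups() for m in TAG.finditer(code)]

script = '''
# lineage: raw.orders -> stg.orders
df = load("raw.orders")
save(df, "stg.orders")
# lineage: stg.orders -> fct.daily_sales
'''
print(harvest(script))
# → [('raw.orders', 'stg.orders'), ('stg.orders', 'fct.daily_sales')]
```

Four or five extra lines per script, as Nigel mentions below, is roughly all the devs have to contribute; the document generator does the rest.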

Shane Gibson: No, what I mean is something that does some magic, that actually reads my code with no hints and figures it out for me, because I always write my code in the same way.

Nigel Vining: Yeah, if you followed a strict structure, your parser could then read it. I was assuming you were going to give it hints, but it could do it.

Shane Gibson: That was my second technique, the idea that you could use markup, either in-line with your code or at the top of the body of your code, to provide hints to the lineage engine to say, this is what went in, and this is what went out. And ideally, these are the rules we applied, in a way that somebody who was looking at it could understand.

Nigel Vining: Yeah, I personally quite like a little bit of markdown embedded in the code. Just recently, I did a piece of work where all the devs added it; it was literally only an extra four or five lines of markdown in the code. And we had a document generator that published all that into Confluence for them, using that markdown to basically create a fully fleshed out version of the code, along with additional links, hints and some snippets of information around lineage. So that definitely works.

Shane Gibson: So another technique people could use is basically writing a logging framework. So every time a piece of code ran, rather than hints being embedded in the code, that code is actually broadcasting some events via some form of API to an event engine where you store them. So it’s saying, “Hey, creating this table, updating this table, dropping this table”. And those events could then be turned into lineage, because we’re starting to broadcast the things we’re doing with the data, which could be put back together to tell us our lineage story of what was moving where.
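A rough sketch of that idea, with an in-memory list standing in for the event engine and its API (the event shape and table names are invented for illustration):

```python
# Stand-in for an event engine reached via an API.
EVENTS = []

def emit(action, table, source=None):
    """Broadcast what a piece of code is doing to the data as it runs."""
    EVENTS.append({"action": action, "table": table, "source": source})

# Pipeline steps announce themselves as they touch data.
emit("create", "stg.orders", source="raw.orders")
emit("update", "fct.sales", source="stg.orders")
emit("drop", "tmp.scratch")

def lineage_edges(events):
    """Stitch the broadcast events back into a lineage story."""
    return [(e["source"], e["table"]) for e in events if e["source"]]

print(lineage_edges(EVENTS))
# → [('raw.orders', 'stg.orders'), ('stg.orders', 'fct.sales')]
```

Because the events are emitted at run time, the resulting lineage reflects what actually executed, not just what the code looked like.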

Nigel Vining: Oh, I love that one. That’s actually really cool, that’s a great idea. That would definitely work, and that’s an evolving lineage story as well, because as elements of config change, those load events are effectively picking that up. So you could see an evolving lineage story over time.

Shane Gibson: The other thing you could do is, if you’re using some kind of technology to create, add, update, or delete that data, most of those execution or storage tools, even outside of the true database world, have some form of logging as well. It keeps track of what was created and what was changed. So you could actually go through those logs and effectively create a story of lineage as well, couldn’t you? You could say, the data storage platform has told me these things have been touched at this time, and therefore you could infer that there was some lineage. Though the problem would be, if you’re running 100 of those things, it’s a little bit hard to know which ones ran in which order, because they’re all going to be updated at various different times.

Nigel Vining: Yeah. As soon as you said that, I instantly thought of the reason: potentially, with a serverless, on-demand architecture, you’re suddenly tracking a whole lot of simultaneous, overlapped and disconnected events happening. But we do elements of that. We use a Pub/Sub model for our events talking to each other. That way, they’re only loosely coupled, because they talk to each other by one event saying, “Hey, I’m done”, and the next one reads that message, so they loosely hand off. But to look at the log when all that’s happening, it’s chaos, because these things are firing simultaneously all over the place, so to infer lineage would be tricky. Potentially doable, but a lot of the stuff is simultaneous, so it’s not a linear flow of: a file arrived, a table was loaded, the table was dumped at the end.
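The loose handoff Nigel describes can be sketched with a queue standing in for the Pub/Sub topic (the message shape and step names are made up; a real system would use an actual messaging service):

```python
import queue

bus = queue.Queue()  # stand-in for a Pub/Sub topic

def load_step():
    # ...load the file into a staging table, then announce completion...
    bus.put({"event": "done", "step": "load_file", "output": "stg.orders"})

def transform_step():
    # Loosely coupled: it only knows about the message, not the loader.
    msg = bus.get()
    if msg["event"] == "done":
        return "transforming " + msg["output"]

load_step()
result = transform_step()
print(result)  # → transforming stg.orders
```

Each step only ever sees the “I’m done” message of the step before it, which is why the combined log looks chaotic: there is no single thread of execution to read the lineage off.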

Shane Gibson: And the other assumption we’ve made with all these techniques is that we control the entire end-to-end supply chain, and programmatically we can get that information. So that’s all great until we put Tableau on top of our data, because then we start losing visibility: that data set’s been used by Tableau, but what’s happening in that dashboard? What transformations are happening in there, what aggregations, what metrics are being calculated? So to get true end-to-end lineage, you actually need every moving part in your data infrastructure, your data architecture, to be visible, or talking to whatever is doing the lineage.

Nigel Vining: Yeah. As soon as we go cross multi-tool, multi-platform, we start to run into those integration disconnects, where BI tool users can create their own transformations and workflows. So your ETL layer’s been nicely tracked, and then effectively, once your data is exposed, you’ve got another whole layer of potential transformation going on. You’ve got to be pretty disciplined to round up that metadata and pull it back, to keep all your lineage metadata centralised. Not impossible, but you’re starting to get into those cross-platform tool integration challenges that everyone has.

Shane Gibson: And I remember, probably in the Warehouse Builder days, not the early old days, there was a standard, OEM it was called I think, back in the days when Business Objects was actually a company, not part of a big German behemoth, and Cognos was around, and maybe even Oracle Discoverer was that old. And the theory was, each one of those vendors would comply with the standard for the semantic layer within the tool. So in theory, you could create a Business Objects universe on top of your data, and Cognos could read it with no need to do any extra work, because they were sharing that model, that storage or metadata layer. And what happened was, none of the vendors really supported it, so you never really got cross-platform or cross-tool compatibility. And probably what we’re seeing at the moment is, everybody’s pretending they can play together, but they don’t want to play 100%, because then they lose their slightly unique proprietary benefit. So whether we’ll see that change or not, who knows. As well as the technology, we’ve got to come back to the core of why the hell we want lineage. So the nirvana is this idea that a data engineer is going to see that there was some data mutation in one of the data factories, and they want to understand all the code that’s going to be impacted when they add this new column, and all the reports. So they’re going to go into some pretty tool, and they’re going to right click on it and say, show me the lineage, show me the money, and it’s going to give them the answer and save them time figuring out what they need to change. But in 20 years, I’ve never seen that happen. I’ve seen lineage being demoed a lot, and it’s one of the key criteria for buying a product. But rarely have I seen a data engineer or a developer actually use the bloody thing.

Nigel Vining: Yeah, it’s funny you say that, because just literally this week, at a place I was doing some work, there’s been a lot of discussion, a lot of people asking: where has this source object been delivered in the target, or vice versa? Where was this metric derived from? Where’s it come from? And it becomes quite tricky, because the metric that they’re looking at is in a fact table. The fact table’s come from a staging table, the staging table’s built on top of three other staging tables, and those come from multiple source tables. Short of reading through pages of SQL code across multiple pipelines, it actually proves quite tricky, quite hard, because you can’t right click on that metric and say, where did you come from? Ideally, you would do that, and it would say, “Cool, I came from here, I was transformed here, and I ultimately came from this column in the source system”, and that’s quite nice. But because lineage is so hard to implement that well, it just doesn’t exist. People go off and they search for a column name in the source system and try and work it out themselves.

Shane Gibson: Yeah, and the key there is, it’s not the fact that the data moved left to right that’s as important as what we did when we were moving it. So it’s the ability to publish the rules that we’ve applied via code to change the data that actually has the value. And most of the lineage tools don’t do that; they don’t actually expose what we’re doing to the data. They just tell us we’ve moved it from here to here. So it has some value, because it helps us very quickly know where to look, which bits of code out of the 1,000 Python scripts that we’ve plumbed into Airflow we should start reading, but it doesn’t give us the answer yet. Ideally, we’ll move to a world where, when we say lineage, it’ll actually tell us a story. It’ll tell us the beginning, the middle and the end. It will tell us when the big bad wolf came and did some bad things to our data, and it’ll tell us when the data quality woodcutter came along and made our data right for us, so we understand the story behind that data, even if we’re not technically literate enough to read Scala code. I think the other thing that was interesting, when we started with the idea that we’d build a London Underground style tube map, is that very quickly we kind of stumbled across something that’s been infinitely valuable for me. And that’s the ability to view the lineage of the data as it moves through the system, and actually trigger that data to execute from a point going forward. So what it means for us is, I can go in and say, “Well look, there’s this table that’s landed in history, and we’ve got that set to manual execution at the moment”. Typically what would happen, when code’s been promoted to production-like code for us, is that as soon as new data turns up, all the dependent rules execute and the data moves through that map to the end user.
But as we’re doing the initial development of our rules, we leave that in manual, so that we can create some rules, execute them, see what happened, and just iterate through that until we get the rules right. And to do that, we go into that lineage map, we find the history table that we know we’re basing our rules on, and we say run, and then it goes through the lineage map, figures out all the dependent rules, and runs those. And I’ve found that infinitely valuable for actually creating and iterating on data. One of the things we probably want to move towards is being able to see any form of anomalies that happened with the data or those rules on that map. So being able to see where things have been detected as being a little bit funky, and using the lineage map as a way of focusing on where in that flow it is, without having to find it some other way. So for me, it’s that idea of taking something that’s really complex, like the London Underground, and giving you a visual clue of where you might want to look next. There’s a blockage at that station, so you can go and fix it or bypass it. That’s the value of a familiar lineage map in the new world.
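That run-from-a-point behaviour can be sketched as a walk over the lineage map, treated as a graph of tables and the rules between them (the map and table names are invented for the example, not AgileData’s actual implementation):

```python
from collections import deque

# Each edge says "a rule moves data from this table to that one".
EDGES = {
    "history.orders": ["stg.orders"],
    "stg.orders": ["fct.sales", "fct.returns"],
    "fct.sales": ["dash.revenue"],
}

def run_from(start):
    """Find every dependent node from a starting table, in execution order."""
    order, seen, todo = [], {start}, deque([start])
    while todo:
        node = todo.popleft()
        order.append(node)  # in a real engine, this is where the rule runs
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return order

print(run_from("history.orders"))
# → ['history.orders', 'stg.orders', 'fct.sales', 'fct.returns', 'dash.revenue']
```

The breadth-first walk guarantees a table is only reached after something upstream of it, which is what lets a single right click replay the whole journey from that station onwards.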

Nigel Vining: I think our lineage map’s quite impressive actually. Even one of our early drafts very quickly visualised all the gaps; it automatically showed where lines were disconnected, because there’d be two stations with no lines between them, or there’d be a station that had a line coming out one side but not the other. So the lineage map was quite a nice way of visualising a whole lot of complexity down to a simple picture: circles, which represent tables or objects, and the lines between them that show there is a rule that moves data along that line. And being able to right click on anything and make it run was also useful, because it simplified a whole lot of interactions. A user didn’t need to know what was happening under the covers; they just knew they wanted this line to run, to push something down to the end of the line where they were expecting to see it turn up. So that was a nice little side project, that one, which will be useful going forward.

Shane Gibson: And it’s been one of the core features of [inaudible 00:28:31] I’ve used every day since we created it. There’s lots more to add to that one, to make it even more magical, but right now it’s a simple way for me to do something that used to take half an hour, so there’s massive value in that. And it looks easy too, which is important. Well, I think that’s us pretty much done with lineage. I think what we’re saying is, it’s hard. There is no magic out-of-the-box lineage thing at the moment, especially if you’ve got multiple products, multiple developers, and multiple bits of data. So figure out what action you want to take off the lineage, what you’re going to do that’s going to save you time or make things safer, and then focus on delivering that, so that your team can use lineage to make their life easier. And then probably monitor to see if they actually use it, or whether you just thought it was a good idea. Like 10 or 20 years ago: it’s a great demo, but nobody ever used it.

Nigel Vining: It’s true. Thanks, Shane.

PODCAST OUTRO: And that, data magicians, was another AgileData podcast from Nigel and Shane. If you want to learn more about how you can apply an Agile way of working to your data, head over to