Data Lineage Patterns with Tomas Kratky
In this episode of the AgileData Podcast, Shane Gibson chats with Tomas Kratky about the critical role and evolution of data lineage in the context of modern enterprises.
They discuss:
- Tomas Kratky’s Journey: Tomas shares his transition from a traditional software engineer to a data enthusiast, leading to the inception of his company Manta, focusing on data lineage.
- Data Lineage’s Significance: Emphasising data lineage as more than an end in itself, Tomas describes it as a tool for unlocking potential in large enterprises, enhancing visibility, and fostering agility.
- Integration Challenges in Data Systems: The historical challenges of integrating various data systems, highlighting the struggle to achieve comprehensive data lineage in increasingly complex data stacks.
- Misconceptions in Data Lineage: Common misunderstandings about data lineage, particularly its oversimplification and varied interpretation across different company sizes and sectors.
- Enterprise Complexity and Data Lineage: The stark contrast in data complexity between smaller companies and large enterprises, underscoring the intricate nature of lineage in substantial, diverse environments.
- The Active Role of Metadata: The importance of not just collecting but actively using metadata to make informed decisions and automate processes, emphasizing end-user needs.
- The Push for Metadata Standards: Open standards in metadata, like Open Lineage, discussing the challenges and potential in standardising metadata practices across the industry.
- Building Comprehensive Lineage Solutions: The complexities and resource demands of developing effective data lineage solutions, especially pertinent to large, varied data landscapes.
- Future Directions in Data Lineage: The future of data lineage, focusing on deeper integration of metadata into data products and shifting from mere collection to active utilization and analytics.
- Lessons from the Past for Future Data Practices: Insights into applying historical lessons innovatively in data management, valuing the experience garnered in large enterprise settings.
Podcast Transcript
Read along you will
Shane: Welcome to the AgileData Podcast. I’m Shane Gibson.
Tomas: Hello, everyone. This is Tomas Kratky, or for simplicity, just Tomas.
Shane: Hey Tomas, thank you for coming on the show. In this episode, we are going to go into a whole conversation around data lineage. So you’ve worked in that space for a long, long time, before we kick into that, why don’t you give a bit of background about yourself for the audience, so they can understand who you are and where you came from.
Tomas: Thank you very much, and let me start by saying thanks for having me. I’m a big fan, I must say that. I listen to the podcast and I’m one of the fans following your posts on LinkedIn, so I’m just really happy to be here. A little bit about myself: I’m a very traditional software engineer.
I started coding in Java and building small systems and larger systems, and I actually got to data very, very soon, when building some enterprise applications for my customers back in the Czech Republic. It was basically 2000, and I have been in love with data since that moment. It took me a while to really start an independent company.
I had an opportunity to work for a growing professional services, custom development business back in Europe. And then I got an opportunity to start something completely independent. I took that opportunity; it was 2017 or the end of 2016. I have been on my Manta journey since that moment.
Shane: So if I understand it right, Manta is a hundred percent focused on data lineage, that’s what it does. That’s its meat and potatoes.
Tomas: I always say data lineage is basically not the end, right? Data lineage is just a means to something. When I started the company, I saw data lineage as a fundamental piece of information that, if it’s collected and if it’s used in a proper way, can unlock a lot of power for my customers back then, which were large enterprises.
It doesn’t matter if it’s finance, retail, or healthcare, it’s all the same. They are large and complex. So I saw that as a way to help them, a way to give them more visibility, more transparency. And also, and that was the most important thing for me, a way to enable agility for them, because that’s one of the secret things about lineage.
Shane: If I remember many years ago, when I was running my consulting company and we were helping out large organisations, large in New Zealand terms, which is probably tiny in European and US terms, but you know, they were the big banks, the big insurance companies, the big government agencies.
And back then we were at the stage where yes, we were buying suites of products. They were implementing the Informatica’s, the IBM’s, the Oracles of the world back then. But even though we were implementing suites, often those suites weren’t really integrated. Like a lot of the vendors at that time pretended they had integration in the PowerPoint slides.
But they’d acqui-hired or bought products with different technology stacks under the covers. The tools weren’t really integrated. And so we were often either buying or custom building a lineage capability to get that visibility from beginning to end of all that data movement. And so it became really important around that time to have that lineage capability to give us that single pane of glass.
And then what we’ve seen is the Modern Data Jenga Stack. We’ve seen this idea that rather than buying suites of products, we now buy lots of little components and cobble them together ourselves, making our own Lego blocks. We end up with somewhere between 3 and 20 different products that we’re integrating to move data around, and we still don’t see lineage in that space.
We see lineage within those tools, but we don’t see that single pane of glass. And so lineage, from my point of view, is being relegated to just a feature, not a category. Which is funny given the modern data stack is all about carving out new categories. What used to be a feature, profiling data, is somehow a new category.
So what’s your view? Why has lineage been such a strong capability required in enterprises for so long, and then all of a sudden it gets relegated to a dbt feature? If you use dbt, you can only see the lineage within dbt; you can’t see it from beginning to end.
Tomas: It’s an excellent question. So there are a lot of different answers. What is data lineage and what does it represent to me? Because when you look around, you see different people thinking about lineage in a very different way. For some people, when they say data lineage, they think about traceability.
How a specific data record is moving in my environment. So when I see number seven on my report, show me the specific data used for that calculation. Some people think in terms of what we call runtime lineage, or operational knowledge. So it’s basically runtime information about a specific workflow being executed, connecting to tables, or potentially connecting multiple columns together and producing some results.
So that’s very useful, for example, for incident management because when it comes to incident management, you really care only about the workflow executed just a minute ago. But then if you switch your view and you start thinking about more difficult things like change management, that’s the most critical thing for every organisation, ability to change things fast and in a safe way.
So for change management, what do you truly need? Is it record-level lineage or traceability? Is it runtime lineage? No, you are missing something. What you need is to understand all possible dependencies in your environment. You need to understand how things may go or flow in your environment, and what may happen depending on specific conditions.
So what you actually need is what we call design lineage: all the possible ways data can flow in your environment. Is it the same as runtime lineage? No. You may have very critical, exceptional workflows that are only triggered, I don’t know, once per year, but if you miss them when, for example, migrating or changing something in your environment, you can cause huge incidents that will be seen in a few months, maybe, down the road, but they will cost you like 100, 200 million.
So that’s the beauty and also the challenge around data lineage. And it is one of the reasons why many people simplify the topic significantly and focus on something that’s easy and simple, and now it’s just a feature, because everyone can do SQL on Snowflake and tell you the traceability for your data. It’s super easy. It’s super simple. The other part of the issue is that when it comes to LinkedIn and communication about the Modern Data Stack, we think in terms of small and mid-size companies.
Or let’s say startups with 500 or 1,000 employees, and they run their business in a very different way than, let’s say, big banks or big insurance companies. That gap is really significant, and I must say I have almost zero experience with smaller and midsize companies. I mean, I live every single day, every minute of my day, working for those giants like Fortune 500 companies, and that gap is really significant. We keep talking about how complex our environment is, but we technically have one platform, one or two reporting tools a year. We are really lazy developers, so we can’t build proper dbt models. But now think about those really giant companies. They have, I don’t know, our customers have 10,000 to 20,000 applications and multiple business units.
Every application is like a data system of its own. That complexity is just a completely different universe, which is, by the way, one of the reasons why we were able to succeed back then competing with big companies like Informatica, which you mentioned: simply because it was so difficult that no one was able to do it, and it’s basically still the same thing today.
That gap, thinking about small environments and applying this to any environment, this mindset is actually the second reason why we see lineage being seen as a feature, more or less. And then of course, the last piece is how super successful data catalogs are in terms of the narrative, how they can pretend and present that basically this is the one place where you go for anything you need with data and metadata. So I always make fun of them. I say to my customers, I mean, okay, go for it. You are telling me that you can buy one piece of software that can collect all metadata, provide metadata to all possible use cases you can benefit from, and help all possible users you may potentially have in your 100,000-employee environment.
So yes, if there is a solution like that, just tell me. I’m going to buy it myself. This is the last piece: how some companies are able to, let’s say, influence the market and the way we think. I admire that, it’s amazing, but unfortunately it’s also, I would say, damaging the market a little bit.
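Editor’s sketch: the difference Tomas draws between design lineage and runtime lineage can be illustrated with a small Python example. This is only a minimal sketch under stated assumptions, not how Manta or any other product implements it. The table names, the two edge sets and the helper `unobserved_dependencies` are all hypothetical, made up for illustration.

```python
# Hypothetical lineage edges as (source, target) pairs. All names are invented.

# Design lineage: every way data CAN flow, derived from code and configuration.
design_edges = {
    ("crm.customers", "dwh.dim_customer"),
    ("erp.orders", "dwh.fact_orders"),
    ("erp.year_end_adjustments", "dwh.fact_orders"),  # runs roughly once a year
}

# Runtime lineage: flows actually observed in recent workflow executions.
runtime_edges = {
    ("crm.customers", "dwh.dim_customer"),
    ("erp.orders", "dwh.fact_orders"),
}


def unobserved_dependencies(design, runtime):
    """Return dependencies that exist by design but were never seen at runtime."""
    return sorted(design - runtime)


missing = unobserved_dependencies(design_edges, runtime_edges)
for source, target in missing:
    print(f"{source} -> {target} exists by design but was not observed at runtime")
```

A migration plan built only from the runtime edges would miss the rarely triggered year-end flow, which is the kind of exceptional workflow Tomas warns about.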
Shane: So when you talk about those patterns, you talk about data traceability. So I’m on a report, a dashboard. I’ve got a number. The number’s 42. Where the hell did it come from? Which source system, which system of capture? What was done to it? What business rules were applied before I can see that number?
You talked about runtime lineage, observability effectively. I’m running some workflows through my system. Tell me what’s running, what’s not running, where the problems are. You talked about change management, release management, CI/CD type of things: okay, I’m about to make this change.
What’s the impact? What’s the blast radius? And you talked about, in effect, what you call design lineage but I call architecture: tell me all the blocks, all the different technologies. How do they integrate? Where’s the data moving? How are they talking to each other? You take those things and you call it lineage.
Whereas in the market now, we would call it DataOps or observability or, heaven forbid, data mesh. So again, this use of language is important, because your view of lineage is not just a graph, a DAG that says this bit of data goes from here. It is visibility of everything that’s happening to data across that organization.
Is that what I’m hearing?
Tomas: That’s a very good point, because that’s one of the issues I have. If you think about it, just show me anything less successful in the past 20 years than metadata. There was so much promise, so much promise, and I think that every time we did it, we failed.
And I have a lot of comments about that, but finally we see things changing, and we were one of the first companies just saying, please, stop thinking about stupid stuff like a metadata warehouse and one ultimate metadata repository. We are obsessed, and I have the same issue with data, we are obsessed with the idea of collection.
I don’t know if it’s this Western type of thinking that we need to have more and more and more, but we are obsessed with collecting instead of being obsessed with usage. I would rather have 1% of the data most companies have, but all the data I have should be in use, and there are many patterns for how we can put data in use.
We always think about this whole self-service and all that crap. It’s important, but I always use my mother or my father as an example. My father is actually a very important manager working in a nuclear power plant, so he needs data on a daily basis, but he will never log in to any interface to do what we call search for data.
He needs data to be integrated, embedded into his workflows. That’s how he can benefit from data, and that’s much harder for us data engineers, because it means we need to first understand what his needs are. Then we need to understand how to really deliver data when it’s needed, how it’s needed, and how to really embed it into his workflow: not force him to do things differently, but force us to help him. And it’s the same with metadata. What you said is super deep and super clever, because one of the changes we see, and we are one of the first companies moving in that direction, is to stop thinking about collection. Collection is important, but you don’t need to have all metadata. Manta is super well known for having an open ecosystem of partners, so we work with companies like Alation, BigID, and IBM.
And we not only provide metadata, we also receive metadata: quality scores or privacy scores or PII data classification and all that stuff. And now the question is how to use it. I actually like how Gartner calls it; they call it active metadata. So, how to activate it. There are many ways metadata can be activated, and that’s the only moment when you actually see use of it.
And you stop thinking about building one huge place where all metadata lives. I mean, who cares? Who cares? I don’t care if I have all metadata in one place or if I have 20 different repositories with some type of metadata. The key thing is to integrate and to activate.
And that’s I think where I see the market finally moving.
Shane: So many things to unpack there. I’ve got a whole lot of notes, so let’s take it from certain different angles and we’ll keep looping back on this one, because for me this is really, really interesting. So, that idea of your dad.
In the past, when I’ve been consulting with companies, they go, we want self-service for everybody. And I have a conversation with them about how there are some people who want to serve themselves with data, but the majority don’t. And Nickson actually had a really nice article they wrote on it.
They talked about self-service versus silver service. So this idea of, I just wanna be served a beautiful meal so I can eat it, versus, yes, I want to go up to the buffet or go into the kitchen and do the work myself. And when you look at the majority of the people that work, they don’t want to play with data. That’s not their job.
They’ve got a job to do and they need data to inform that job, and that data needs to inform what they should do next, or, I’ve got a blocker, what are my choices? And then they just want to get on and do their job. So a bit like your father: he uses data on a daily basis, but he’s not gonna go and play with it.
He’s like, okay, if the reactor’s getting hot, alert me. Don’t make me go and do self-service to say, how’s the reactor going today? It’s like, I know that if it gets hot, that’s probably a bad thing. Flash a red light. And so this idea that yes, self-service is important, but I agree with you, we’re fixated on this idea.
Everybody’s a data person. Everybody’s like us and wants to play with it. Whereas actually the majority of the world just want to use the data to do the next thing they need to do and get on with their day job. And so if we take that to lineage: in our product, we’ve got lineage built in because we think it’s a core feature.
Our view is any product that uses data should provide a form of catalog. If it’s using data, it should be able to surface a view of that data. It should be able to surface what the data looks like, any context, within the scope of what it’s using it for, and we’ll talk about interoperability in a single view later.
The second thing is it should show lineage. If it’s doing something to the data, it should be able to visualize what it’s doing, because there’s value in that. And those should just be table stakes. I shouldn’t need to go and buy a third-party product to be able to see those two things when I have a single system.
So as we’re building our lineage out, we think about it the same way you do. We think about what action we could take. So one of the things that was really interesting for me was when we built our first version of it, we call it data map. We worked out that, because we’re transforming the data, I could click on a node in that lineage map and say, run forward or run backwards.
It was really valuable because I didn’t have to go and find what we call a business rule, but think of it as a transformation. I didn’t have to go and search for the transformation and go, that’s the transformation I want, now run it, run it forward, run it backwards. I could go into that visual map, click on it and get it to do a task for me and save minutes, which is what we’re about.
And we’re currently working on this idea of column-level lineage. I love the sexy demos every vendor does when they talk about active column lineage or say we’ve got column-level lineage, because what do they do? They show you a pretty graph with two input tables, one transformation, maybe one aggregation node and a report.
And you go and look at even a small customer, you go look at their production environment, and I call it mad person’s knitting. There’s hundreds of nodes. They’re like spaghetti. We’ve thought about it, and there are three actions we want to take. And they kind of align with yours.
The first is somebody’s using a number on a report or a dashboard, and they’ve asked that famous question, where the hell did this come from, and what the hell did you do to it? So we wanna start with that and just give them a story: the lineage of where that data came from. The second scenario is somebody’s got a system of capture that we are bringing the data in from, and they’re gonna do something bad to it.
They’re gonna add a column, drop a column, change a column, remove a table, split a table. They’re gonna do some bad things that are gonna impact us, so if they do that, what are all the things that might be impacted? And then the third one is, I’m about to make a change. I have a transformation or a rule, and I’m about to make a change to that rule.
So I’m in the middle of the spaghetti. Tell me everything on the left and everything on the right that could be impacted by this change. And so for me, those are the three actions that traditional lineage serves. What’s your view? Apart from those three, do they resonate, and have you seen any other ones?
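Editor’s sketch: the three actions Shane lists (trace a report number back to its sources, find what a source change could break, and find everything left and right of a rule) all reduce to walking a lineage graph upstream or downstream. The Python below is a minimal illustration under assumptions, not the AgileData or Manta implementation. The node names, the `EDGES` list and the helpers `build_index` and `reachable` are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source -> target). All node names are invented.
EDGES = [
    ("src.orders", "stg.orders"),
    ("src.customers", "stg.customers"),
    ("stg.orders", "rule.order_totals"),
    ("stg.customers", "rule.order_totals"),
    ("rule.order_totals", "mart.sales"),
    ("mart.sales", "report.revenue"),
]


def build_index(edges):
    """Build downstream and upstream adjacency maps from (source, target) pairs."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return downstream, upstream


def reachable(start, neighbours):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


downstream, upstream = build_index(EDGES)

# 1. "Where the hell did this number come from?" -> walk upstream from the report.
report_sources = reachable("report.revenue", upstream)

# 2. "A source system is about to change" -> walk downstream from the source.
source_impact = reachable("src.orders", downstream)

# 3. "I am changing a rule in the middle" -> everything on the left and on the right.
rule_left = reachable("rule.order_totals", upstream)
rule_right = reachable("rule.order_totals", downstream)

print(sorted(report_sources))
print(sorted(source_impact))
print(sorted(rule_left), sorted(rule_right))
```

The same traversal serves all three questions; only the starting node and the direction change. Real environments add column-level edges and transformation semantics, which this sketch leaves out.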
Tomas: Definitely, that’s a very powerful start. When I think about a lot of customers, that’s basically the first thing they need and what they want, and it helps them with many different use cases and in many different scenarios.
Let’s assume you have really detailed column-level lineage across multiple technologies. You understand the semantics of those transformations. You understand what’s actually happening to your data. You are not just connecting two dots together. You actually really understand what’s happening to this data point and how it’s transformed.
So if you have all of this, you have what I personally believe is the most powerful subset of metadata you can get, and you can start thinking about a lot of very interesting use cases.
Maybe you have tables in your solution and you have labeled which ones are really key data sources, or you can label them automatically by using lineage, because you see that 90% of reports are using data from this place. And now in your solution, I bet that you have some profiling capabilities.
So you have access to information about, let’s say, quality of data going up and down, or potentially some data quality issues. So what you can do is immediately act on it, and you can immediately notify the users of your report and tell them: be careful, in the next 60 minutes I would not use this report.
Or you can completely shut down the report and just do something instead, like: okay, give us 60 minutes, we are working on it. So you can be proactive and it can be fully automated. And that’s what we see with data as well. You use data in your workflows and processes to fully automate them.
That’s what we want. So we are doing the same with metadata. You can think about optimization. You actually understand all possible ways data can flow in your environment. You understand density, you understand the example I mentioned earlier: most of the data for your dashboarding, analytics and reporting is coming from this place. You understand the critical path for your data, and you understand maybe overloaded components in your architecture, and you can use all of that to optimize.
Because you have all that information, you can make suggestions and recommendations about how those workflows could be done differently, or you can proactively notify the one or two people who are responsible for the architecture that there may be an issue with something, because, again, you see that the thing was changed 17 times in the last week.
So something is up, and someone should check on it. Or another really powerful example: you can simply ask, what are the workflows that you can completely decommission? Because you may have a flow of data and there may be a table at the end, but there is no outflow. No one is accessing the table. There is no report connected to the table.
It may just be sitting there. Last week we had a very interesting discussion with one of our customers, and just by applying this use case to a subset of their environment, they are saving something like 100,000 US dollars per month. Just a stupid thing like that. It’s stupid, but that’s exactly how, if data lineage is used properly, you can unlock a completely new universe of use cases.
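Editor’s sketch: the decommissioning check Tomas describes (a table at the end of a flow with no outflow, no report attached and nobody accessing it) can be expressed in a few lines of Python once lineage edges and access statistics are available. This is a minimal sketch under assumptions. The edge list, the `REPORTS` set, the `READS_LAST_90_DAYS` counts and the function `decommission_candidates` are hypothetical and do not come from the episode.

```python
# Hypothetical lineage edges (source -> target). All names are invented.
EDGES = [
    ("src.orders", "stg.orders"),
    ("stg.orders", "mart.sales"),
    ("mart.sales", "report.revenue"),
    ("stg.orders", "mart.legacy_export"),  # loaded every night, feeds nothing
]

# Nodes that are end-user reports; being a sink is expected for these.
REPORTS = {"report.revenue"}

# Assumed access statistics, for example taken from database query logs.
READS_LAST_90_DAYS = {"mart.sales": 120, "mart.legacy_export": 0}


def decommission_candidates(edges, reports, reads):
    """Return nodes that receive data but have no outflow, are not reports,
    and have not been read by anyone in the observed period."""
    has_outflow = {source for source, _ in edges}
    written_to = {target for _, target in edges}
    return sorted(
        node
        for node in written_to
        if node not in has_outflow
        and node not in reports
        and reads.get(node, 0) == 0
    )


candidates = decommission_candidates(EDGES, REPORTS, READS_LAST_90_DAYS)
print(candidates)
```

A candidate list like this is a prompt for a human review, not an automatic delete: as discussed earlier in the conversation, a flow that looks idle may still be a critical workflow that only runs once a year.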
Shane: That cost one’s really interesting, because in the old days we’d have a server sitting in a data center or, heaven forbid, in a cupboard, and it would have a database on it. Whether you were using Oracle or SQL Server or, yeah, it could even be Teradata, it was a sunk cost. So if we were running transformations, we were running workloads, the impact was other workloads ran slower, the reports ran slower.
Or if we’d segmented the warehouse out properly, our transformations ran slower and we had to optimize the schedule. Nowadays, every time we run something, it costs us because we’re using elastic compute. And so we talk about the cost of these cloud analytics databases. And yes, there are some simple things you can do.
The wait time, the ability to shut it down after a minute of inactivity rather than 30 minutes, so that way you’re not giving away free credits. But the other one we often look at is, okay, we’re doing all this compute on this data that’s not being used. And so it comes back to this idea you’ve got, that actually when we do anything around that DataOps, that lineage, that observability, it should be based on a use case that has an action that we want to change.
And what I’m seeing is most of the conversations around lineage right now in our market are around data engineers or analytics engineers. We’re talking about lineage for ourselves, not lineage for our consumers or our stakeholders.
And so it’s not a case of, let’s identify the data that’s used the most and see how we can make it more accessible or faster on the end report. Or actually, what are they clicking on in the report, because maybe we can alert them when that behavior happens so they don’t have to go check it.
Move from self-service to that silver service. So I think that’s an important pattern: whenever we do this work, we should be thinking about our consumers and our stakeholders more than we think about ourselves. That seems to be the way that you think about it with your customers.
Tomas: And I think it gets me back to the pattern thinking and, for a second, getting back to data itself, how we make use of data.
One very simple way is we give data to people and we give them an opportunity to search in data. So I can build my SQL or I can do something drag and drop, doesn’t matter, I’m searching in data for answers. Or it can be a sexy Google type of interface. Doesn’t matter. That’s one very simple way we can make use of data. But then when we really want to help people, we need to think, exactly as was the case with my dad, about how to transform data and prepare the information they need, and how to deliver it.
So it’s part of the workplace and workflow. So they do not need to leave it. They do not need to do anything extra. We all have these beautiful examples when we open our mobile banking application and we have some really useful recommendations or insights into our spend from our bank.
Instead of just giving us access to all transactions, which of course we can do. So it’s thinking about it differently, thinking about the end user, thinking about who can benefit from data and how to really help them. And the last case is this process automation thing.
Data is powerful. We can use data to automate a lot of things. And then when we move to metadata, it’s exactly the same scenarios. So we can either have metadata somewhere and let people use it. And that’s exactly the catalog use case. You use catalog and you use metadata in it to search for data to find data you need.
And that’s a very powerful use case for some user groups. They need new data sets, they need to build new analytics, and they are using the catalog to find the data they need. But at the same time, the examples I gave you are more: don’t even show metadata to anyone, use metadata to do something for other people. And that’s the real power.
That’s what I believe we must do more and more, and that’s how we can truly become a data-driven organization. Otherwise, we are just, like, bullshitting people. Sorry for my language.
Shane: That’s all right. I call bullshit all the time. So it is a complex problem though. I remember the previous generation of data catalogs.
So these are the ones, the Alations, the Collibras. Their whole premise was, back in the days before it was called AI, and maybe it was called machine learning, it was definitely after data mining: the problem was we have a catalog, and when we install the catalog, there’s nothing in it, and we’ve got to populate the catalog.
And we learned that when there are 3,000 tables and 30,000 columns, or heaven forbid you bought SAP and it’s just a plethora of mess, cataloging all that information, useful as it was, was never done.
We would never throw a thousand people at it for 10 years to do that work. So we ended up with these empty catalogs, and then there was a pattern that came out, which is, we’ll just throw the machine at it, we’ll throw machine learning at it. It will auto-tag everything. And it was a great idea, but at every enterprise that I worked with, where they went to implement one of those, we never saw the value out of it.
What happened was, for some reason, the catalog was never populated. And there’s a whole raft of reasons. It wasn’t the vendor’s fault, it was the organization, the process, a whole lot of reasons. So that’s hard, that problem of saying, I’ve got this noise, how do I get the signal out of it?
And then we look at the complexity of all the technology stacks. If we were providing a single pane across all of them, they all use different languages: SQL, Python. And even within SQL, I remember we used to laugh when I was at Oracle: everybody follows the standard, we just follow our version.
We follow the Oracle version of SQL, and Microsoft, those evil people, follow their version. And then we get into complexity. People love to automate things, so when they’re writing their transformations, they’d bring in templates that are looping. When I was at SAS, it was the macros.
When you tried to get the lineage, it was like, well, the lineage says, call this thing, and then 50 columns turn up, because we were using config-driven design, which I think is what Gartner calls active metadata now. And so there’s all this complexity, and if I look at what you have done, the fact that you’ve got large vendors actually licensing your product to do the work that in theory they could do themselves just shows that the problem is so complex that often they will just insource it to get that kind of value out of it.
So why is that? Why is building this visibility, this observability of lineage, for those use cases we talked about, traceability on a report, watching the runtime lineage and seeing what it does, change management, and that architectural view of all dependencies, why is it so hard to build a solution that delivers these?
Tomas: It’s actually a very good question. I have kept thinking about this for the last couple of years. Reason number one: if you think about basically anyone who is successful in the metadata market today, they focus exclusively on smaller companies, and by smaller I mean 500, 1,000, 2,000, maybe up to 5,000 employees, more or less.
And these companies run on one Snowflake instance or one Databricks instance. They have a few reports or maybe a few hundred reports, but it’s a simple, what I call vanilla SQL environment. And we pretend now that every organization out there will adopt dbt, and now everyone will migrate to Snowflake.
But hey, what? Snowflake is already supporting Python, and if not today, it will be soon. And today it’s Snowflake, but around the corner we already have two or three, I bet, really promising technologies that are going to replace it. I did it myself four times already: COBOL to Teradata, Teradata to Hadoop, Hadoop to Snowflake.
It’s just happening again and again, and pretending that it’ll be different this time is just being stupid. So one part of the issue is that we have technologies evolving very quickly, and no matter how hard we try, it’s just very, very difficult to agree on anything that would be one standard.
So that we will all, every company around the globe, do transformations this way, and now it’s easy and simple. But even thinking about Oracle or Microsoft, while building scanners we were discovering, and I’m switching to computer science language for a second, how the published grammar of Oracle PL/SQL is incomplete, and how you can write stuff in Oracle that’s not in line with the grammar itself.
So it’s as if you start speaking English and using completely made-up words, and it’s working and people are using it, and that’s not uncommon. People think about these very old technologies that everything is stable, but it is not. Oracle is one of the most successful companies out there.
They keep developing the software and they are doing crazy, crazy stuff. So that’s one part of the problem: the variety of technologies and a lack of any standards. And it’s really what you said, it’s a beautiful idea: every data system, no matter how big or small, should provide its metadata, including lineage.
Unfortunately, if you really think about it, for anything that’s even close to custom development, it’s extremely difficult. So once you allow anyone to write a Python script and use it as part of your transformation, you are doomed, unless you work with someone like Manta. That’s one part of the complexity.
The other part of the complexity, especially lately, is how we merge software engineering and data engineering together, which is great. I love it because I think there is a lot we can learn from software engineering. Not just copy it, but learn from it and adjust and use it the way we need. But as we merge these groups together, we are moving to completely different types of languages and environments.
And it’s creating even more complexity for data processing and for any form of visibility. And one thing I really deeply believe we heavily undervalue is that we expect a lot to be done with data. We have great ideas. We have a lot of requirements, and these things cannot be done with a few simple SQL statements.
No matter how powerful your database is and how you can avoid joins and whatever, it’s just a minor issue. Doing joins or not doing joins is just a minor issue. We are trying to express complex things, so why are we surprised that the end result is complex?
We use terms, and I do it myself and I hate myself for it, that create an impression that the way data architectures are designed and built for most systems is bad. It sucks, and in some cases it does, but in many cases it does not. It’s just a very complex requirement, or a very complex set of requirements from thousands of different users, thousands of different requirements, and you just try to help them, and now you are surprised that your environment is complex.
So that’s one piece. Even if we have standards, even if we have one platform and we run everything on it, which is just very hard to see for larger enterprises, you still have the last piece: we want a lot. We have it.
Shane: I do a lot of work in the Agile space, and one of the questions I constantly ask myself is: why are the patterns we use from agile in data and analytics so far behind software, and why do they kind of fit, but kind of don’t?
And it comes around to that complexity. If we wanted to reduce the complexity of our data platforms, we would force our systems of capture to store the data the way the end user wants to consume it. Not for data entry, but for data consumption. And if they could do that, then we wouldn’t need to do half the work we do.
But the organizations are making trade-off decisions. That’s why, when I look at data mesh and the four core principles they’re talking about, data as a product, federated data governance, I go: they’re all core principles we’ve been trying to do for years. The challenge is, if we push that data work down to the software engineering teams, where it should be, the question we’ve gotta ask ourselves is: will the stakeholder who’s funding that work, that change, agree?
When the software engineering team comes to them and goes, “So we’ve finished that feature, we are ready to push the feature out to the customer to test it and get some value back, but we can’t do it for three weeks because now we’ve gotta do the second bit, which is actually make the data fit for purpose for our internal consumers,” sure as shit, the product owner’s gonna go, “Push it. I want the value.”
So unless we can find ways of taking what we do as data professionals and embedding it in the software engineering practice so it just disappears, this idea of a magical data library like any of the other development libraries they use, it is always this two-stage process, and it’s a trade-off decision that every organization makes.
If we look at the idea of standards: I do remember many years ago we had the Common Warehouse Metamodel. The idea was, again, this shows you how old I am.
We had BusinessObjects, we had Cognos, we had Oracle Discoverer. We had all these really great tools back then, and we wanted a way that they could share their semantic layer. So the idea was that if I’d done some semantic definitions in a BusinessObjects universe, why couldn’t I actually share that semantic definition with the End User Layer of Discoverer?
And we’re seeing that in the metadata world now with OpenMetadata, OpenLineage, and blah, blah, blah. But there’s no incentive for any of the vendors to actually play nicely together. In fact, there’s a disincentive. I know that last year, for the modern data Jenga stack, we saw some startups play with each other nicely, but that’s only because it was free money.
They didn’t need to compete, and actually working together made sense because you build this little moat around everybody else. We’ll see how it goes this year, when they start running outta runway and they need to actually eat each other’s lunch. But in technology, we’ve had a few standards like SQL, which, even though there are variations, has become a de facto standard for a lot of data work.
What are you seeing in that metadata standard sharing domain? Do you think we’re gonna get there this time, or do you think we’re gonna fall back to the age-old behavior of, “I don’t wanna share my stuff because I’m opening up, I’m breaching the moat”?
Tomas: So it’s an excellent question. I think that we have a couple of companies at least trying to play this game for really quite selfish reasons, but that’s a good enough reason for me.
So we are actively participating in Egeria. That’s IBM, or it started as an IBM project. There are many big players involved, but one disadvantage of this, let’s say, movement is that it’s complex. On the other side of the spectrum, we have OpenLineage or OpenMetadata. In both cases, if you look at, say, OpenLineage, it’s just a funny thing: it’s so simple that it doesn’t help.
That’s actually one of the really key challenges: we design the standard in a way that basically kills every possible use case except for incidents, and it’s because the people designing the standard are thinking about this use case first. So a runtime lineage, a runtime view, and potentially addressing issues when they happen.
And we are also actively participating in this, and we want to do all we can to move it to the next level. And I really truly believe in standards, especially open standards and open projects like this, and we dedicate a lot of resources. We try to help as much as we can, and in our product we have a lot of planned investments in how to really support the OpenLineage standard, as I said, as much as we can.
So it’s a really big thing for me, and I really, truly believe that we can make it happen. Maybe it’s not going to be perfect, but way better. Just think about today: we basically build integrations, or we have partners building integrations, with every possible data catalog. It’s just a very typical use case:
“We love this lineage, but we want a simplified view to be available to data stewards and users of the data catalog.” Yes, I love the case. I’m more than happy to integrate, I believe in an open ecosystem, but now it means you go one by one and you are building these integrations, or your partner is building the integration.
It’s like everyone is so tired of it, it’s just wrong. It shouldn’t be.
Shane: Look at the pattern of an enterprise integration hub. We said we have 150 different applications in our organization and we need to integrate the data for operational reasons, not warehousing reasons.
And we said, look at the spaghetti, the mad person’s knitting. It’s a peer-to-peer interface between 150 and 150, which was just chaos. And then we looked to bring in these integration hubs. And somehow they became expensive; you’d often hear of a five or 10 million project in large corporates to put in an integration hub that nobody ever used.
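The "spaghetti" arithmetic behind this point can be sketched in a few lines. This is only an illustration under a worst-case assumption that every application needs a direct interface to every other one; real estates rarely connect every pair, and the function names here are made up for the example.

```python
def point_to_point_links(n: int) -> int:
    """Distinct pairwise interfaces if every system talks directly to every other one."""
    return n * (n - 1) // 2


def hub_links(n: int) -> int:
    """Connections needed if every system integrates once with a shared hub or standard."""
    return n


if __name__ == "__main__":
    # Compare the two integration styles for a few estate sizes.
    for n in (5, 20, 150):
        print(f"{n} systems: {point_to_point_links(n)} peer-to-peer links, "
              f"{hub_links(n)} hub links")
```

The same comparison applies to lineage tools integrating one by one with every data catalog versus all of them speaking one shared metadata standard.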
So it’s the same problem when we talk about metadata sharing. And it’s not even just about the metadata of what a table holds. It’s not just the metadata of the lineage of how that data moves. For me, it’s about the rules we apply. Often we’ll get asked, if somebody wants to move off our platform to another platform, how do we help?
And I go, well, we are not gonna give you the code we execute, because effectively that’s the one thing that we treat as proprietary to us. But what we’ll give you is the rules that you’ve applied to execute that code. ’Cause we effectively operate in a layered system. So we have a language that then converts into another language that converts into the actual SQL or Python that we execute.
So we won’t give you the SQL or the Python, but we’ll give you the other layers. We’ve been experimenting with exposing that language using a framework called Gherkin, so “given this, and this, then do this,” and so it becomes a quasi-natural language. But the challenge is, we can give you back your dictionary of every rule you have applied in a way that you can understand it, but you’ve still then gotta replumb it.
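To make the layered idea concrete, here is a minimal sketch of a readable Given/And/Then rule being converted into SQL. The rule phrasing, the `rule_to_sql` function and the table and column names are all hypothetical, invented for illustration; this is not AgileData's actual rule language or translation logic.

```python
import re

# Hypothetical rule text in a Given/And/Then style (invented for this example).
RULE = """
Given the table customers
And the column status equals 'active'
And the column country equals 'NZ'
Then select name, email
"""


def rule_to_sql(rule: str) -> str:
    """Translate the toy rule syntax above into a single SELECT statement."""
    table = None
    filters = []
    columns = "*"
    for raw in rule.strip().splitlines():
        line = raw.strip()
        if m := re.fullmatch(r"Given the table (\w+)", line):
            table = m.group(1)
        elif m := re.fullmatch(r"And the column (\w+) equals ('[^']*')", line):
            filters.append(f"{m.group(1)} = {m.group(2)}")
        elif m := re.fullmatch(r"Then select (.+)", line):
            columns = m.group(1)
        else:
            raise ValueError(f"Unrecognised rule line: {line!r}")
    if table is None:
        raise ValueError("Rule has no 'Given the table ...' line")
    sql = f"SELECT {columns} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql


if __name__ == "__main__":
    print(rule_to_sql(RULE))
```

The rule text is portable and human-readable, but a different platform would still need its own translator from rules to executable code, which is the replumbing cost being described.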
And it cracks me up when people go, “Don’t go for proprietary platforms because they’re closed.” And I’m going, that’s cool, but you know, you run on Snowflake, and that’s not open. It does execute SQL, yeah. You’ve now got 3,000 dbt models. Well, let’s call them blocks of code, ’cause they’re not actually models.
And language annoys me, and that one annoys me a lot.
Tomas: I agree with you. This annoys me as well, very much.
Shane: Just call it a code block. And so I’ve got these 3,000 blobs of code that five or 10 people have worked on, and actually five of them have already left to go and get their next job.
And somehow you are gonna move that across to Databricks. No, you’re not. Because it’s not about the execution platform, it’s about the replumbing of that logic. And so just like metadata, where we don’t have open standards where we can share, it’s the same with transformation: we don’t have open standards where I can say, “Here’s my transformation rules.
Go and import them to another platform and run them if you want to,” because that’s fair, it’s not where our IP is. So I hope we move to a stage where we become more interactive and more open. Not convinced we will. But while we’re talking about that idea of this complexity, you talked about starting off with a 500 to 1,000 person company as being small.
For me that’s large, especially in New Zealand. But if we look at the complexity we have in those really large organizations, those enterprises, the number of transformations they run, the number of platforms that they have to integrate, often I say it’s just a matter of time. You might be a 500-employee company and you think you are tiny, but give yourself two years and you’re gonna have 10,000 dbt models, ’cause you’re just gonna keep adding to them.
And it’s the same thing we see a lot with teams building out lineage or catalog capabilities, so they’ll often think about the cost. I talk to organizations about how the first time you hire your first data engineer, you’re effectively making a half a million to a million dollar investment. And they’re like, “What do you mean?”
I’m like, well, show me an organization that runs with one data engineer. There is, every now and again, a unicorn, but not often. So you’re gonna end up with four or five. Your fully loaded cost of those will be between half a million and a million dollars a year. So to build your own, you are starting off with that, assuming you get a good team of five.
And they actually are incredibly effective. And we see that with lineage and observability of the metadata. Teams will often start off building their own. They’ll start by building their own parsers or scanners off the log files, or grab some open source stuff and build on that. Or I’ve seen teams do it where they are actually annotating their code with an annotation language and then using something like SPX to scrape it out and use a UML diagramming tool to show the stuff.
And they’re all good starts. That’s really quick and really easy to get some value. But then you start going, okay, now you want to expose it to a thousand users, okay, we’re gonna plumb it into a wiki or a static site generator. Okay, now they wanna augment the context of it, or they wanna be able to click on something and see what it turns up.
Or we wanna do some alerting, like you said, that actually we’re seeing some stuff happening in the lineage that we need to act on. You start adding in all these things that are important to take on, and you’ve become a product development company. And so, given you are a product development company that just does lineage, although your definition of lineage for me is DataOps, and you’ve been working on it for years with some really large customers and seeing all those use cases, if you had to guess:
For a team to build a standard set of capability that meets the normal use cases, a team of three, a team of five, how long would it take?
Tomas: Ah, that’s a very good question. I think for simplicity we try to rate companies based on employee count, which is not accurate. We have a couple of customers with a few hundred employees, but their environment is just very complex.
If I think about an average company, an average small company, nothing exceptional, as you said: five data engineers, one data platform running on something like Redshift, Snowflake, BigQuery, whatever. Maybe Looker, Tableau. So I see a couple of different things. The first one is that it’s always very simple to do the first 50 percent.
In this case, we are not even talking about the 80/20 rule. It’s just easy to start. And especially if you are a computer science guy, you love the idea of building parsers, and now you are finally using that second-semester knowledge and you feel really great. So you build your first parser, or you get a grammar from the internet and you build your first parser, and it’s plain SQL and it’s just super, and then it gets more difficult.
And then maybe you need to cover stored procedures. And now you are moving from SQL to a procedural language, and now you have one procedure calling another, you have parameters and you have parameter context. And now you need to move from the second semester to the fourth semester, especially in systems programming.
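The jump in difficulty described here can be seen in a small sketch. It uses a deliberately naive, regex-based table-level lineage extractor (not a real parser, and not how any particular lineage product works); the SQL snippets, table names and the `table_lineage` function are hypothetical examples.

```python
import re


def table_lineage(sql: str) -> dict[str, set[str]]:
    """Map each INSERT target to the tables named after FROM/JOIN in the same statement."""
    lineage: dict[str, set[str]] = {}
    for statement in sql.split(";"):
        target = re.search(r"\binsert\s+into\s+([\w.]+)", statement, re.IGNORECASE)
        if not target:
            continue
        sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", statement, re.IGNORECASE)
        lineage.setdefault(target.group(1).lower(), set()).update(s.lower() for s in sources)
    return lineage


# Plain SQL: the naive approach recovers both source tables.
SIMPLE = """
INSERT INTO mart.revenue
SELECT o.customer_id, SUM(o.amount)
FROM staging.orders o
JOIN staging.customers c ON c.id = o.customer_id
GROUP BY o.customer_id;
"""

# Procedural code: the real source arrives as a parameter inside dynamic SQL,
# so the source table never appears as text next to FROM.
PROCEDURAL = """
CREATE PROCEDURE load_mart(src_table VARCHAR) AS
BEGIN
  EXECUTE IMMEDIATE 'INSERT INTO mart.revenue SELECT * FROM ' || src_table;
END;
"""

if __name__ == "__main__":
    print(table_lineage(SIMPLE))
    print(table_lineage(PROCEDURAL))
```

For the plain statement the sources are found; for the procedure the target is found but the source set is empty, because resolving it would require tracking parameter values across calls, which is the "parameter context" problem mentioned above.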
So that’s one part. We tend to see it as very simple, and then you move from plain SQL and you start dealing with more complex things. We hate Excel sheets as data engineers, and we don’t understand that it is the number one tool everyone who is working with data wants to use, but you need to do lineage for that as well.
You can’t ignore it. So now you are moving from a super sexy SQL parser to, wow, this mess in Excel sheets with everything you can imagine. And yeah, actually, by the way, our risk team, they’re actually using SAS analytics, because SAS analytics is, by the way, a really good tool for risk analytics.
But SAS, how do we do that? It’s macros and, wow, it’s a completely new universe. So we have spent a decade so far building a variety of scanners. And of course, in your company you only have a subset of those technologies, so you don’t need 10 years and 100 people to build that.
But even with the basics you have, if you have five people building it and you really want to do it in an automated way and you don’t want to do most of the stuff manually, you will definitely spend way more than a year with five people trying to even cover SQL and the more advanced use cases, way more than one year, not even talking about the other stuff that will take you even longer. And you still have just the scanning part, and now no one will actually use your metadata.
Maybe you will use it for impact analysis and incident resolution. Okay, fine. So now you have invested a few million dollars just to have a tool for yourself, and you need to keep investing because those SQL things, they develop new features. You need to cover those unless you are okay with not using them at all.
And it’s still just very basic, plain use. So what if you actually want to see some value from those millions invested? So you start thinking about activation and real analytics, but that’s a completely new discipline. You need to understand very hard stuff like network algorithms and graph algorithms.
And you need to really think about business value and about tens and hundreds of different use cases. And you keep building and building and building, and you need 50, 100 people, and it makes no sense to me.
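The simplest of the graph algorithms mentioned here is a traversal for impact analysis: given a changed asset, find everything downstream of it. The sketch below is a plain breadth-first search over a hypothetical lineage graph; the asset names and the `downstream_impact` function are invented for illustration, and real lineage graphs are far larger and usually column-level.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["mart.revenue"],
    "staging.orders": ["mart.revenue", "mart.order_counts"],
    "mart.revenue": ["report.revenue_dashboard"],
    "mart.order_counts": [],
}


def downstream_impact(graph: dict[str, list[str]], changed: str) -> list[str]:
    """Return every asset reachable from `changed`, nearest first (breadth-first)."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # visit each asset once, which also guards against cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    print(downstream_impact(LINEAGE, "erp.orders"))
```

Traversal alone is the easy part; the activation and analytics use cases described above build on top of it.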
Shane: I think for those companies, it’s like you said, a bunch of smart people that can build smart stuff.
They have teams that are big enough to allow ’em to do it. I’m guessing, ’cause I dunno for sure, but it’s easier to hire people than it is to buy a license. The theory of sunk cost comes in, but it’s not visible, a bit like technical debt with agile teams. The debt sits under the covers and most people don’t see it.
And for me, again, you talked about changes in technology over time. We’ve seen centralization, decentralization. We’ve seen mainframes, and then client-server, and then network apps, and then cloud, and we’ve seen the kind of rise of DuckDB and people like MotherDuck, which effectively is, for me, another wave of decentralization.
We’re talking about moving the data down to the desktop, to the browser. Actually, most of the patterns we use for lineage observability at the moment are based on servers. We expect that there is a server that is logging what it’s doing, and we can parse those logs, or we’re expecting there is a server that is transmitting the information we need.
As soon as we move into that new world where we are doing local compute again, there ain’t no log, right? We’re not gonna be harvesting the log files off my PC, because it’s designed to not do that. So unless DuckDB and those tools are now transmitting the logs or the metadata back to a central server, which they should do, we’ve got a whole new architecture to worry about in the next generation that’s gonna hit us. We’re talking about edge node computing now. Again, that’s another set of patterns. The cool thing is we’ve seen it before, so we know what we need. Just to close it out, I think, for me, language is important.
And when you talk about data lineage, actually you talk about a whole lot more. And I go on rants about buzzwashing, where people take what they used to do and then call it the new thing. “Oh yeah, we’re a data warehouse company, but we do data mesh.” Even though we have central storage, central team, everything’s centralized, but we are data mesh.
I think you are probably the opposite: actually, you do a lot of the things that we talk about as areas outside of lineage, but you’re just going, “No, we just do lineage.” And for you, lineage is traceability of the data that turns up in a report back to where it came from.
It’s about that runtime lineage: what broke? Why did it break? How do we fix it? It’s about that change management. It’s about seeing what’s about to change and what’s the impact of that. It’s about the architecture: what are all the moving parts, how are they put together, and how do we see that?
That covers observability. It covers DataOps, it covers a whole lot of stuff that the market talks about now. But for you, it’s that focus on the next piece, which is: okay, now I can track it, now I can see it, actually, what do I do with it? And that’s the next generation: what action do I take?
How do I use this lineage, this metadata, to actually reduce time or reduce risk or reduce cost? How do I get value out of it? Because I don’t just wanna store it. I want to actually use it so we get some value out of it. So for me, those are the important patterns that we should look at. We’ve done some work.
What was the value we got out of it? Apart from, as a data engineer, I can see my own. Because hey, that’s got some value, ’cause it makes you faster. But that’s not it. It’s what do we get for our stakeholders? What do we get for our consumers? How do we make their life easier? So just to close out, is there anything else from your point of view that’s really important when we talk about this whole world of data lineage?
Tomas: Yeah, I love the discussion. Thank you again for having me. I think we could continue for hours. I would probably get back to that thing with standards, and I actually love how you see metadata and lineage and how you think about it. Basically, you try to make it part of your product, and actually that’s something we really, really discuss internally right now: how we can change the market so we help people like you, and people building software in big banks and big telco companies and healthcare companies, or small, midsize, doesn’t matter, and how to help them to build software in a way that metadata and lineage is provided.
I think it’s super powerful, super difficult and challenging, but it’s something we are becoming super passionate about, because that’s how we can move from where we are to the future where we think about usage, activation, and analytics, and not just about collection. The collection sucks. There is no value in collection.
Shane: We collect it for a reason. We collect rainwater ’cause we wanna water the garden. We want to drink it, not just so we go, “Oh look, there’s a big bucket of rainwater.”
So think about the use, think about the value we get out of it. Don’t just think about that first step. Hey, look, that’s been great. If people wanted to follow you, get hold of you, see what you’re doing, see your latest thoughts, what’s the best way for people to get in touch?
Tomas: I think the best one is LinkedIn.
It’s just quite easy. I was one of those early guys, so I don’t have any numbers or any strange letters attached to my name on LinkedIn, so you can find me there easily. And I really try hard to be active. But as I said, it’s sometimes hard; we are a bit consumed by our customers. We are not so visible,
and we live in the bubble of large enterprise companies, and I just try to share as much as possible, especially because I believe there is a lot we can learn from those companies. We call them ugly names, but there is a lot to learn from those really experienced enterprise architects, data architects, and data engineers working there.
So you can find me there, on LinkedIn.
Shane: I’ll put your contact details in the show notes, but I just wanna echo what you said: 10 years ago we did things like data modeling because we saw the value of it when we hit the complexity of the data that we dealt with. And then somehow, as a profession or a domain, we lose it for a while and we just write code and blobs of code and big tables and no model, and then it comes back again, the return of the data model.
We can take the lessons of our past, and we can apply them in new, innovative ways to make it better in our future, but we shouldn’t lose them. We shouldn’t throw them away, but we shouldn’t implement them the way we used to. It’s that balance of innovation, doing new things with old lamps.
So look, that’s been great. Thank you. I really appreciate your time and hope everybody has a simply magical day. Thank you very much. Have a great day. And that, Data Magicians, was another AgileData Podcast. If you’d like to learn more on applying an Agile way of working to your data and analytics, head over to agiledata.io.