Observability – Raj Joseph

Mar 16, 2023 | AgileData Podcast, Podcast

Join Shane Gibson as he chats with Raj Joseph on his experience in defining data observability patterns.


Podcast Transcript

Read along you will

Shane Gibson: Welcome to the “AgileData” podcast. I’m Shane Gibson.

Raj Joseph: Hey, thanks for having me, Shane. I’m Raj Joseph, CEO of DQLabs.ai.

Shane Gibson: Hey, Raj, thanks for coming on the show. We’ve got kind of an exciting one today. Well, they’re always exciting. But this one’s another exciting one for me. So we’re going to talk about observability and all the other abilities and data words that we can bundle into that one, and what patterns are useful and what patterns aren’t. Before we do that, why don’t you just give the audience a bit of a background about yourself and your history of working with this magical thing called data?

Raj Joseph: Sure. So my background started off in data engineering. I started as an engineer back in 1998 with an engineering degree out of college, and then went through lots of startups and bigger FinTech companies, and slowly moved into product leadership for a data marketing company. They had a big data platform with a focus on quality. And one of the challenges we had there was trying to consolidate data from different sources and build a consumer profile. And of course, lots of duplicates, lots of bad data. So my life, in and out, for a long while has been data quality. Then later on I started a services company in 2010, focusing on data management and governance. And then in 2020, I came into an opportunity to build a modern data quality platform with a focus on bringing observability and business quality together. And so that’s what I ended up doing. So that’s a brief profile of me.

Shane Gibson: Good. That’s one of the reasons I wanted to have you on the show, because you’ve been there, done that. It’s a bit like me, where I’ve been a practitioner delivering data to customers, finding all the horrible things that are nightmares for us, and then asking how do we create products or services that fix that? And your concept around bad data kind of reminds me of one of my favorite t-shirts, a “Breaking Bad Data” t-shirt. Trifacta created it. I’m kinda tempted to make a new version, because the one I’ve got I’ve worn so often it’s got holes. So maybe that’s what we’ll theme this whole podcast around: “Breaking Bad Data”. So to start off, data observability. Observability is kind of a new word coming out, but the patterns underlying it have been around for ages. And there’s always a risk with new terminology that it’s just buzzwords rather than things that have got meat and potatoes behind them. So for data observability, how would you describe that pattern if somebody asked you for it?

Raj Joseph: So for me, I think observability is a capability. Because when you’re observing, you’re looking at the surface: shallower, broader data, but faster processing. So when you’re having bigger growth in data, it’s a much easier capability and technology that can be adapted for multiple different reasons. Observability has always been around, and now with the data focus we have this concept of data observability. And even within data observability, I try to see that observability can be used in multiple ways. For me, it always goes back to quality, and how we can observe data to figure out some aspects of quality, because quality is a bigger, broader category in itself. And for me, observability gives some kind of measure of reliability. So what I perceive when I hear the word observability is a capability that I can use to measure the reliability of data delivery. That’s one aspect that immediately comes to mind. But there are other dimensions that come in this day and age, with modern cloud data architectures, where architectures are spread across hybrid environments and cloud warehouses, and where cost is a factor. So you can also use observability as a capability for cost. And then if you want to measure some of the business quality, you can also use it for that: are there any anomalies that you can check, etc. But reliability is where I tend to focus.

Shane Gibson: So for me, I agree. Data observability is a bucket. It’s an overarching term, but like a cloud analytics warehouse, that’s an overarching term that has multiple parts to it, multiple patterns. Each of those patterns has value, and we’re kind of saying this is a new category, a new group of patterns that we want to put together. One of the ones I want to pick up on that you talked about is this idea of scale. So if our data platforms or data systems or the data we work with are tiny, and not a tiny number of rows, but a tiny number of columns or data sources or users, so we’re at a really small scale, a human can observe. We go and grab one table from one system, I can observe that. If I’m only running it once a day, I can go in and observe it happen. If I’ve only got one user, I can observe what they’re doing to see where the problem is. So that’s all good. But as we scale, as we add 100 different systems, as we add 1000 users, as we add 3000 tables with 60,000 columns, we bring in SAP, we can’t observe any more as humans. So we need the machine to help us, and for me, that’s what observability is about: the machine doing as much of the work as it can. Because it ain’t, even with ChatGPT, fully magical. I don’t have a magical observability hat where I just dump all my data in and I go to sleep and don’t worry, and it tells me. But it gives me a heads-up. So I like the way that you broke it down. We can look at observability patterns for quality. We can look at observability patterns for cost. We can look for reliability. And we can look for anomalies. They are four really good areas that we can do. So what’s your favorite one, which one do you want to do first?

Raj Joseph: Since most of my life has been in data quality, I would focus on quality as the place to start, and it’s a broad category in itself. Because if I asked you the question, what is data quality, you may come up with a different definition than what I may come up with. And most of the time, what I have understood in the industry is that it’s really in the eye of the beholder, meaning the roles and the responsibilities define what they’re looking for. When you’re talking to a data scientist and ask them about data quality, they are talking more about fit for purpose, meaning when they’re building analytical models, the needs of that particular model may be very different from what they might have needed a day or a month back. So their needs are always changing, based on the business strategies and the tactical things that they do in terms of solving for those strategies. While if I talk to a data engineer, theirs may be a little different. Maybe they are looking more at the reliability of the data, in terms of making sure that their data gets from a source to a target, and that their pipelines are processing as expected, and then they try to prevent as much as they can up front. So that’s where I would start observability, with a focus on quality. And then it’s a pretty broad topic in itself, based on the roles and responsibilities of the user.

Shane Gibson: Yeah, I agree. I think if you take a persona view of it, you get different answers. So if you say to a business consumer, what do you care about in terms of data quality, they go, I need to trust it. If you tell me I’ve got 100,000 customers, I just expect that number to be right. Why would it not be right? It’s really interesting in the data world. We have this habit of just giving data to consumers and getting them to find the problems. It’s a car with only three wheels. And we ask, is the fourth wheel really necessary? And then the consumer hops in the car, it falls over and scrapes on the ground. And we’re like, “We can fix it”. So for me, that’s what they say. Whereas you’re right, if you’re an engineer, or you’re in the plumbing, then you care about some other things. One thing I find a bit of a struggle is, if you look at DAMA, and data management and data quality have been around for decades, there’s always the abilities. There’s always six abilities for data quality, buckets or categories you’re gonna look at. And for me, I’ve always struggled with a pattern or a rule and where it fits. So the one I had the other day is, let’s say that we have a table and the table’s got a foreign key on it. So each of the values in that column is meant to be unique. They’re meant to be only ever one. And we bring that data in, and it fails that test for whatever reason. Is that a uniqueness test that failed? Is that a duplication test? So with those categorizations that we often hear about, each of the rules that we could use could go into any of them. What have you found, when you had to categorise the quality as being high or low for a certain thing? Do you use the standard six from DAMA, or do you think about it differently?

Raj Joseph: I mean, I know the six quality dimensions you’re talking about. At some point, as a practitioner and then as a preacher of data quality and all those principles, I was madly following all these dimensions without understanding the why and the what to a specific level. But what I have understood more and more is that it’s more than the measurement. The purpose actually defines the measurement. So measurement without a purpose is messed up. The six categories are very useful when you get into defining the purpose, or if you don’t have a structure or a framework for how to measure, they kind of give you one. But what I have seen from working with different clients is that each one has a different level of expectation and different governance policies that they can manage. So I’ll give you one example. This is a bank in Canada. This particular bank doesn’t look at quality the same way across all the data. First, what they do is define data in three levels: level one, level two, level three. Level one is, good, I have the data. If it is there, that’s great. Level two is something that is used for reporting and some kind of data science model. Level three is used for KPIs and business metrics, because the leadership is going to be looking at it. An example of level three could be revenue metrics across products, while level two could be a product name. And level one is gonna be some flags that may be sitting there, ones and zeros of sorts, something that needs to be turned on and off, a feature for example. So for them, when they say level one, do I care about quality? No. Do I care about reliability, meaning whether the data is there or not? Yes. If for some reason there is a big data gap, or data leakage? Yes, I want to look into it. Maybe a system is down, maybe something has happened.
So the level of detail of how we need to measure quality is based on the criticality of the data, derived from the business use of it. That’s how they do it. So level one, level two, level three, and as you go up, your six dimensions of data quality are needed much more. And it can go even beyond that. These six dimensions are more quantitative, and it could even be subjective: the data is there, and the six dimensions say the quality is great, but if nobody is using it, then what?
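The tiering idea in the bank example can be sketched as a lookup from criticality level to the checks worth running. This is only an illustration: the check names and the split of the six dimensions between levels two and three are assumptions, not details given in the conversation.

```python
# Hypothetical mapping from criticality level to checks.
# Level numbers follow the bank example; which dimensions land
# at which level is an assumption made for illustration.
CHECKS_BY_LEVEL = {
    1: ["presence"],  # is the data there at all (gaps, leakage)?
    2: ["presence", "completeness", "uniqueness", "validity"],
    3: ["presence", "completeness", "uniqueness", "validity",
        "accuracy", "consistency", "timeliness"],
}


def checks_for(level: int) -> list[str]:
    """Return the checks to run for an asset of the given criticality."""
    if level not in CHECKS_BY_LEVEL:
        raise ValueError(f"unknown criticality level: {level}")
    return CHECKS_BY_LEVEL[level]
```

A feature flag column would get `checks_for(1)`, while a revenue KPI would get `checks_for(3)`, so the expensive checks only run where the business impact justifies them.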

Shane Gibson: It’s the boil-the-ocean problem. It’s the same problem we have with data catalogs, where we bring in every SAP table and column, and then we’ve got 60,000 columns. Nobody’s going to catalog all that, and they shouldn’t. We can catalog what we care about. So I think that’s good. The way I kind of think about it is blast radius. If we have a problem with the data, whether it’s quality, cost, there’s an anomaly or it’s not reliable, what’s the blast radius? What’s the impact to the organization of that data being wrong, not being trustworthy? And then we tailor our patterns because of it. And that’s why, for me, we can often figure out a small set of things we want to focus on to begin with, because we understand what data is really, really important. It’s normally the data we touch first. Not the first bit of data we bring in, but the first data we expose to a consumer because they’ve asked for it. And then the second thing is, often we have a problem, and then we have to go and observe and manually figure out what the hell went on, and then we have to go and fix something. That’s when we need to bake in some observability. We have to say, “Let’s just make sure that if that happens again, we identify it.” And ideally, we go fix it so it can never happen again. But we definitely want to be able to observe it if it does, and be notified. So again, start small, bake in the simple ones, go figure out what fields shouldn’t be null. Real simple ones, do that first.

Raj Joseph: Exactly. Those kinds of checks are what I call wide and shallow validations. Like null checks, blanks, empties, uniqueness, or even the number of unique values, or data distribution in terms of the statistical evaluation that you do. You’re checking in a wider, faster way across bigger amounts of data, with more shallow validations. You’re still at the surface level, you’re not going deeper, you’re not doing any narrower validation like what you may do in business checks. And I think that’s right at that level one zone we were talking about, not boiling the ocean. But as you go further down, that’s where they start defining what is critical. And when it comes to business quality checks, to your point, I always tend to look at the business impact. The first thing is, what is the business impact? And it’s not an easy question to answer, but the way I define this business impact is, what are the KPIs? The KPIs can be easily translated to the actual data that is being provided to us to make the KPI metrics. So I can look into the metric stores or the business metric stores, and from there I can connect to the tables and the columns and then drill down into it: this set of data is much more business impacting versus all this other non-business-impacting data.
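As a rough illustration of the "wide and shallow" checks described here (nulls, blanks, uniqueness, distinct counts), the sketch below profiles one column of an in-memory table. The function name `shallow_profile` and the row format (a list of dictionaries) are assumptions for the example; a real platform would push these aggregates down to the warehouse.

```python
from collections import Counter


def shallow_profile(rows, column):
    """Surface-level checks for one column: nulls, blanks, uniqueness."""
    values = [row.get(column) for row in rows]
    null_count = sum(1 for v in values if v is None)
    # A blank is a string that is empty or only whitespace.
    blank_count = sum(
        1 for v in values if isinstance(v, str) and v.strip() == ""
    )
    counts = Counter(v for v in values if v is not None)
    return {
        "row_count": len(values),
        "null_count": null_count,
        "blank_count": blank_count,
        "distinct_count": len(counts),
        # Unique means no non-null value appears more than once.
        "is_unique": all(n == 1 for n in counts.values()),
    }
```

None of these checks knows anything about business meaning, which is what keeps them cheap enough to run across every column.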

Shane Gibson: And for me, usage of the information is something we should be logging, measuring and monitoring: how often a certain piece of data is used and who it’s used by. That helps us with our blast radius. So we can say, “Actually, if the data here becomes untrustworthy, we already know who’s using it”. We can communicate with them, or we can understand whether that’s affecting one user or 1000 users. We can’t quite tell whether it’s affecting the one piece of data that one user happens to work on manually, that drives the whole profit of the business, versus the 1000 users that are just in the factory part of the process. But you often don’t see that usage logging as part of observability. It tends to sit in another category. For me, it needs to be in observability, if we’re saying we need to be able to observe things that are important at scale without a human doing it manually. That’s our definition. The other one that’s really interesting for me, and you kind of talked about it a little bit there, is around cost. In the old days when we had on-prem databases, we had an Oracle or SQL Server, we could run a hell of a lot of observability tests if we wanted to, because there was no additional cost of hardware. There was just a cost to the consumer, because we were taking all the CPUs, so their reports would run slow, or our ETL would run slow. In the new cloud world, when we add these new observability capabilities, this new workload, there is a cost. We start paying for more credits, if you’re using one that’s credit based, or more compute or more CPU or more slots, depending on which technology you’re using. But you actually have a large overhead on that platform. And what we’ve seen is running our observability stuff can actually cost us more than running our transformation stuff.
And that’s unless you really think about the pattern, and I’ll give you an example. We might say we’ve got this table that’s got some categorical variables. It’s got some regions, and under each region is a bunch of cities. We know that there’s a mapping, and we know that in the source system someone’s going to change that mapping. They’re gonna go remap a city, or a new city turns up, or something’s gonna happen. And we want to capture that from an observability point of view, because we know that, for some reason, that mapping is used for our financial forecasts that go out to our shareholders. So it’s a high-risk piece of data for us. A lazy architecture basically says, scan those tables for distincts. But then you’re scanning that entire table every time, and that’s expensive compute. So you need a pattern where you break out that unique set of values into its own small table. And every time a new piece of data turns up, you compare that piece of data to your reference table and ask, have I seen this one before? Now you’re only processing new rows. You’re reducing your cost of compute. So you have to bake that in. Do you find that? Do you find that observability, quality, anomaly detection, reliability, cost monitoring, usage logging are all highly expensive compute in the new cloud world?
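The reference-table pattern described in this turn can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular platform implements it: the function name `find_new_mappings`, the `region` and `city` field names, and the use of a Python set as the "small reference table" are all assumptions.

```python
def find_new_mappings(reference, new_rows):
    """Check only newly arrived rows against a small reference set.

    reference: set of (region, city) pairs already seen (mutated in place).
    new_rows:  iterable of dicts holding 'region' and 'city' keys.
    Returns the pairs that had not been seen before, in arrival order.
    """
    unseen = []
    for row in new_rows:
        pair = (row["region"], row["city"])
        if pair not in reference:
            unseen.append(pair)
            reference.add(pair)  # remember it so it is flagged only once
    return unseen
```

The cost of each run scales with the number of new rows rather than the size of the full table, which is the point of the pattern; the full-scan alternative would recompute the distinct set every time.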

Raj Joseph: Yes, you’re right. I mean, the adoption of cloud is growing, and it will grow. No question about that. But what’s very interesting is that the adoption of cloud has also slowed a little bit at some point, because of the cost factor. We have folks using all these modern cloud warehouses and cloud lakehouses and things like that, and they move the data in. And then every time they query, there is a cost. If you want it to be processed even faster, then you need to pay more for the size of the warehouse or the compute power that goes around it, which makes sense to a degree. And then to your point, you cannot be observing the whole ocean, because you’re actually computing more, and therefore the cost is more. In some business transactions we come across, if your spend on cloud warehouses or lakehouses is $1 million or $2 million, your observability spend needs to be a fraction of that. It cannot be something like 300k or 500k, or massive, which calls the value of it into question. And the way to do that is, rather than spinning up and running all observability metrics at the table level, you can use some of the logging mechanisms of the platform. For example, Snowflake provides metadata logs in terms of volume and freshness, and most of these modern data warehouses provide time travel mechanisms, so you can see the snapshots in time and when the data got updated. So you can easily retrieve some of that information without doing any kind of complex query logic. With some legacy databases or on-premises systems, it does not scale to that level, which is OK because they’re not priced on compute. The point I’m trying to make is, when you start with observability, you don’t necessarily do it all. Maybe just stick with a few dimensions of data.
So we always recommend volume, which is a row count. And then we look at the schema, which is more the column counts. And then we look at freshness, which is kinda like the timeliness of the data, when it got updated. And then if you want a fourth metric, we also look at duplicates. The duplicates could be based on primary keys, or on a composite key on a particular table or asset. So we just stick to those for every asset. You’re not doing much deeper checks, but at the same time you get some higher-level checks and statistics that give you some information. So that’s one. And then the second aspect of what we do is look at usage information. Who’s using this table? And in what role: is it a system, which is kind of spending for no reason, or is it an actual user from a consumption standpoint? At least that can tell you how relevant or how impactful it is in some ways. Just because it’s being used doesn’t mean it’s business impactful, because somebody may be running a query for no reason. But at least it gives us another dimension of data to look at and combine, and then we go into a methodical process of defining criticality using these KPIs and the other things around them.
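The four per-asset metrics named here (volume, schema, freshness, duplicates) can be sketched as one small function. This is an in-memory illustration under stated assumptions: rows are dictionaries, freshness is measured from a timestamp column whose name is passed in, and the function and key names are hypothetical. In a warehouse, volume and freshness would more likely come from platform metadata than from scanning rows.

```python
from collections import Counter
from datetime import datetime, timezone


def table_metrics(rows, key_columns, updated_column, now=None):
    """Compute volume, schema, freshness and duplicate metrics for a table."""
    now = now or datetime.now(timezone.utc)
    # Schema: number of distinct column names seen across all rows.
    columns = set()
    for row in rows:
        columns.update(row.keys())
    # Duplicates: rows beyond the first for each (possibly composite) key.
    keys = Counter(tuple(row.get(c) for c in key_columns) for row in rows)
    duplicates = sum(n - 1 for n in keys.values() if n > 1)
    # Freshness: hours since the most recent update timestamp.
    stamps = [
        row[updated_column] for row in rows
        if row.get(updated_column) is not None
    ]
    latest = max(stamps) if stamps else None
    freshness = (
        None if latest is None else (now - latest).total_seconds() / 3600
    )
    return {
        "volume": len(rows),
        "schema": len(columns),
        "freshness_hours": freshness,
        "duplicates": duplicates,
    }
```

Passing `key_columns=["order_id", "line_no"]` would treat the pair as a composite key, matching the composite-key case mentioned above.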

Shane Gibson: Yes, I agree. I think the technology choices you make as part of your stack actually drive how you manage the observability cost. An example is that we’re all based on BigQuery, and we have a choice between slot-based and size-of-query pricing, so we’re effectively able to profile the data within the same query at relatively low cost for us on BigQuery. So what we do is profile the data when we process it and store the profiling information next to it. If we were using something like Snowflake, where you’re actually paying for the size of the compute, then you want to be a little bit careful, because every time you profile everything, you’re going to pay for it. Whereas if you’re using something like a lakehouse, where you’re using Iceberg as storage, then you probably want to profile the data as it comes through again, but this time you’re storing it as another object. You’re storing it next to the data. So again, those patterns are slightly different. One thing that really interests me, if we think about cloud databases, is that they inherited the on-prem database patterns. So they typically have an information schema, and we can go and say, “What’s the name of the column? What’s the column type? Is it numeric or character?” What’s interesting is they don’t do that with profiling data. They don’t store that this column’s got nulls, they don’t store the standard deviation of that column or the maximums. And so it’s gonna be really interesting to see over time whether they actually capture that, store it and serve it up for free, without a cost of compute. Because actually, as data users, that’s what we want. They know what the data looks like, because they’re storing the bloody stuff. So just give it back to us and charge us.

Raj Joseph: You already see it. I think all these modern warehouses and lakehouses are already going down the path of introducing some observability metrics as part of their stack, and it’s inevitable. I think they have to do it, because it adds more value as a platform offering. It also touches a little bit on the data management aspect, which is another big area of revenue. So I would see more and more of these vendors trying to build observability metrics into their platform by default. I think the challenge a business has is, I mean, it would be great if an organization had only one technology, but that’s never the case. They use a combination of technologies. I’ve seen people using both Snowflake and Databricks. They compete with each other, and yet people use them for different reasons. So I think that’s the reality of it: some vendors may incorporate it, and still others won’t. And then your ecosystem has multiple technologies. So how do you measure this as the data moves from one place to another, in a way that also gives you a bigger, whole vision of where the data stack is from a health standpoint and from a reliability standpoint? I think that’s pretty cool.

Shane Gibson: I think it’s a really good point you raise around the different technology stacks, but also about the end-to-end process. We’re still at the stage where a lot of the logic gets landed in the last mile. It goes into the Tableau or the Power BI or Looker Studio reports, or their semantic layer. And so we still have the ability for the quality or the reliability of the data to be broken at that last mile. I liken it to a kitchen. We measure the quality of the goods coming into our storeroom, we measure the quality of the process we use to bake in the kitchen. We measure the quality of how we serve it up to that little window. But we don’t measure the quality of the waiting person who picks it up and then spits in the food before they deliver it to you. So a lot of the observability capabilities go just up to that serving layer, but they don’t then worry about what’s happening in that last mile, and so we can degrade the quality there. So are you seeing that? Are you seeing the market starting to move, that people are starting to worry about what happens in that last mile of delivery before it gets to the consumer?

Raj Joseph: So there is a lot of innovation that is still yet to happen in the data world. I mean, even the adoption of cloud is still in its early stages. We haven’t seen the full revolution of everybody being in the cloud or everybody having adopted it, because we see cost as a factor that needs to be optimized and contained, and the governance around cloud in terms of privacy classifications. The end-to-end visibility is still a struggle, because all businesses today have some on-premises systems, some legacy technologies and some modern technologies. That’s the reality of it. And getting end-to-end visibility across that is not an easy problem to solve. It’s very challenging, too, because each vendor and each technology gives you some part of it, and not necessarily the same thing in terms of metadata. So you’ll get some time travel and metadata logs from some of these modern stacks. But if you go to the traditional systems, you have to parse through lots and lots of logs to get to that point. So there is variation, and all of this makes it more and more complex. But I think rather than trying to get all of this visibility, if you can at least define some metrics at each one of those stop points, then I think that would help you get closer to it. In reality, I don’t think any company can get 100% visibility. But the more they can get, the more it helps them be smarter and faster in terms of their data-driven strategy.

Shane Gibson: I think that’s a good point, actually. People often underestimate that every time they add a new technology or a new moving part to their data stack, they’re actually increasing complexity, and therefore they’re reducing observability, unless they can apply those observability patterns to that new component, which they often don’t, because it’s the last mile and it’s hard. But what’s interesting there is you brought up a new one. You talked about privacy and classification of data. And again, that’s a form of observability. Am I holding credit card numbers? Am I holding licenses? Am I holding people’s names? Again, if we think about observability as a way of getting told by the machine that something happened at scale, because I can’t go look at it: I’m not gonna go scan every column visually and go, that looks like a name. I want the machine to do it. So we’ve got to add privacy and data protection, observing what’s happening in that space, into our observability bucket, if we’re saying we want observability of the whole system, don’t we?

Raj Joseph: I think you’re 100% right. Because when you look at observability, it’s more of a blanket, catch-all term. And then you add data, and it’s still very generic, very broad. But the more context you put on top of it, and the context could be privacy, or cost, or usage analytics, or health, or reliability as the data moves through the pipeline, or the business, the more you have lots of different sub-branches that grow from it, each becoming its own division. Privacy regulation is itself a bigger, broader industry, and you can use observability as a capability there. So the application and the impact are broad. But I think where it will become more and more meaningful is taking one subject like data quality or business quality, and then trying to see how you can leverage observability as a capability within the constraints of what you’re trying to do for different personas: a leader from a business consumption standpoint, a data scientist or data analyst, and then also a data engineer. Looking at those three different roles and how they see data quality, and then mapping observability to that, would be really critical. And that’s the same for privacy and classification too.

Shane Gibson: Again, I think it’s coming back to the idea of persona. Depending on the persona, say a consumer versus an engineer, I want to observe some different things. I need different outputs to tell me whether I can trust it: can I trust that my DAG ran, versus can I trust the number you’re giving me that tells me how many of those we have. And when we come to that, we start talking about language. So we think about data freshness: is the data up to date? From a technology point of view, we’ve always talked about SLAs and SLOs, service level agreements and service level objectives. But you talk to a business user, somebody that’s not in the technology domain, about those terms, and they just look at you blankly: an SL what? But if you talk about freshness, is the data fresh or is it stale, they kind of get that. Again, I always come back to baking and cooking for some reason, but my bread is fresh, my bread is stale. One is good and one is bad. And so they have expectations about how fresh it should be. Again, it’s about use. Some data doesn’t have to be fresh every day; once a week is fine. Some data has to be fresh today. And so if we think about that terminology, that language we use, we’ve seen a new term come out last year and this year, and it’s all hot, and maybe buzzwashing, maybe not: data contracts. Whenever I hear a new term come out, I look at it and I go, what are the underlying patterns? What’s the context where I can apply that pattern? And what action or outcome does that pattern give me? So if I think about data contracts, if I think about something simple like schema mutation: I have a table coming from a source system and it has 10 columns. My contract has 10 columns, and an 11th column turns up, so that contract has been broken. So a test or a pattern for observing that the number of columns we’re expecting has changed is valuable.
Again, if I’ve only got one table coming in, I can go and eyeball it. If I’ve got 1000 tables coming in, I can’t. So the machine needs to observe that for me. But that’s been around for ages. Schema mutation is not new. Then we go on to some other patterns around it that go into the context of the data. Have the data types changed, has the volume of the data changed, has the relationship between two columns changed? So we start getting to quality tests in my head. And then there’s that Nirvana one, where I actually have a contract. I have a policy that’s written in code, and both sides are complying with it. It’s actually driving everything we do. And I haven’t seen anybody actually do that yet. But from your point of view, data contracts: first question, buzzwashing or not? And then the second question, where’s the value? What are the patterns?
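The schema mutation check described in this turn (an 11th column turning up, or a data type changing) can be sketched as a comparison between an expected schema and the one that actually arrived. This is an illustrative sketch only: the function name `check_schema_contract` and the representation of a schema as a dictionary of column name to type name are assumptions, not a reference to any data contract standard or tool.

```python
def check_schema_contract(expected, actual):
    """Compare an arriving schema against the agreed one.

    expected, actual: dicts mapping column name -> type name.
    Returns added, removed and retyped columns, plus an overall flag.
    """
    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    retyped = sorted(
        col for col in set(expected) & set(actual)
        if expected[col] != actual[col]
    )
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "ok": not (added or removed or retyped),
    }
```

Run against 1000 incoming tables, a check like this is the machine doing the eyeballing; whether an added column should fail the load or only raise a notification is a policy choice the contract itself would need to state.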

Raj Joseph: Every innovation or every new principle that has come along has been a buzz at some point in time. And if you look, the cycle just goes up, then it goes through this maturity lifecycle, and then it kind of settles: this was hype, and this actually has some value to it. So I think there is a lot of value that we can get in terms of data contracts. I mean, if you really look at what’s happening, software observability is now turning into data observability, service level agreements are becoming data contracts or data level agreements, and the concept of centralized governance is going more towards decentralized, from a data mesh principle standpoint. Also, the self-service data platform as a concept is about how you share the knowledge held by the SMEs, versus trying to consolidate it and have business owners carry that, because it’s too much responsibility to ask of a business owner when they are not even the ones who are producing the data. So all of this, in theory and in the way it is constructed, makes a lot of sense. I don’t think it’s a buzzword, but where an organization, or a user of a platform built on these concepts, really struggles is: how do we implement it? Because these are all newer concepts. Even observability is still new and emerging, because it hasn’t reached its maturity in terms of how well it can be used, because you have alert fatigue. If you turn on observability across your whole ecosystem, it’s gonna just look for anomalies everywhere; some may be meaningful, some may not be. And as far as I know, when you get like 1000 emails, most of the time you’re not going to be looking into them. So there is a maturity component that sits with that.
And it’s the same thing with data contracts. I feel there is definitely value, because now, if you look into data quality, or data observability, or any of those things, both the business and engineering, which is data engineers and data scientists, or even the business leaders, need to collaborate and they have to see it in some kind of common way. And I see data contracts as that. It gives them a way to meaningfully commit, both from an engineering standpoint and from the business saying, “This is my expectation.” Otherwise, what happens in the life of a data engineer is that they think today it’s great data, and then tomorrow you will hear from the data scientists, “This sucks,” because they need changes. So that kind of allows them to bridge the gap, promote more collaboration, and be more productive. It can also tie back to the business impact you started on earlier, driving towards a business outcome. But the bigger, broader question is, what are the categories of data contracts? Again, the implications are broad, because it could be privacy: I can have a system where the data needs to be masked. It could also be from a cost standpoint, or it could be from a reliability and accuracy standpoint, from data quality still.

Shane Gibson: Yeah, that’s a good point, actually: the contract can include many things, and it should, and then you have to decide what you’re gonna do about it. So you could have a contract with a data provider that they won’t send you names because you’re storing them in a different way. So you need to monitor that the contract’s not broken, and then you need to decide what to do if it is broken. One of the things I love about doing this podcast is that as I talk to people, something just goes bing in my head, and that just happened. So people have talked about data products for a year or two, and I’ve worked for the last 10 years around this idea of an Information Product. And the way I describe it is, we sometimes produce data, depending on the persona of the consumer. A data scientist typically wants data, not information. But our business consumers typically want information. They don’t want a dump of data. They want to get an answer. And so we think about data contracts, and we think about a data value chain or data value stream, where data is produced on the left hand side by a source system, a machine or a human, we do some stuff to make it useful, and we give it to the person that actually wants to consume that data or that information. The data contracts have all been focused on the left hand side. We focused on our problem, and we haven’t actually focused on an information contract. We haven’t focused on our consumer and what good looks like for them. There’s a bunch of tests, but there’s no contract. And then if we bring that lens of cost into it, that’s just freakin awesome, because now what we can say is, “Look, here’s the base contract. Data won’t have any nulls, and it will turn up once a week.” If you want to increase the level, if you want us to do more testing around it, or you want us to get it to you faster, then the cost of that contract is going to go up.
So there you go. I reckon for 2023 we just need to trademark this: information contracts, as important as data contracts, because we should actually have a contract with our customer or consumer. The things we do to make that data better quality, to make it cheaper to produce, to identify the anomalies, to make sure it meets the freshness requirements, to ensure that it’s complying with the privacy rules that we have, each one of those has a cost, both in human terms for us to build, implement, and monitor it, and in system costs in terms of executing the observability engine. So, for me, that’s information contracts. There we go, heard it here first.

Raj Joseph: Definitely. I mean, along the same lines, an information contract could also have sharing agreements in terms of who accesses this data and things like that, from a usage and accessibility standpoint. There are so many choices in today’s technology. I mean, if an organization is starting today as a new business, it’s easy for them to adopt some of the technologies faster. When you’re small, and you have data which is created by your own applications, you have a higher degree of quality. But as you grow, your data is going to be coming from outside partners, suppliers, and systems that you don’t necessarily manage. And now you have what they call inorganic data. But it’s very essential for your reporting, analytics, and whatever you’re doing in terms of your business outcomes. And I think that’s where all those challenges come in. Because in a real environment, your ecosystem has so many different technologies, with different nuances to them: data coming internally, and data coming from different stakeholders, which you don’t necessarily control. So I always take the approach of don’t overcomplicate things. Start simple, be pragmatic by putting a process in place, and the process needs to be very simple, because I think when it comes to people, process, and technology, technology is the easiest piece, process is somewhere in between, and people are the most challenging piece. Humans are really nasty. I’m still there. So I always take the notion of trying to keep the process simple, something that people can understand relatively easily. Start slow: first you have something really simple, like maybe a metadata level agreement, and then you expand it to what you’re describing as information, and even within information that could be one specific metric that you really look at signing up to. And then third, it could be cost, and then a fourth.
So if you have a plan like that, then you’ll feel good about some of these other options. If you try to do everything all at once, it’s kind of like a never-ending project, and you will not be able to measure the milestones of success, is what I’d say.

Shane Gibson: And again, you’re going to boil the ocean. So I actually have a different view on first party data versus third party data. And my view is that data we acquire outside our organization that third party data, often is of better quality, because we’re typically paying for it. So the provider who’s sending it to us actually cares about the quality because we won’t pay them if it starts being crap. Our first party data is typically managed by our people, and our people get paid regardless. So the quality is not that important often. They don’t get fired for typing the wrong data in the wrong box. The classic one is, you have a system where you have to type in date of birth, and the system allows you to put in 1900, even though we know there’s nobody alive at that date anymore. And therefore, when people were busy, when they’re trying to get through those stupid screens, and they’ve got the customer on the phone, they just whack in 1900. Because it gets them to the end of their goal really quickly and they don’t get fired for it. We might find it in there, somebody will go ape about it later. But now we’ve got to retrospectively go in, and the person who let that field be entered that way. And the person who entered the data, there’s no consequences to them apart from maybe a telling off. So again, I find that first party data often is of less quality than stuff we are buying. And then we come back to that idea of boiling the ocean which, again, you just talked about really well. And I take it to the next level of signal versus noise. So we started off, we’ve got 10 systems coming in, we bring in some observability capability, we build some profiling rules, we build some quality tests, we do a little bit of cost modeling to make sure we know when things are gonna go a bit weird. We do a little bit of anomaly detection, and maybe a little regression on our load stats to go, these tables look like they’re outside the norm. 
We bring in some freshness monitoring: we’ll just tag each of the tables with how often they should be updated, and we’ll monitor that. And we bring in some data loss protection, automated privacy, stuff that flags data as high risk, low risk, medium risk for being a name. So now what we’ve got is 30, 40, 50 things alerting me. It’s doing an essential thing. So I’ve got 100 high quality, high risk tables that we’ve said, “Look, that’s actually what goes into our board reports”. So those 100 tables are the key tables we care about. I get up in the morning, I go and look at my little dashboard, and I’ve been notified of 36,000 alerts. What do you do? How do you deal with that signal versus noise?
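The freshness pattern from this turn (tag each table with how often it should be updated, then monitor that) can be sketched as below, along with one crude way to cut noise by only alerting on tables flagged as key. It is a minimal sketch, not a description of any particular tool: the function `stale_tables`, its parameters, and the table names and timestamps are all assumptions made up for the example.

```python
from datetime import datetime, timedelta


def stale_tables(expected_interval, last_loaded, now, critical=None):
    """Return sorted names of tables whose last load is older than agreed.

    expected_interval: table name -> timedelta (how often it should refresh)
    last_loaded: table name -> datetime of the most recent successful load
    critical: optional set of table names; if given, only those are reported
    """
    stale = []
    for table, interval in expected_interval.items():
        if critical is not None and table not in critical:
            continue  # cut noise: only report tables someone flagged as key
        loaded_at = last_loaded.get(table)
        # A table that has never loaded is treated as stale.
        if loaded_at is None or now - loaded_at > interval:
            stale.append(table)
    return sorted(stale)


# Illustrative data only: made-up tables, intervals, and load times.
expected_interval = {
    "orders": timedelta(days=1),
    "customers": timedelta(days=7),
    "web_clicks": timedelta(hours=1),
}
last_loaded = {
    "orders": datetime(2023, 3, 14, 6, 0),
    "customers": datetime(2023, 3, 12, 6, 0),
    "web_clicks": datetime(2023, 3, 16, 5, 0),
}
now = datetime(2023, 3, 16, 9, 0)

print(stale_tables(expected_interval, last_loaded, now))
print(stale_tables(expected_interval, last_loaded, now, critical={"orders", "customers"}))
```

Filtering on a `critical` set is only one possible answer to the signal-versus-noise question; it trades coverage for attention, which is the trade-off the conversation goes on to discuss.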

Raj Joseph: I mean, you hit the point on alert fatigue. I think that’s where I would say, “Don’t overcomplicate things; you don’t need to observe everything at the start.” As you observe, or as you measure quality, or as you start looking into privacy, compliance, and things like that, I think if you can build it out for one application, or some kind of specific context within your bigger ecosystem, and then build confidence, that’s much easier to observe, and then you train the people to look and make it sharper. So I would say it’s more a matter of strategy, of how it is deployed. It’s not the limitations of the tools and technologies; it’s just, like, what people have decided is being shown on the front of the dashboards. And if that is not scaling, even within one application context, screw it. Where do you use it? I mean, you’re gonna be spending hours and hours. Most of the time what happens is they buy all these toolsets, they don’t use them, and they’re just sitting there. I mean, this is very common in banks. If you go to a bigger bank, they pretty much have all different types of tools, because different LOBs want different products, different things. And then you pretty much have, like, 10 different catalogs and different quality tools, each one doing different things, and the value is always questioned at the end of the day, because they don’t necessarily start off with a focused approach. So that’s what I would say: I don’t even want to start with 30 metrics, as opposed to 5 metrics. If you can measure five metrics in an application in a meaningful way and build confidence, now you can replicate that same mechanism across multiple applications and across the whole organization. Now you go and add another one. So those would be the steps and the process that I would follow.

Shane Gibson: I think it’s not a solved problem yet. Again, if we think about the end to end lifecycle, it’s always interesting for me that if we have a dashboard or machine learning model recommendation in Salesforce, it’s virtually impossible for the consumer of that piece of information to understand the quality of the data it came from, because the quality information, the observability information, sits in another system. So they have to go and look somewhere else, or they have to go and log in somewhere else; we don’t embed it right next to it. And as you talked about that organization that had level one, level two, level three, I remember, must be 20 years ago, we were doing a project, and we actually classified the end user reports into three colors. We had bronze, silver, gold, and that was all based around the process we used. So the gold reports were the ones that had gone through, this was back in waterfall days, our extremely long, extremely painful, 6-12 month development lifecycle. But we inferred that because it went through that process and went through so many hands, the data in that report was of higher quality. The silver was where it went through some kind of ad hoc process, but there was a peer review. There was a community review by subject matter experts. And then the bronze one was just an analyst who hacked some stuff together and pumped it out. And so the end consumer could look at it and have a feeling of quality. But it was all based around the process. It wasn’t around observability. It wasn’t around the data. So again, I think it’s about starting with something simple and saying, how do we actually give confidence to the person consuming that information or that data around that observability? And coming back to those key things that we talked about. So we talked about the quality of the data: can we trust it?
We talked about observing that. We talked about observing the cost of that data and the platform and what we’ve done. We talked a little bit about detecting anomalies: how does the observability tell us that something looks weird? We talked about freshness: has it been refreshed as I would expect or as I need? And then we talked about privacy: is the data meeting our privacy rules? So again, we’ve got a couple of extra ones between when we talked at the beginning and 15 minutes later, talking about the end. From your point of view, anything we missed? Any patterns where you go, “Oh, that’s the one that’s my goal”?

Raj Joseph: I think there are certain aspects which observability does not cover. I mean, for example, when you’re talking about different applications, and you touched on it a little bit, there are foreign keys and joins, and then checking the data integrity. I mean, you can do the count, and you can do an anomaly check, and you can do some manual thresholds and things like that, but it can get to where you need to look into the data itself. I see something is happening, but then I have to go look into the data. And now it’s looking across multiple applications and data, and then trying to do more business process checks, which kind of gets into process things. So I think there are some more developments that could happen, definitely, as the industry matures. I think there are things like root cause analysis, and trying to look into observability from a process standpoint using data, where some things could happen too. So these are areas we are looking into, trying to see whether we can do something at the metadata level, and do something to reduce that noise so the signals are more valid. And then we can also provide some guidance to businesses in terms of starting slow, crawl, walk, run, versus trying to do everything: some limited SLAs at a higher level from a data contract standpoint, and then probably this federated governance or decentralized ownership from a data mesh principle standpoint. But I’m super excited, because cloud hasn’t yet seen its full peak; we are just at the start of it. We are already seeing the ChatGPT evolution, and we are still struggling with governance and data management. As your data grows, your data management needs to be even sharper and more agile, which means innovation is waiting and there is more to happen. So it’s kind of fascinating to be in this age just to witness all of this stuff. So more to come, I guess.
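The checks Raj lists here (a count, an anomaly check, a manual threshold) can be combined in a small function. The sketch below uses a z-score against recent row counts as the anomaly test; that specific technique, the function name `is_volume_anomaly`, and the default threshold of 3 are assumptions chosen for illustration, not something either speaker specified.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, today, z_threshold=3.0, min_rows=None):
    """Flag today's row count if it breaks a manual floor or is far from the norm.

    history: recent daily row counts for the same table
    today: the row count observed in the latest load
    z_threshold: how many standard deviations counts as "weird" (an assumption)
    min_rows: optional manual threshold; anything below it is always flagged
    """
    if min_rows is not None and today < min_rows:
        return True
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat history: any change is unusual
    return abs(today - mu) / sigma > z_threshold


# Illustrative data only: made-up row counts for one table.
history = [1000, 1020, 980, 1010, 990]
print(is_volume_anomaly(history, 1005))
print(is_volume_anomaly(history, 400))
print(is_volume_anomaly(history, 995, min_rows=999))
```

A check like this only says that something looks unusual in the metadata; as Raj notes, someone still has to go and look into the data itself to find out why.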

Shane Gibson: Yes, I agree. I think the key for observability right now is there are a large number of patterns that are being shared, these patterns that we’ve had for 20 years, and there’s a whole lot of new patterns, that technology and process and people have enabled. So if you’re starting your journey, don’t worry about the technology yet. Worry about the people and the process, find some patterns you think may have value, try them out in your organization with your context and say, “Are they valuable? Are they worth the cost of implementing from a technology point of view?” And then once you start getting into a scaling problem, the machine needs to do the work. Once you start getting too many of those things, you’ve got to deal with the signal versus the noise. And that’s the next problem to solve. First day, just make sure you check for nulls.

Raj Joseph: It’s simple what other problems to worry, like do you need to call a data engineer or do you need to call the pipeline, or look into the pipeline?

Shane Gibson: Nulls always bite you in the bum. I’d love somebody to do an actual research survey, using data from lots of customers, around what the number one observability tests are.

Raj Joseph: When you go into this data, sometimes it’s very funny. Engineers, as much as they are super smart, sometimes it’s just the way it’s being constructed. Sometimes you may have SSN, or some security numbers, which are in a VARCHAR, like text. Just because it has hyphens, they think that makes it text. So I have seen in an organization the same field, SSN, sometimes in numeric without hyphens, sometimes in VARCHAR text with hyphens. And sometimes it even has spaces, because some application pads it with spaces and just puts it in for whatever reason. And now you have trailing and leading spaces across the data. And so now you’re kind of having all these data quality issues. So nulls, blanks, some basic checks will go a long way, is what I can offer you.
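The basic checks Raj recommends (nulls, blanks, stray spaces, and the same field stored with and without hyphens) can be profiled with a few lines of Python. This is a minimal sketch under assumptions: it treats the field as a US-style SSN of nine digits, and the function name `profile_ssn` and the sample values are made up for the example.

```python
import re

# Assumption for this sketch: a valid value is exactly nine digits once
# hyphens and surrounding spaces are removed (US-style SSN).
SSN_DIGITS = re.compile(r"\d{9}")


def profile_ssn(values):
    """Count basic quality problems in a column of SSN-like values."""
    counts = {"null": 0, "blank": 0, "padded": 0, "hyphenated": 0, "invalid": 0, "ok": 0}
    for value in values:
        if value is None:
            counts["null"] += 1
            continue
        text = str(value)  # numeric and text representations are compared alike
        if text.strip() == "":
            counts["blank"] += 1
            continue
        if text != text.strip():
            counts["padded"] += 1  # leading or trailing spaces
        if "-" in text:
            counts["hyphenated"] += 1
        digits = text.strip().replace("-", "")
        if SSN_DIGITS.fullmatch(digits):
            counts["ok"] += 1
        else:
            counts["invalid"] += 1
    return counts


# Illustrative data only: one field, stored several different ways.
sample = ["123-45-6789", 123456789, " 123456789 ", None, "", "12-345"]
print(profile_ssn(sample))
```

Note that a value can be counted as both `padded` or `hyphenated` and `ok`: the point is to surface the inconsistent representations, not only the outright invalid values.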

Shane Gibson: And I think it comes back to where we are at the moment. Observability is the first sieve: the machine telling you to look over there. And then the second sieve is still a human; we haven’t automated that yet. A human still has to go and look at it and observe what’s happening at the next level and go, “I can see the problem. I know how to fix it.” Maybe ChatGPT will give us that, but I don’t think it will.

Raj Joseph: One day you’re gonna ask ChatGPT the question, “Is my data reliable and accurate?”

Shane Gibson: I think once you ask it for the bio of somebody and it actually doesn’t lie to you. What do they call it, hallucinate? It basically makes stuff up. And there’s lots of cases coming out now where people go, “Tell me my bio.” And it’s like, “You went to Harvard.” And they’re like, “No, I didn’t.” So once that problem is solved, maybe we can trust ChatGPT to observe our data for us and tell us what’s going wrong. It’s an exciting time. We’ve got some cool technologies that hopefully people and process will catch up to, but it’s people and process first.

Raj Joseph: That’s funny, I just went and typed that into ChatGPT: “Is my data reliable and accurate?”

Shane Gibson: How to tell if you’re a geek used to be the t-shirt you wore, the TV show you watched. Now how to tell you’re a geek is that your other browser window that’s permanently open is ChatGPT.

Raj Joseph: It said, “As an AI language model, I don’t have access to your data.”

Shane Gibson: Well, that’s the other vanity metric, isn’t it? If ChatGPT can actually do a bio for you, even though it’s wrong, it still means you’re more famous than most of us. That’s been great. I think we’ve gotten a little bit more into observability, into areas that I hadn’t thought about. I come back to the idea of information contracts. You and I came up with it here. I’ve still got to ChatGPT it in a minute to make sure it’s not there anyway. So if people wanted to get hold of you, and they wanted to find you without asking ChatGPT, what’s the best way for people to get in touch with you and what you do?

Raj Joseph: Email, raj@dqlabs.ai, or you can just reach out on LinkedIn. Again, Raj Joseph, and that’s my profile on LinkedIn. So either one works. I also have a phone number in the LinkedIn profile, so you can text me.

Raj Joseph: I think LinkedIn and email are the best mediums to reach out. After that, definitely, I’m always happy to talk, because there is so much innovation and so many ideas and things that come to the surface every minute and every moment. So happy to entertain.

Shane Gibson: That’s great. Thanks for the time. Thanks for sharing some patterns. And thanks for chatting about all things observability. We’ll catch you all later.

Raj Joseph: Thank you so much, and thanks for having me. I appreciate it.