Knowledge Graphs with Juan Sequeda

Dec 13, 2023 | AgileData Podcast, Podcast

Shane Gibson and Juan Sequeda discuss Knowledge Graphs.

 

They discuss:

 

  • Career Journey: Juan shares his journey from a computer science teacher to becoming a principal scientist, with a focus on semantic web and knowledge graphs.
  • Knowledge Graphs Explained: Knowledge graphs represent collections of real-world concepts and their interrelationships in graph form. They are used for modeling domains like e-commerce, linking concepts like orders, customers, and products.
  • Semantic Web and Data Integration: The significance of the semantic web in the evolution of the internet and its role in data integration. Highlighting the standardisation of semantic web technologies, including RDF, OWL, and SPARQL, and their role in modern technology and data representation.
  • Significance of Semantics in Data: The critical role of semantics in data interpretation and integration, illustrating how different interpretations of the term ‘Paris Hilton’ demonstrate the need for semantic clarity.
  • Data Virtualisation and Federation: Semantic virtualisation and federation, allowing for the translation of graph model queries over SQL sources. 
  • Challenges in Adopting Knowledge Graphs: The barriers to adopting knowledge graphs in enterprises, focusing on the need for business literacy among technologists and the importance of understanding the business context. 
  • The ‘Crawl, Walk, Run’ Approach to Data Management: A phased approach to understanding and managing enterprise data, starting with metadata cataloging and gradually advancing to more complex interpretations and access.
  • Ontologies and Their Role in Knowledge Representation: The concept of ontologies as a means of representing knowledge, detailing how they encompass classes, concepts, and relationships.
  • Use Cases for Knowledge Graphs: When to use knowledge graphs – particularly for applications requiring flexibility, resilience, and integration of diverse data and metadata sources.
  • Future of Knowledge Graphs with LLMs: The impact of Large Language Models (LLMs) on knowledge graphs, predicting a significant increase in their adoption and usefulness in enterprises.

Listen on your favourite Podcast Platform

| Apple Podcast | Spotify | YouTube | Amazon Audible | TuneIn | iHeartRadio | PlayerFM | Listen Notes | Podchaser | Deezer | Podcast Addict |

Recommended Books

Data Storytelling

Podcast Transcript

Read along you will

Shane: Welcome to the Agile Data Podcast. I’m Shane Gibson.

Juan: Hi Shane, I’m Juan Sequeda. I’m the principal scientist and the head of the AI lab at data.world. It’s a pleasure to be here.

Shane: Hey Juan, today we’re going to discuss the pattern that is knowledge graphs and, to be fair, I think I know what a knowledge graph is. But then I don’t. I’m really looking forward to this deep dive, picking your big brain of knowledge graphiness, to figure out what it is, why we use them, and what all these terms mean.

But before we do that, why don’t you give us a bit of background about yourself and how you got into the wonderful world of knowledge graphs and data?

Juan: Going back, before I got involved in knowledge graphs I was actually a teacher, probably, oh gosh, almost 20 years ago now, probably 18. So my background, I’m a computer scientist. I did my undergraduate and my PhD in computer science at the University of Texas at Austin.

But early on I got involved in this whole area called the semantic web. It was the early vision of how the web was evolving. And for me, one of the most interesting aspects I learned was, hey, what does the semantics part actually mean? So the example was search. Think about 2005, how the web was at that time.

Search was the main thing we were doing. I would look at Google, and the example was, if you’re searching for Paris Hilton, what do you mean? Do you mean the person, Paris Hilton, or do you mean the Hilton hotel in the city of Paris? And which city called Paris? That for me was my aha moment. I was like, oh, there it is: syntax, semantics and search.

And then this is data coming from multiple places, which falls into the whole area of data integration. So that’s how I got into it. As an undergraduate I found it super interesting. I had a company at that time, just doing data stuff, and I realized, oh, semantics, the meaning of things, this is a big gap.

That ended up being my research. Where I decided to go do my PhD was really on figuring out how to integrate all these different technologies. You have your relational databases, which have existed since the seventies.

And then you have all these technologies that were, at that time, being standardized by the W3C, the semantic web standards specifically: RDF, which is a graph data model, OWL, which is an ontology language where you define semantics, and SPARQL, which is a graph query language. And note that these are just modern manifestations; being able to define semantics and knowledge representation is something we’ve been doing in computer science for decades and decades.

So this isn’t new. I was just trying to understand, what is the relationship between these modern technologies? You have relational databases and these semantic web technologies. And that was my PhD. From there we did a lot of theory and systems work.

Basically one of the deliverables of my PhD was being able to define a semantic layer, an ontology, and being able to create mappings from your sources to your target, right? Mappings, transformations, rules, and then being able to do virtualization to take a query in terms of the target.

In this case the target was a graph model, and being able to translate that over your source model, which was SQL. So we did all that semantic virtualization and federation work, and that became a company I started back around 2014 out of my PhD, a company called Capsenta. We did that early on, before the modern data stack.

This was before dbt or Airflow or anything like that. Long story short, I’ve known the folks at data.world for a long time, and it was a match made in heaven: right vision, right timing. I sold my previous company, Capsenta, to data.world over four years ago.

We’re on that mission to help organize the enterprise’s data. In the structured world you have all these SQL databases, so it’s all about first bringing in your metadata and cataloging that. And eventually, what I call the crawl, walk, run is: understand what you have, then understand what these things mean, and then be able to go access the data in terms of that meaning, which is how end users in the organization talk about it. Anyway, that was a very long answer.

Shane: Hehehe… Now, good answer, good journey. And I think… yeah, I don’t think we can talk about knowledge graphs without talking about semantics and semantic layers, but maybe we’ll get to semantics later, because that’s quite a deep discussion at the moment, especially in the data world where it’s all hot. But let’s go back to knowledge graphs.

So if you pretend you’re ChatGPT, explain to an eight year old what a knowledge graph is, and I won’t get you to do it in the style of a pirate. You can do it in any style you want.

Juan: All right. Look, the way I like to put this is that a knowledge graph represents a collection of real world concepts and the relationships between those real world concepts, and they happen to be in the form of a graph. So the way to think about this is real world concepts.

Let’s think about a domain, the e-commerce domain. You have orders, you have customers, you have products and so forth. You have addresses, billing. Those are the real world concepts, and you start making these relationships between them. Oh, a customer places an order.

An order consists of a set of order lines. A line has products. So effectively this is your enterprise data model, your ER schema and stuff like that. So that’s one thing: you start defining these concepts, and this is where the knowledge part comes in, the meaning, right?

These concepts and the relationships between the concepts are first class citizens. This actually encodes what that knowledge is, and we can get much more expressive, right? A customer can place multiple orders. An order is shipped to exactly one address, and so forth. And then the data part, the graph part, is that you now start bringing in the structure of the data, which is also a graph.

If you think about it, in your relational world things are defined in a way to model an application. Here we’re modeling the data so it represents that domain, the real world concepts that end users actually think about. Bottom line, when you start thinking about a domain, something you want to go model, you end up going to the whiteboard.

You draw bubbles and lines. It’s already a graph. We naturally think about it this way. So we’re just making this as a first class citizen. It’s really a way to integrate data and knowledge at scale. And why it’s a graph is because it’s built on the whole premise of the web, because the web itself is a graph.

Everything is linked.
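
To make Juan’s e-commerce example concrete, here is a minimal sketch of those concepts, relationships and instances as RDF triples, using Python’s rdflib. The ex: namespace and the class and property names are illustrative assumptions, not anything from the episode.

```python
# Minimal sketch of the order/customer/product example as a knowledge graph.
# Requires: pip install rdflib. The ex: vocabulary is made up for illustration.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.com/shop/")
g = Graph()
g.bind("ex", EX)

# The concepts (classes) are first class citizens...
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.Order, RDF.type, RDFS.Class))
g.add((EX.Product, RDF.type, RDFS.Class))

# ...and so are the relationships between them.
g.add((EX.places, RDFS.domain, EX.Customer))
g.add((EX.places, RDFS.range, EX.Order))
g.add((EX.contains, RDFS.domain, EX.Order))
g.add((EX.contains, RDFS.range, EX.Product))

# Instance data lives in the same graph, using the same vocabulary.
g.add((EX.alice, RDF.type, EX.Customer))
g.add((EX.order42, RDF.type, EX.Order))
g.add((EX.alice, EX.places, EX.order42))
g.add((EX.alice, RDFS.label, Literal("Alice")))

print(g.serialize(format="turtle"))
```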

Shane: Okay, so let’s unpick that a bit. When I’m coaching data and analytics teams, one of the exercises I do is based on a great TED talk called How to Make Toast. And the way it goes is you give them a blank piece of paper and you go, draw how you make toast.

And what they end up doing is they end up drawing a bunch of pictures, because no words are allowed, so they draw a bunch of pictures and a bunch of lines and some of them have lots of steps, and some of them have a few steps, and some of them are funny, and some of them are well drawn.

And it all goes back to this how to make toast TED talk, and this idea of nodes and links. And so what we do is we then take that process, and we say, okay, how does this work in your day job, right? What’s the value stream you do? What do you do with data? Who does what? What’s a node and what’s a link?

And so for me, a node’s a dot. It’s a circle. It’s a thing that’s done. And a link is a relationship to the next step. And when I think about the way you describe graphs and knowledge graphs, what you’re saying is we have some core business concepts, customer, product, employee, those kinds of things, and they’re circles,

they’re nodes. And then we’re linking them to something else to say there’s a relationship. And the key thing being, the line itself is a first class citizen, where in the data world, typically we go, I’ve got a table and a table, and the line actually is just a blob of code. It’s got many first class citizens embedded in that code: I’m conforming customer records, I’m cleansing addresses, I’m doing a whole lot of nasty shit to that data, and it’s all encapsulated in this blobby bit of…

crappy code in the middle of the line, and it has no first class citizen. It’s a bunch of things. It’s a lego bucket of stuff.

Juan: And to add to that is a lot of the semantics, a lot of the knowledge is embedded in the application code, so it’s not connected to the data. So the software that uses that data, in that software, a human being took knowledge, took meaning of something and they coded it in Java and Python or whatever.

And that is disconnected from the data that’s being stored. So these things are disconnected. If I show you the data by itself, you don’t know what it means. If I show you the code, you’re like, okay, I think I know what it means, but I don’t see the data. So I need to put these two things together.

You want these two things to ideally come together. And this is a mindset shift, because we don’t think about it that way. We’ve traditionally grown up to say, yeah, you have a database where you store your data and then you write code, and the code implements all these requirements, but all these requirements have clear semantics, meaning.

So what happens is the next application comes along, and then what do you do? You reinvent, you add the semantics again, and you do that over and over again. And are you interpreting it the same way? I don’t know. So the idea is that you connect all this knowledge and data together, such that knowledge and data together are a first class citizen, independent of the application.

Applications will come and go for the rest of our lives. The data and the meaning of it, how your company is organized, that stuff should be first class citizens and they should live together, not dependent on a particular application.

Shane: Let me give an example then and see how this plays out. In our product, what we do is our lines, our code, is effectively a rules engine, and we use a pattern from Gherkin, the testing language, which goes: given this data, then do something.

So let’s say somebody’s done that horrible bloody thing in an application where they’ve built a model of a thing is a thing of a thing. So I have a table of things, and I have another table of thing types, and I’ve got to connect the thing table to the thing type table to figure out

whether it’s a customer, an employee, a supplier. So a table of everybody in the world and then a typing table, which is really efficient for an operational system but incredibly shitty to use from an analytics point of view. So what we do is we’d say: grab that table called thing, so that’s the given, this table called thing and these IDs, then filter where type equals customer,

right, and then populate your customer entity, your customer node. So what I’ve got now is two circles and a line. Circle one is table thing, circle two is table customer, and in the middle is my blobby bit of code that says filter on thing type equals customer. But that logic is just a bunch of code.

And so what you’re saying is we bring out this first class citizen of customer, and we probably bring out this thing of type, and rather than being embedded in the code it’s stored as an object with a relationship. And then ideally the code’s talking to that semantic model, that graph model, and using that as config or data rather than encapsulating it in the code itself,

so now I have something that’s been described. I’m calling that thing in the context that it has, and using it, rather than just embedding it in the code with no context, no semantics, no understanding, apart from the person who wrote it. Is that right?

Juan: This is exactly it. And this is what these languages, these W3C standards like OWL, the Web Ontology Language, are made to do. You have ways to express all these types of semantics, meaning: hey, this is a class, this is a property, the property has a domain and has a range. The domain can be a class, it could be a union of classes, it can be an intersection of things, right?

You can get very expressive around this stuff, and you may not need all that expressiveness, but you have that power there. So at the end, all the semantics, the meaning, just become codified. At the end of the day, this is all metadata.
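
As a rough illustration of the expressiveness Juan mentions, here is a small OWL snippet in Turtle, parsed with rdflib. Only the owl: and rdfs: terms are from the W3C standards; the ex: vocabulary and the “shipped to exactly one address” rule are illustrative assumptions.

```python
# A hedged sketch of OWL expressiveness: classes, a property with domain
# and range, and a cardinality restriction. The ex: names are made up.
from rdflib import Graph

ontology = """
@prefix ex:   <http://example.com/shop/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:Customer a owl:Class .
ex:Order    a owl:Class .
ex:Address  a owl:Class .

# 'places' goes from a Customer to an Order.
ex:places a owl:ObjectProperty ;
    rdfs:domain ex:Customer ;
    rdfs:range  ex:Order .

# An order is shipped to exactly one address.
ex:Order rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ex:shippedTo ;
    owl:cardinality "1"^^xsd:nonNegativeInteger
] .
"""

g = Graph()
g.parse(data=ontology, format="turtle")
print(f"{len(g)} triples loaded")
```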

Shane: So that changes the way we work though, because what we’re effectively doing is going and populating some form of, let’s call it a dictionary: a list of words, and a list of contexts, and a list of relationships. And then the code’s using that, ideally without us. And that’s quite different.

It’s the difference between deterministic and probabilistic, I always get those two mixed up. But what we’re saying is, rather than me going in and being very deterministic in terms of this is the answer, what I’m doing is saying here’s a bunch of things, you tell me what the answer is, and that’s a completely different way of working.

And I have seen some tools attempt that over the years, to be that transformation code engine based on that way of working. But what I find is that complexity, that cognition as a human to be able to understand how to define these objects and then just call the relationships to do our transformation of data,

that’s quite a large cognitive load compared to just writing code, writing a set of steps and being able to test those steps. So is that what you’ve seen? Is it the complexity of this way of working in the data space, and is that one of the reasons it hasn’t been well adopted?

Juan: The reason things haven’t been adopted as much as I wish is really human incentives. After a decade looking at this stuff, I’m like, here is the ideal way of doing this, why don’t you do this? It’s incentives. First of all, change is hard.

One of my favorite quotes is from Ludwig Wittgenstein: the limits of my language are the limits of my world. So if I already know how to do things in SQL and Java, I just don’t change. And if you show me something else, I’m going to compare it to that, right?

If I have a hammer, everything’s a nail. So you’re used to these things. Now, what happens is that because you’re used to these things, you are efficient in them. If I have a task, I’m being tasked, to do something. I know how to go do it and we’ve been doing all these things for decades and decades.

So it’s not that it doesn’t work. It does work. But the issue is the amount of continuous work that you have to do again. Oh, here’s a change. Here’s a new application. You do all this work over and over again, but we’re just used to it, and this is why all the silos get created.

You have one system and then we move into another system, but we’re just focusing on our present, on being efficient today, because I am being incentivized to get the job done as fast as possible right now, and I have the tools, so don’t bother me, let me get it done. We’re not being incentivized to see what the implications of doing it this way are.

So I always call this the balance between being efficient and being resilient. We are focused on, and incentivized to be, efficient. We are focused on, and incentivized for, the use cases of today. We are not being incentivized to be resilient, to say, I’m going to do this, it’s probably a little bit more work right now, but it’s going to help me scale later, because I’m not incentivized to think about resiliency.

I’m not incentivized to think about the unknown use cases of tomorrow and how things will change. Why not? Because my job is to get this done in one or two years. I’m not being paid to do this over four to five years, because guess what? I’m probably gonna leave anyway, right? My bonus is tied to the next one or two years.

I don’t care about all these things. So we’re not incentivized towards these things. That’s one aspect, I think, of the reason why thinking about semantics and knowledge graphs hasn’t really picked up as much as I would like. That’s number one.

Number two, the second one, is semantics. Understanding meaning. That means you have to talk to humans. And ooooh, that’s scary. As a technologist, I have a career in technology because I just want to code and do things and I don’t want to talk to people. And this is where I think we’ve hit the wall.

Any new technology that promises to handle semantics without you talking to people, all you’re doing is reinventing the wheel and coming up with a different syntax, right? ELT, ETL, it’s the same stuff, it’s batch and streaming, you’re just moving data. Databases, data lakes, data warehouses, lakehouses, they’re all different types of storage and compute engines.

Creating a new thing is not going to solve this problem. What we really need to focus on is methodologies, understanding the roles of knowledge engineers, of data product managers; it’s more the people and process side. I’m now starting to see hope that this is changing. And I would argue that in the last couple of years the whole data mesh movement has really helped us focus on data product management and getting requirements.

I think that’s a huge step in that direction. So the second thing is it’s a people thing. But I’m extremely hopeful. Because one, people are afraid of the people thing because it’s manual, and guess what? LLMs help us automate a bunch of stuff. I’m not saying automate everything; it’s helping us be much more productive, so that’s gonna help.

And second, all the big tech companies use knowledge graphs. At Google, search, everything is in a knowledge graph; Amazon, Netflix, all of them use knowledge graphs in the backend. So the big companies have already realized it. And LLMs are going to help us accelerate this.

We’re going to start seeing the barriers come down and then realizing, oh yeah. The incentives may not change, but at least things are going to be faster. People are thinking, hey, I can do a much better job now, faster, because I have these LLMs that can help me do it. Okay, that was a lot.

Shane: That was a lot. Let me unpack some of that. So we’ll roll back and we’ll go through it. I agree around speed to market versus resilience, that sugar rush versus the consequences of that sugar rush. And we’ve seen it before.

As humans we love to repeat the same cycle and wonder why bad things happen.

Or we wear the consequences. We’ve seen BI tools come out that were self service. Then we saw something like Business Objects come out with their universe, their form of a semantic layer, a single pattern of a semantic layer rather than the broad brush. But what we saw then was all the reports going and talking to one thing to get that semantic meaning before it was used.

And then we saw Tableau come out, the rise of self service BI and visualization with no universe, no semantic layer. So what we saw was massive enablement, which is a good thing. We want more people to be able to do the work that we used to do. But then we see chaos, because with self service comes chaos.

So we saw thousands of Tableau dashboards, and then again, we start seeing some centralization again. And then we see dbt come out. Again, a good thing. Let’s help people who can write SQL do the data work so they don’t have to use those enterprise data transformation tools that were always hard to use.

Self service, great, democratization, but again, chaos. Now I’ve got 5,000 dbt models, and don’t get me started on a rant about the use of that term as a model. There should be semantic crimes for when you go and do that, when you use a word in a completely different way. You should just get put in prison,

orange suit, you sit in the naughty data prison, and whoever defined a dbt code block, a blob of code, as a model goes to naughty orange data prison in my head. But anyway, so 5,000 blobs of code, and so again we see that with self service comes chaos. That’s one of the problems. The second one is LLMs are going to make it worse.

And here’s a question for you. I constantly get people asking me to use LLMs in our product to auto create the description of a column. And when I talk about AI, I have my standard joke that it used to be called data mining back in the day. 80 percent of data mining is a group by, 5 percent is a regression.

Then maybe we might get on to some k-means and some neural nets and that kind of stuff. But the way I talk about it now is Ask AI, Assisted AI and Automated AI. Ask AI is where a person’s having a conversation with the machine. I have a question, it’s going to give me an answer.

I’ve got another question. I’m going to keep doing that until I’ve got what I think I need. I’m going to go and then do the action. So I’m using my cognition and I’m doing my action. Assisted AI is where I’m doing something and the machine’s watching, and then it’s coming back and making a recommendation.

It’s saying, you’re working with that data, and based on that data I think this is the business key. I don’t know for sure, but I’m hinting. Or, I’m looking at this data you’re working with and I think that’s a bunch of people’s names, you may want to mask that or secure it. So the machine’s looking, and it’s assisting and recommending stuff that I may or may not need to do.

And the third one’s Automated AI, where the human doesn’t do anything. The machine just completely runs it; the machine is doing the work for us, the human’s never involved, there’s no feedback loop.

When I think about LLMs on column descriptions, what worries me is I run the LLM, it looks at the data in the column, it creates a description for that column, and I do that for 70,000 SAP columns. Now, our consumers are going to come in and see that description, and they’re going to see there’s a description there, and it will be well worded,

because that’s what LLMs do. They give you a sentence that looks like a bloody good sentence. So they’re going to trust that actually is the description of that field. But no human’s been involved. There’s been no review or feedback loop, and so my view is maybe we can use the LLMs as a recommendation engine, where the human can look at it and say, yeah, that’s good, I’m happy with that, approve.

But then we fall into human behavior, which is tick, tick, tick, speed over resilience. So they’re going to look at it and go, yeah, job done, because there are no consequences for them if that semantic meaning is wrong. So I think that standard pattern of LLMs giving us descriptions for data columns is an incredibly dangerous pattern if we think about resilience.

What’s your thought on that?
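
For what it’s worth, the “assisted AI” pattern Shane describes can be sketched roughly like this: the LLM only drafts a description, and nothing is published until a human explicitly approves it. The Suggestion class, draft_description function and reviewer address are hypothetical placeholders, not a real product API.

```python
# Sketch of LLM-as-recommender with a human approval step.
# draft_description() stubs out the LLM call; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class Suggestion:
    column: str
    draft: str
    status: str = "pending_review"   # never trusted until approved

def draft_description(column: str, sample_values: list) -> Suggestion:
    # In practice this would call an LLM; here it is stubbed out.
    draft = f"Column '{column}' appears to contain values like {sample_values[:3]}."
    return Suggestion(column=column, draft=draft)

def review(suggestion: Suggestion, approved: bool, reviewer: str) -> Suggestion:
    # Record the human decision so the provenance (machine-drafted,
    # human-approved) stays visible next to the description.
    suggestion.status = f"approved_by:{reviewer}" if approved else "rejected"
    return suggestion

s = draft_description("CUST_NM", ["Alice", "Bob", "Charlie"])
s = review(s, approved=True, reviewer="data.steward@example.com")
print(s)
```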

Juan: So the pendulum always swings and all this stuff. The way you describe things is from a very technical perspective. So the question here is, why do we need descriptions? And then, okay, why do we need descriptions on the 70,000 columns from SAP? How is this going to provide value, and to whom?

Explain that to me. This goes back to the people side. As technologists, we don’t ask these questions. We should really be pushing back. If somebody tells me we need to add descriptions to 70,000 columns, I should be saying, who the fuck is asking for this?

And why? How is this going to be useful for anybody? That is the problem, and I think it’s a problem of where we focus. This is my annoyance: people say we have to have data literacy. No, screw that. We need to have business literacy. We need the tech people to be able to ask those questions.

So that’s a big change. Then you go off and you say, here is the problem that we’re trying to solve, and you understand how that’s valuable to the company. Then you say, okay, what do we need to be able to solve that problem? And then you scope it down. And then you say, I don’t need 70,000.

I probably need a hundred, or whatever. Then you are incentivized to figure out what those things are. So I think the technology is there; we just have to use it for our productivity. But we are the ones, the humans, who need to talk to other humans to understand what the heck we’re doing and why.

And this is what I would call business literacy. So that’s my answer, which is an indirect answer to what you’re saying: the problem is looking at the problem from a technology point of view.

Shane: One of the other problems that I see in the space around knowledge graphs is actually a semantic problem. So when you say graph, in your head I’m assuming you’re thinking nodes and links, you’re thinking those visual graph models that always look like boobies to me,

you stand in the middle where the nipple is and you move the outside around and it looks like a wobbly jelly. But to me, that’s how you think. Whereas when you say graph to me, I typically think bar chart, line chart, and not a pie chart. So again, that semantic language of graph

actually has two different meanings, two different strong meanings, and probably five other meanings that I don’t know about. Is that what you find? When you say graph, do you find that some people start going down the visualization path?

Juan: Honestly, no. If you’re talking to somebody and they do think about it that way, it’s very evident that you’re miscommunicating and you clarify it, so it’s not something I hit. One thing that does happen a lot is that people think about knowledge graphs in terms of nodes and edges, and then they want to go see it.

I’m like, okay, you wanna go see the graph? Why? Really, it’s because they think it’s cool, but at the end of the day, what problem are you solving by looking at the graph? If I show you one node by itself, it’s like showing you one row of data, or 10 rows of data. By looking at 10 rows of data, what problem are we going to solve?

You’re actually going to go query and do an analysis, do a group by, and be able to do that, right? This is a big pet peeve I have with people: I want to go see the graph. Okay, I’ll show it to you. But so what? You want to see the whole thing? Here’s a hairball. Okay, what problem did that solve?

That’s a pet peeve and I push back so much on it, but eventually, it’s some eye candy, you have to go give it to them.

Shane: In the data transformation space, I call it mad person’s knitting, where we see those lines all over the place and things going left and right.

Juan: People want to see the lineage, and I’m like, yeah, I get it. With data lineage, if you have a very specific, surgical thing you want to go look at, then yeah, definitely go see the lineage, because it helps you with that. But if you want to zoom out and see everything, it’s, okay, here it is.

You see a bunch of lines going across all the different tables. So what you really want is a list of things: this is the most important table, this is the most critical one. At the end of the day, what people literally want is a list of things.

Shane: Look, we’re working on column level lineage at the moment. Under the covers we store everything at a column level when you transform data, because the way we work is we use a config engine. We store that this column went to this column,

and this is what we did to it. We just don’t display it. And the reason we don’t display it at the moment is whenever I think about visualizing it, I go back to what I know. I go back to lineage graphs, I go back to the London Underground: this circle, this line, and this circle, and this line, and this circle.

And when I do that for tables it’s messy enough, right? Mad person’s knitting to a degree. When I do it for columns, it’s horrendous, it’s a hairball. Like you said, it looks beautiful and it says I’m very busy, but it doesn’t tell me anything. And yes, I can filter and isolate it, I can click on this field and show everything, but that’s just what I’m used to doing. It’s an unnatural behaviour, because now what happens is I see a bunch of circles and a bunch of lines, but the semantics, the logic, the meaning, everything that was done, is embedded in that line, in a bunch of code that’s not displayed on that picture. So what do I have to do? I have to drill down and see the words, the logic, the code that made that change. And so I keep thinking about it more as a contract, as a brief. Why isn’t column level lineage just text? Why isn’t it just a document that says, hey, you’ve got a table, and that table is made up of a bunch of things, customer, employee and supplier, and then we go and filter it based on the customer type to give us a list of customers.

Oh, and then we’re going to go grab customers from here and customers from here, and we’re going to join them and merge them, and we’re going to do that based on first name, last name and date of birth. Oh, and by the way, we know that’s actually a really dangerous mechanism. Why don’t we just give somebody a story,

a set of words that tells them what’s happening, because that’s what they’re going to understand. And they’re going to do that quickly, because we know as humans we can scan text.

Juan: So there’s two things here. One, the underlying lineage at the column level, you should keep track of that. And by the way, this is why I constantly say that your first application of a knowledge graph is a data catalog of metadata. Everything you just described should be represented

as a graph, all the column level lineage, the metadata, in a graph, right? That’s why your first knowledge graph application should be metadata, and data catalogs should be built on a knowledge graph architecture. Now, the way to display that information, to convey that story, is different;

it could be in the form of a visual graph or, as you just said, a particular story. I think from a technical mindset people are like, oh, I want to see the graph, but whoa, what is the task you’re trying to solve? If you’re doing a cloud migration effort and they just give you this horrible hairball, what do you want to go do?

Maybe what I should go do is find the easiest things that have the highest value first and figure out how to do that. Those are the types of questions we’re asking. Then there are the typical ones: if I change this column here, what is that going to affect? If this dashboard is wrong, where could that be coming from?

Do you really need to visualize that? Or do you really need to know the list of things it’s going to affect and the people you need to go contact? Again, it’s a list of things. That’s the story you want to tell. So we really need to understand what task is going to be accomplished, and who’s going to accomplish it.

And then there are so many different paradigms we can use to accomplish that. So the visualization is not always the thing to go do. Again, it’s pretty sexy, pretty cool, but at the end of the day, from a metadata perspective, I think all this has to be in a graph. And I always tell people, if you’re not managing your metadata in a graph, you are basically reinventing it and trying to force it into a graph in some other weird way.

It is a graph, period.
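
To illustrate Juan’s point that the catalog itself is a knowledge graph, here is a hedged sketch of column level lineage stored as triples, where the transformation rule is a node of its own rather than a blob of code hidden on an edge. The meta: vocabulary and the table, column and rule names are illustrative assumptions.

```python
# Sketch: metadata (column-level lineage) as a graph, using rdflib.
from rdflib import Graph, Namespace, Literal, RDF

META = Namespace("http://example.com/meta/")
g = Graph()
g.bind("meta", META)

# Nodes for tables and columns.
g.add((META.thing_table, RDF.type, META.Table))
g.add((META.customer_table, RDF.type, META.Table))
g.add((META["thing_table.type"], RDF.type, META.Column))
g.add((META["customer_table.customer_id"], RDF.type, META.Column))

# The transformation is a node too, so the rule is a first class citizen.
g.add((META.rule_17, RDF.type, META.Transformation))
g.add((META.rule_17, META.readsFrom, META["thing_table.type"]))
g.add((META.rule_17, META.writesTo, META["customer_table.customer_id"]))
g.add((META.rule_17, META.logic, Literal("filter where type = 'customer'")))

# "If this column changes, what does it affect?" becomes a simple query,
# and the answer can be rendered as a list or a story rather than a hairball.
for rule in g.subjects(META.readsFrom, META["thing_table.type"]):
    for target in g.objects(rule, META.writesTo):
        print(f"{rule} affects {target}")
```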

Shane: Again, let’s carry on with the constraints, the things that are stopping people adopting these graph patterns. And one of the interesting ones is you mentioned Google search, and the fact that Google search is built on a graph, and I don’t think it was originally, was it?

I think I remember listening to a podcast, probably one of yours, where they talked about the original being a page ranking mechanism, and then they went and bought a startup.

Juan: Yeah, that was Freebase. So what Google did was they innovated on search. But search was just the 10 blue links, and then over a decade ago you start seeing what we call these knowledge panels. They try to understand the intent of the question, and they give you the concepts that you’re actually talking about.

And that change of paradigm was because of the knowledge graph that they built. And actually, they take the marketing credit for the term knowledge graph, even though it’s all built on these technologies from my academic household, the Semantic Web. That’s where all this comes from.

If you look at all the folks working on knowledge at Google at that time, they all came from the Semantic Web academic community, bringing in all these technologies.

Shane: All right, and we’ll come back to that, because I think one of the examples you used on one of the podcasts was fire, as in fire a gun, or fire a person, or light a fire. So I’ll come back to that in a minute, but I just want to go down the path of access to technology. For what we’re building, we build it on Google Cloud.

We made the decision that we are single cloud, not multi cloud, and that had some benefits and some downsides. So we use a bunch of Google services, BigQuery, Spanner, a whole lot of serverless things that make our life so much easier and so much cheaper, thank you. But the thing that really amazes me is we cannot get a graph engine out of Google.

Now, what we do is have a graph model under the covers, but we think relationally; we come from a relational background, not a graph background. So what we’ve done is create some graph-like relationships in a relational model in Spanner, but I really wanted to use a graph database, and the natural way of doing that would be to use a Google service.

But there isn’t one; there’s no service in Google Cloud that gives you a graph database. Now there is Neo4j, a third party that you can stand up as a quasi-serverless capability, but that doesn’t suit us, we want native Google stuff. So why do you think that is?

Why is it that Google, whose whole world is based on a graph, the one service I can’t get from them is a graph database? And if I look across all the cloud providers, you typically have to go to a specialist provider who hosts their graph database on that cloud. It’s not seen as a native service like a relational database, a message queue, or blob storage;

it’s not a first class citizen in terms of a technology stack.

Juan: I know, this is a great observation, and I think it’s just the market, right? This is not where the market is right now, and I hope that will change. Amazon is the only one with a native service, Amazon Neptune, right? Microsoft has some stuff,

they have a couple of things there, but not very popular. And then, as you’ve noted, Google doesn’t have anything. So I guess it’s just where the market is right now. And my hope, well, I’ve been working towards this for two decades. I passionately believe this is the right way.

I genuinely believe that the right way to manage enterprise data and metadata at scale is in the form of a graph, because if you don’t, you’re going to end up just repeating it in different ways. Specifically, you want your metadata to be a first class citizen. And another very important thing is that you want your identifiers

to be first class citizens. And I think this is something that people don’t really realize: why is data integration so hard? Because we don’t think about identity. So we always talk about, oh, what is a customer? Of course, customers can have so many different meanings. Okay, so what are they?

So how would you uniquely identify a customer? We don’t go and have these discussions: what is a unique global identifier for this thing? We don’t think about it. Now, if we actually thought about unique global identifiers from the beginning, data integration would be much easier, because when I’m integrating data I’d ask, which identifiers am I going to use?

Can I reuse something that exists? Am I going to create some new ones? You start thinking about this, and then data integration gets easier. We never think about identifiers from a global perspective. We only think about them internally, in my application, or as a primary key, but we don’t think about it globally.

Now, this is one of the reasons the whole knowledge graph idea comes from the semantic web world: the web is all about identifiers, HTTP URIs. The URL is a locator; that’s what actually brings you to the page that you see in a browser. But the URI is the identifier. We think about those things

on the web; there is no identifier that will take you to two different pages. It’s a unique thing. This is one of the things that is missing, and I think people aren’t even trained to think about it. Think about identity. In a way it’s philosophical too, but we need to have these discussions, because otherwise they get pushed all the way upstream, and it’s, I’ve got this report about customers, but you don’t use my definition of a customer,

Cause we never talked about it, never defined it and never thought about how to uniquely identify them.

Shane: Yeah, it’s funny, isn’t it, that we will put massive guardrails around our cloud adoption and our databases. We’ll put a guardrail on it that says your database has to be secured. You can’t just go in with a public identifier and access it, you have to actually say who you are.

You have to have some form of recovery and backup and, there has to be some way for it to not fail. So we do that, yet we don’t do that with… The actual data itself, we say yeah, you can create a unique ID for customer in two different places and they can be different. And then somebody also do that horrible work.

And, we love to laugh at governments, but if you think about governments, they’ve actually been relatively good at building identifiers for citizens, in terms of social security number or passport or driver’s license over here. Because they need to actually be able to identify a person as a person at some stage.

Yet, as corporates, we have much more control over our data, but we don’t do it, we let that problem happen to us, knowing it’s going to be a problem and not caring. And again, probably speed to market versus resilience, 

So let’s take that URI point of view. Then what you’re saying is, does every thing get its own URI? So there’s just one bucket of identifiers for every class and instance?

Juan: If you look at the knowledge graph standards, RDF, everything has a URI. Every concept, every relationship, every instance, everything has a URI, because you can then point to it. And even to the point that I can have different identifiers, but if they mean the same thing, then I can say, hey, these identifiers mean the same thing.

At the end, it’s all about these connections. That’s why it’s all a graph. An example for folks listening: just go to schema.org. Schema.org is a schema, an ontology, that has been developed over the last decade in a community driven way by Google and Yahoo and Bing and all these folks, where they’ve defined the schema, the ontology, that they want the web to be marked up with.

So any website that’s been created can now use this semantic markup. And how does everybody know what this stuff means? Because I use the identifier for it. You can go through schema.org and see every single concept that has been defined, every single property.

And they all have a unique identifier. If you think about JSON, that key in the key value pair should actually be a unique identifier. So if anybody goes in, it’s unambiguous:

what does this key mean? Oh, it’s a unique identifier. I can look up that identifier, and I know what it means. And it’s machine readable too, so the machine should be able to figure it out on its own, getting things to be more autonomous too.
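
A small sketch of that idea: JSON-LD’s @context maps ordinary JSON keys onto schema.org identifiers, so a key like name unambiguously means https://schema.org/name. The person and organisation values here are just made-up example data.

```python
# Sketch: plain JSON keys become unambiguous identifiers via a JSON-LD context.
import json

doc = {
    "@context": "https://schema.org",
    "@type": "Person",
    "@id": "https://example.com/people/juan-sequeda",
    "name": "Juan Sequeda",
    "worksFor": {
        "@type": "Organization",
        "name": "data.world"
    }
}

# Any consumer, human or machine, can resolve each key against the
# schema.org vocabulary to find out exactly what it means.
print(json.dumps(doc, indent=2))
```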

Shane: OK, so schema.org doesn’t go to enterprise models, it doesn’t say… OK,

Juan: Right, and just like schema.org has been created, that’s what we should also be doing in the enterprise. Enterprise models today, the enterprise architects get together and they draw it on a UML diagram, and then it just stays as pen and paper.

We should take that to the next level. That model is very important semantics. Now let’s codify that, let’s turn it into something machine readable. What usually happens is we turn it into SQL DDL, for example, but then that stays there, and they update the model and the SQL schema separately from each other; it’s disconnected.

And think about it, SQL DDL has very simplistic semantics, right? You have tables and column names. But I want to express more things: this is a relationship, it’s a one to one relationship, I want to have synonyms, I want to have labels in different languages. There are more things that SQL DDL does not let me do.

That’s what a true semantic layer, an ontology, would be. I can define these things, and then you can use that semantic layer to turn it into a JSON schema, turn it into SQL DDL, turn it into an XSD or whatever. At the end of the day, those are just different syntaxes that represent the same meaning.

That meaning, that’s the ontology that should be independent of any application. Okay.

Shane: You’ve used the word ontology a bit, and it’s another one of those terms, like knowledge graph, that I think I get, but then I don’t. So again, channel ChatGPT: give us a description of an ontology for an 8 year old, or for me. How would you describe to somebody what the hell an ontology is?

Juan: Look, simply put, the simplest ontology is a schema. It’s a way of modeling your domain. The simplest way of thinking about an ontology is you have your classes, your concepts, and they have relationships between them. That’s already a simple ontology, and you can think of a simple schema like your tables with your columns.

And then an ontology really represents the way we see the world, and you can start saying more and more things about it. So for example, take a customer and an order. A customer is a concept, an order is a concept, and then there’s a relationship: a customer places an order, or an order is placed by a customer.

So then placed by is a relationship, and I say this relationship goes from orders to customers. And then I can add more expressive information: a customer can have multiple orders. And then I add more information, more knowledge about that. And that’s what an ontology is.

Shane: Okay so it’s a bag of words,

Juan: It’s a bag of words with very clear meaning, with very clear relationships between how they’re all…

Shane: right, and some form of boundary,

Juan: And then you think about this as a spectrum. So you can have a list of words, which is what? Business glossary terms; I have a list of words. Then I can say, hey, these different words all mean the same thing. Oh, so I have a thing, a concept, and there are different words, different synonyms, around it.

Then I can say, hey, these different concepts have some sort of relationship, like a broader relationship, a narrower relationship. So then you start getting into taxonomies, you start finding hierarchies, relationships between them. And then you can start saying a concept can have another type of relationship to another concept that is not a subsumption, hierarchical one.

Oh, then you start making those relationships: an order is placed by a customer. And then you keep going; I can add more and more information to it. So there’s that spectrum of what I call expressivity. People should start with coming up with that bag of words, the list of business glossary terms.

Then people start saying, hey, those different words all mean the same thing, okay, so we start relating them. Then these other words represent these concepts, and here are some hierarchical relationships; you can start building a taxonomy out of these things. And then you start saying, hey, these other concepts have a different type of relationship between them.

Now you’re modeling the graph, and that’s when you realize you’re modeling the ontology, right?
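
Here is a rough sketch of that spectrum in Turtle, parsed with rdflib: a glossary term with synonyms, a taxonomy via broader/narrower, and finally a non-hierarchical relationship. The SKOS terms are a W3C standard; the ex: names are illustrative.

```python
# Sketch of the glossary -> synonyms -> taxonomy -> ontology spectrum.
from rdflib import Graph

spectrum = """
@prefix ex:   <http://example.com/glossary/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# 1. A bag of words: a concept with a preferred label and synonyms.
ex:Customer a skos:Concept ;
    skos:prefLabel "Customer" ;
    skos:altLabel  "Client", "Account holder" .

# 2. A taxonomy: broader / narrower (hierarchical) relationships.
ex:RetailCustomer a skos:Concept ;
    skos:broader ex:Customer .

# 3. An ontology: a relationship that is not hierarchical.
ex:Order a skos:Concept .
ex:Customer ex:places ex:Order .
"""

g = Graph()
g.parse(data=spectrum, format="turtle")
print(f"{len(g)} triples")
```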

Shane: To me it’s like data domains. We talk about data domains, and then I say to an organization, how do you bound your domains? Here are ways I’ve seen people do it. I’ve seen it done on organizational structure, so the domains are bound to the way your org chart works.

So you’ll have a finance domain and an HR domain because you have a finance team and an HR team. Or I may see domains based around core business processes. So I might see order-to-delivery as a domain, and all the things that happen in there, because that’s how you think about it;

there are just choices. What you’re saying is an ontology is a bag of words, and you’ve got some boundary to say what goes in there, and you can have large ontologies and small ontologies, I’m assuming, because you can extend or contract the boundary. And the key thing is it’s every word, so it’s not just the words for concepts, it’s the aliases, the relationships, it’s everything

That gives us a semantic meaning for the words in that bag. Okay. That kind of makes sense to me now.

Juan: It’s a way of representing knowledge.

Shane: Then if I go back to knowledge graphs and again, I come from a technology side because that’s typically where I touch it, one of the things I see massive value for it is you typically define the left and the right. And the relationship between those two things.

And then the technology takes care of the complexity. Let’s go back to that example I heard somewhere: I get fired, I fire a gun, and I light a fire. So there’s three different uses of the word fire.

Juan: They all use the same syntax, F I R E, but they all mean different things, so how do you differentiate them? When you bring that back into reality, our enterprise world, those should have three different identifiers, so they’re unambiguous.

Shane: for the word fire.

Juan: Yes, they have three different identifiers. They happen to have the same label, but what I should be talking about is the identity, not the label, because they can have the same labels. This is why identity is crucial. So if you say fire, I’m like, which fire? You give me the identifier for it, and oh, great,

I know the identifier for that thing, and that’s not the one I’m talking about, so I can ignore it or whatever. That’s why identity is important. It’s critical. It’s crucial.

Shane: So let’s use that example. I get the identifier for fire, and I’m going to get back a label, but it’s only when I get back the relationship to another term that I get the context. If I see fire related to gun, I’m going to go, okay. If I see fire related to employee, or fire related to heat, it’s that second word and the relationship between them that allows my brain to figure out the context you’re using.

Juan: Yeah, exactly, because things are defined in the context of other things. Think about what the triple literally is: I have an identifier, let’s call it A. A is of type some class, and A has a label, fire.

And I can define another identifier for the other concept of fire. It’s going to have the same label, but a different identifier, B. So the question is, what is the difference between A and B? A will be connected to some things and B will be connected to other things, and that’s the context you want.
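
A quick sketch of the fire example as triples: two identifiers that happen to share the label “fire”, kept apart by their identity and disambiguated by what they are connected to. The ex: URIs are illustrative.

```python
# Same label, different identity: the relationships provide the context.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.com/concepts/")
g = Graph()

# Identifier A: fire as in terminating employment.
g.add((EX.fire_termination, RDF.type, EX.EmploymentEvent))
g.add((EX.fire_termination, RDFS.label, Literal("fire")))
g.add((EX.fire_termination, EX.relatedTo, EX.Employee))

# Identifier B: fire as in combustion.
g.add((EX.fire_combustion, RDF.type, EX.PhysicalPhenomenon))
g.add((EX.fire_combustion, RDFS.label, Literal("fire")))
g.add((EX.fire_combustion, EX.relatedTo, EX.Heat))

# Look up every node labelled "fire" and show what it is connected to.
for s in g.subjects(RDFS.label, Literal("fire")):
    print(s, "->", list(g.objects(s, EX.relatedTo)))
```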

Shane: And then we can also have a relationship between A and B. 

Juan: Possibly too, whatever.

Shane: To say it’s a reuse of the same four letters, and therefore it’s a dangerous thing. One of the benefits of knowledge graph technology is I don’t have to worry about all that complexity, because all I ever do is say I have a left and a right

and a relationship between them, and I define those three things: there is an A, the word fire; there is a 123, which is the word gun; and then there’s a third object, which is that relationship, to say those two things are related. And all I have to do is put those three things into the graph:

A is related to 123, and then the graph takes care of it. Wherever gun is used, wherever fire is used again, it does all that relationship mapping and the complexity of those many-to-many relationships in the background for me. And that’s one of the things I like about graphs: I don’t have to worry about boiling the ocean, as you say.

I can just put in a bunch of relationships, two nodes and a link between them, and the graph engine, the database, takes care of all the complexity of everything else. And it allows me to answer those questions. The example I remember from fraud many years ago, in a previous life, was whiplash for cash,

where there were a bunch of people in the US, normally five to seven of them, and what they would do is pretend there was a car crash, pretend somebody got whiplash, and then they’d get an insurance claim.

But what happened is each of the people changed roles. In the first crash, person A was the driver, person B was in the other car, person C was the passenger who got whiplash, and person D was the doctor who diagnosed the whiplash, and they got a claim. And then the next time there was a crash, person A was actually the passenger who got whiplash.

Person B was actually the doctor who diagnosed it. So they keep changing roles. They worked out this pattern, and I’m not sure they actually used a graph for this, but that whole mapping of these people playing these roles across many different events was a good use of that

kind of graph model for me, because then they could identify things like, oh, person A has been involved in a hundred crashes, but each time there’s a different relationship. So for me, that idea of being able to put small bits of relationship into the engine, and the engine showing us how it all relates without us having to do that ourselves,

that’s one of the values of using a graph engine.
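
Something like the whiplash example can be sketched as a graph query: people are linked to crashes through different role relationships, and one SPARQL query counts how often each person appears in any role. All of the names, URIs and data below are made up for illustration.

```python
# Sketch: counting how often each person appears in any role across crashes.
from rdflib import Graph

data = """
@prefix ex: <http://example.com/claims/> .

ex:crash1 ex:driver ex:personA ; ex:claimant ex:personC ; ex:doctor ex:personD .
ex:crash2 ex:driver ex:personB ; ex:claimant ex:personA ; ex:doctor ex:personD .
ex:crash3 ex:driver ex:personC ; ex:claimant ex:personD ; ex:doctor ex:personA .
"""

g = Graph()
g.parse(data=data, format="turtle")

query = """
PREFIX ex: <http://example.com/claims/>
SELECT ?person (COUNT(?crash) AS ?involvements)
WHERE {
  ?crash ?role ?person .
  FILTER(?role IN (ex:driver, ex:claimant, ex:doctor))
}
GROUP BY ?person
ORDER BY DESC(?involvements)
"""

for person, count in g.query(query):
    print(person, count)
```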

Juan: Completely agree. The way to think about this is it’s really about flexibility, and the flexibility enables you to be resilient and adapt to change. So when should you use a knowledge graph, and when should you not use a knowledge graph?

If you’re focused on creating just one specific application, with one well defined set of things and that’s it, don’t use a knowledge graph. Go build it in a non-relational database, go create your JSONs, do Mongo, whatever, just do that one thing, if that’s all you want to go do. But if you’re creating an application where you’re given some requirements, and those requirements are going to change,

you’re going to need to bring in more data, you don’t know how that’s going to evolve, you’re starting to manage and integrate data from different sources, then you want to have a stack that is going to enable you to deal with the use cases of today, but also the use cases of tomorrow.

Therefore, be agile, be resilient. That’s when you should be focused on a knowledge graph, because you can get all those requirements, define your schema, your ontology, for today, then you bring in your data and you’re done. Now, tomorrow you’re going to get new requirements, new things are going to change, new sources you need to bring in.

You can easily have the flexibility to extend that. From a graph perspective, it’s really this common denominator data model. I can turn tables into a graph. I can turn trees into a graph: XML, JSON. If I do some natural language processing, some entity extraction, some relationship extraction, all of that I can turn into a graph and just add those things.

My taxonomy can also be in the graph. So then I have this means of integrating all types of data and metadata and knowledge at the same time, and it’s very easy to expand. Again, if you’re just focusing on a very specific application and your goals are fixed, then go do whatever, however you’re comfortable.

But if you need to deal with the known use cases of today and the unknown use cases of tomorrow, where you need the flexibility to rapidly, in an agile way, extend things, this is when you need to start thinking about knowledge graphs, because you’re going to start thinking about a lot of the what-ifs.

So you're preparing yourself for the future. You're preparing yourself by having schemas that are well defined and can be extended. You're defining identifiers so that people can reuse them and also extend them if needed. This is when you should be using knowledge graphs.
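
Juan's "common denominator" point, that tables, trees and extracted entities can all be lowered into the same graph, can be sketched in a few lines. This is an illustrative simplification, not anyone's product: the identifiers and property names are invented, and a real system would mint proper IRIs and use RDF tooling.

```python
def row_to_triples(table, key, row):
    """Turn one relational row into (subject, predicate, object) triples."""
    subject = f"{table}/{row[key]}"
    return [(subject, f"{table}#{col}", val) for col, val in row.items() if col != key]

def json_to_triples(subject, doc):
    """Flatten a nested JSON document into triples, one edge per key."""
    triples = []
    for k, v in doc.items():
        if isinstance(v, dict):
            child = f"{subject}/{k}"
            triples.append((subject, k, child))
            triples.extend(json_to_triples(child, v))
        else:
            triples.append((subject, k, v))
    return triples

graph = set()
graph.update(row_to_triples("customer", "id", {"id": 42, "name": "Acme", "country": "NZ"}))
graph.update(json_to_triples("order/7", {"customer": {"id": 42}, "total": 99.0}))
# Both sources now live in one triple set and can be queried together,
# and tomorrow's new source just adds more triples to the same graph.
```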

Shane: So when do you not use a knowledge graph? One thing I tried many years ago, and it was an epic failure, was putting the actual data into a knowledge graph. You think, a knowledge graph is just a database, a really flexible database, so why don't we just bypass this disconnection?

Why do we have the metadata and context and all that sitting in my graph, and the physical data sitting in a storage lake or a relational database? Why can't I just push all my data into the graph itself? And back then what I found was, it wasn't really designed for large volumes of rows.

Juan: Yeah, the industry has changed a lot. Think about it: ten years ago, even five years ago, the scalability of graph database systems wasn't as good. They keep improving, and we'll continue to see that. But for me, this was actually my research, and it's the technology we have at data.world, which is all about virtualization. Your data, your transactional or analytical data, is going to live in your data warehouse or data lake, whatever, in your Snowflake, in your BigQuery, and I want to be able to integrate that with other systems.

So you can have a knowledge graph layer on top of it, which you map to, and then you can virtually access it. That's the stuff we do, for example. That was my PhD over ten years ago: I can take graph queries and, using the mappings, translate those semantic queries into the SQL queries underneath and execute them against a different source, so that you don't have to move all your data centrally into a graph database.

You want to have virtualization. And then, as always, it's hybrid: it depends, some things you do want to move. Metadata in general is usually smaller, so I think that natively will live in a graph,

just because you extract metadata from so many different places and the metadata gets connected, especially because things move and metadata is keeping track of all those movements. So that is naturally a graph itself too.
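
The virtualization idea Juan describes, keeping the data where it is and translating graph queries into SQL through mappings, is roughly what OBDA/R2RML-style systems do. The sketch below is a toy, not data.world's implementation: a single triple pattern is rewritten into SQL using a hand-written mapping, and the table and property names are assumptions.

```python
# Toy mapping from ontology terms to their physical home in the warehouse.
# Real systems use standard mapping languages (e.g. R2RML) and handle joins,
# filters and full SPARQL; this only rewrites one triple pattern.
MAPPING = {
    "Customer": {"table": "crm.customers", "id_column": "customer_id"},
    "hasEmail": {"table": "crm.customers", "column": "email"},
}

def triple_pattern_to_sql(class_name, property_name):
    """?c a :Customer ; :hasEmail ?e  ->  a SELECT over the mapped table."""
    cls, prop = MAPPING[class_name], MAPPING[property_name]
    assert cls["table"] == prop["table"], "toy version: single-table patterns only"
    return f"SELECT {cls['id_column']}, {prop['column']} FROM {cls['table']}"

print(triple_pattern_to_sql("Customer", "hasEmail"))
# SELECT customer_id, email FROM crm.customers
```

The design point is that the graph query never touches the rows directly: only the mapping knows where the data physically lives, so the source can stay in the warehouse.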

Shane: But I've seen this wave a couple of times, and I think the spin some of the vendors are putting on it now is active metadata, and the pattern is fairly simple. When we talk about transforming data, like we do in data management, there are two patterns. We write code to transform it, and then we try to extract the meaning from that code into something where we can see it, put it together, and understand it.

That's the current dbt world again. Or we go the other way: we define the relationships, we define the semantics, we effectively define that graph, and the code is written for us. We tend to swap between those two worlds, and my view is we're going to go back to this active metadata, this config-driven code, again,

because we go in waves: the abstraction, then writing code and trying to suck meaning out of the code, creating a schema and trying to reverse engineer that schema into a data catalog. That's where the pattern always ends up; we always end up doing the easy work and then trying to reverse engineer that into the hard piece.

Why do you think that is? Is it just because it's easy to write code? Like you said, we know how it works.

Juan: This is the whole point we started off with: technologists don't understand the business, they want to disconnect. This is where I think LLMs are a godsend right now, because they will help make us more productive in understanding what things mean.

So here's the exercise: go off and ask a bunch of folks who work in different business units, literally ask them, what are the questions you're trying to answer today? And just get a list of those questions. These are what we call competency questions; they're the things they're trying to answer right now.

Then tell an LLM: here are the questions people are trying to ask, create an ontology, create a business glossary, create a semantic layer out of that stuff. It will generate something for you. It won't be perfect, which is why you go in and try to figure out what it means, draw it on the whiteboard, go back to those people and say, hey, is this how you see your world? And then that question you're trying to ask, you can basically traverse the graph to check

whether that's what you mean. This is stuff we can now very easily do because of LLMs. There's so much hype about LLMs, and yes, we've got to be careful about all the noise, but it is a fact of life that they are making everything so much more productive.

That's a pure example of how to do this. One of the other things I've been working on so much, when it comes to LLMs and knowledge graphs, is really understanding: what is the accuracy of LLMs when it comes to question answering? So we're talking about, oh, I have my data in a SQL database,

and I want to be able to ask questions whose answers are in that SQL database. So let me use these LLMs to translate text to SQL. This is where you should be really careful, because they can generate a bunch of stuff. So a lot of the research I've been doing has been to understand that accuracy, and then also to ask: hey, if I invest in a knowledge graph that has all that context,

how much does that accuracy improve? In the research we created a whole benchmark, and the difference between using a knowledge graph and not using one is about three times. Based on the benchmark, you get 16 percent accuracy across all these questions on enterprise schemas if you don't use a knowledge graph. If you use a knowledge graph, that jumps up to 54 percent, and that's with basic prompting. So two things. One, invest in knowledge graphs, because they hold that really clear context your business needs.

And second, the LLMs are there to make us more productive. By using them, we can get out of our technical comfort zone and start doing more on the people side without getting overwhelmed, because we have a tool that can help us automate some of the work.
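
The competency-question exercise Juan walks through can be made concrete with something as small as a prompt builder. The sketch below only assembles the prompt; which model you send it to, and how, is up to your own LLM client, and the example questions are invented.

```python
def ontology_prompt(competency_questions):
    """Build a prompt asking an LLM for a draft ontology and business glossary."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(competency_questions, 1))
    return (
        "These are questions business users are trying to answer today:\n"
        f"{numbered}\n\n"
        "Draft an ontology for this domain: list the classes, the relationships "
        "between them, and a one-line business glossary definition for each term. "
        "Flag any term whose meaning you had to guess."
    )

# Invented competency questions for illustration.
questions = [
    "Which customers churned last quarter and why?",
    "What is the average time from order to delivery by region?",
    "Which products are most often returned?",
]
print(ontology_prompt(questions))
# Send the result to whichever LLM you use, then take the draft back to the
# people who gave you the questions and review it on a whiteboard.
```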

Shane: I remember back in the old days when I worked with analytics teams, we used to talk about mixed models: you'd run a bunch of different models and see what the fit was. And the trick most of the teams used, when they were explaining to a stakeholder what the model was doing, was to show a decision tree.

Now, the decision tree wasn't the model that was being run and productionized, but it was close enough to what the model was actually doing that they could describe the flow of the data. They could say, oh, for the segmentation, the first thing it's doing is picking up income, then it's splitting on gender, and then location.

It gave a story, a visual story people could understand, and I think LLMs are going to give that to us. Like back then, when you had to test all the different models to see which was best for your context, it's the same with LLMs. The text-to-SQL ones we're playing with at the moment, we've seen how good they can be and how bad they can be, but the thing is they're not consistent.

Simple things like asking it the same question three times and getting back three different SQL blobs. Yeah, that tells you it's not accurate.

Juan: In the benchmark we did, we actually tested this. These are non-deterministic systems. It's not like I give you a question, I get a generated query back, and if I always give you the same question you'll always generate the same query. In the benchmark we had 43 questions, which we ran for three weeks nonstop, between 30 and 300 runs per question, because we also wanted to see

the spread of accuracy they get. So this is definitely one of the things we need to keep in consideration.
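
That non-determinism is easy to test on your own stack: run the same question repeatedly and look at the spread of generated queries and how often they return the right answer. In the sketch below, `generate_sql` is a stand-in for whatever text-to-SQL call you are evaluating, so the numbers it prints mean nothing until you plug yours in.

```python
import random
from collections import Counter

def generate_sql(question: str) -> str:
    """Stand-in for your actual LLM text-to-SQL call (deliberately non-deterministic)."""
    return random.choice([
        "SELECT count(*) FROM orders WHERE status = 'open'",
        "SELECT count(order_id) FROM orders WHERE state = 'OPEN'",
        "SELECT sum(1) FROM order_lines WHERE status = 'open'",
    ])

def measure(question: str, expected_sql: str, runs: int = 30):
    """Run the same question many times; report distinct outputs and hit rate."""
    outputs = [generate_sql(question) for _ in range(runs)]
    distinct = Counter(outputs)
    # Comparing query text is a simplification; real benchmarks compare
    # the results of executing the queries, not the SQL strings themselves.
    accuracy = sum(1 for sql in outputs if sql == expected_sql) / runs
    return distinct, accuracy

distinct, accuracy = measure(
    "How many open orders do we have?",
    "SELECT count(*) FROM orders WHERE status = 'open'",
)
print(f"{len(distinct)} distinct queries, {accuracy:.0%} matched the expected one")
```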

Shane: Yeah, and it's valuable, but it's not a silver bullet.

Juan: Hundred percent. Hundred percent.

Shane: Just to close this out: if we think about knowledge graphs in terms of the pattern of nodes and links, and in terms of the technology that enables us to create those nodes and links and visualize them more easily than doing it manually ourselves,

it's been around for a long time, like you said, but it really hasn't been adopted in the data world. It's a second-class or, I'd say, even a third-class citizen. It's a powerful pattern, but one that's hardly ever used in the data domain. Do you think LLMs are the thing that's going to make it become a first-class citizen?

Juan: Hundred percent. Hundred percent. Our CEO Brett Hurt says startups are grit, and I actually think that applies from a research perspective too, especially when this is my goal, what I want to do. I've been at this semantics and knowledge graphs stuff for almost two decades.

The web was something that changed people's lives, right? The iPhone changed some stuff. Social networking changed some things. LLMs have come in, and they're going to change so much. My bet here, and this is what I'm literally betting on now, is that LLMs are that godsend, saying, okay, this came out,

and this is going to be one of the motivators. People are going to realize we have this problem: how do we solve it? And people will say, knowledge graphs, that has always been there, we can use that. It's going to be much more critical in the enterprise, because in the enterprise, what is the biggest concern with LLMs?

Hallucinations. And a lot of the conversations around LLMs are about these RAG architectures. It's all about text, but people aren't even talking that much about SQL, even though that's where a lot of our enterprise data is, because they know accuracy there is a huge issue.

And already our research has been showing that by investing in knowledge graphs, you're increasing that accuracy, and this is just going to get better and better. This, for me, is the opportune time. That's why I'm super excited and working so hard on this, because the gods have literally sent us something that says this knowledge graph work is going to start booming because of LLMs.

Shane: And if we take that example of our 70,000 SAP tables, maybe we don't go and look at the query logs to see who's querying which columns to work out what's most important. Maybe we look at the chatbot we're using to answer questions, and whenever that LLM is regularly using some fields out of those 70,000 tables to answer a question, that's when we go and extend them with well-formed descriptions,

with those relationships, with the aliases, because we know those fields are the ones getting used time and time again by the LLM.

Therefore, they're the ones we have to enhance. Enhancing them has more value to us because we're helping more of our consumers. Maybe it's less about query logs now and more about where the LLMs are heading, and then how we apply the knowledge graph there.
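
Shane's suggestion here, letting the LLM's own query traffic decide which of the 70,000 SAP fields get descriptions first, could be sketched like this. The log format and field names are assumptions for illustration; in practice you would pull the generated SQL from whatever the chatbot actually logs.

```python
import re
from collections import Counter

# Assumed log: one generated SQL statement per chatbot answer.
llm_query_log = [
    "SELECT kunnr, name1 FROM kna1 WHERE land1 = 'NZ'",
    "SELECT vbeln, netwr FROM vbak WHERE kunnr = '42'",
    "SELECT kunnr, ort01 FROM kna1",
]

# Count column references (very rough: only the SELECT list of simple queries).
column_hits = Counter()
for sql in llm_query_log:
    match = re.search(r"SELECT\s+(.*?)\s+FROM\s+(\w+)", sql, re.IGNORECASE)
    if match:
        columns, table = match.groups()
        for col in columns.split(","):
            column_hits[f"{table}.{col.strip()}"] += 1

# The most-used fields are the first candidates for descriptions and aliases.
for field, hits in column_hits.most_common(5):
    print(f"{field}: used {hits} times - enrich this one first")
```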

And again, going back to, hopefully, defining the config and having the config create and run the code for you. It's all about that context, it's all about that config. It's not about the lines of SQL you're generating as a human, because they're abstracted, they're divorced, they're siloed by default.

And yes, you can reverse engineer them, but why would you? Why wouldn’t you just do it properly at the beginning? I always talk about the web and I talk about iPhone moments. I always forget about social networks. I’m going to steal that one from you.

Web, iPhone, social networks, LLMs, they're the things.

Juan: And that's just been the last 30 years.

Shane: We've just got to realize that right now the LLMs are Newtons. They're the horrible early smartphones that don't quite work, but at some stage they will, the iPhone will appear.

Juan: Think about 1991, 1992, when the web came out. How many people were like, oh, that's a fad, I won't pay attention? And look how it changed our world.

Shane: I still remember the dial-up modem and nobody in the house being able to make a phone call because you're surfing the web. That's a funny memory. Excellent. I think I understand knowledge graphs a lot more.

I definitely understand ontologies, and I'm definitely going to do the LLM trick: I'm going to give it a bunch of questions and get it to write me some of those things, so I can get more examples to help refine my understanding. But the key thing is, we're still working in a complex world if we're trying to combine knowledge graph technology with the rest of the stacks we use in data management, and it's still not a common thing to do.

So hopefully that does change, hopefully we can leverage that out. Excellent! All right, thank you for coming on the show and I hope everybody has a Simply Magical Day!

Juan: Thank you very much.