Reliability Engineering of AI Agents with Petr Pascenko
Join Shane Gibson as he chats with Petr Pascenko on the pattern of Reliability Engineering of AI Agents
Listen on your favourite Podcast Platform
| Apple Podcast | Spotify | YouTube | Amazon Audible | TuneIn | iHeartRadio | PlayerFM | Listen Notes | Podchaser | Deezer | Podcast Addict |
Podcast Transcript
Read along you will
Shane: Welcome to the Agile Data Podcast. I’m Shane Gibson.
Petr: Hi, my name is Petr Pascenko.
Shane: Hey Petr. Thank you for coming on the show today. We want to have a bit of a chat about reliability within AI agents. But before we do that, why don’t you give the audience a bit of background about yourself and your experience?
Petr: Sure. I am head of data science, machine learning, and AI in a mid-sized software engineering company located in Prague. My job is to architect AI applications and do machine learning projects for our customers, mostly from banking and the wider corporate sphere. My background is in mathematics. I studied at the mathematical faculty in Prague and did a lot of conventional data science before the AI boom: statistics and machine learning. From the beginning of the LLM boom we jumped in and started to build applications and solutions based on it, mostly for corporate purposes like work automation and building assistants and agents for people in banks and corporations.
Shane: You’ve been working on a couple of specific use cases where you’re introducing the idea of agents and agentic behavior. Do you want to give us an overview of some of those use cases, to give us a starting point to talk about them?
Petr: Yeah, I’m quite impressed by the progress of AI in the last couple of years. We started with the GPT chat window, and people were playing with the tool basically on their own: what you asked the AI is what you got. Later we saw the beginning of AI assistants, where somebody put the GPT inside some sort of app, gave it data context, specified the input and output, and defined the task for you.
So you as a user were able to use a specialized AI app. And now we are in the third year and we are building AI agents: applications which are able to do quite reliable tasks on their own, and the users are really relying on them. I have two examples from our recent work.
The first is a DORA regulation tool, which we built for one of the German banks. As you probably know, the European Union is a bureaucratic empire; there is really a lot of stuff that must be bureaucratically processed for the European Union. One of them is what’s called the DORA regulation, which means that every bank has to make a check of all the contracts
it has with its counterparties, with its contractors, and it’s really a lot of work, because the regulation has something like 76 or so requirements and you must check your contracts against every requirement. And to do that, you must get to the current version of the contract, because these contracts are typically signed at some point in time and then there are plenty of amendments and changes over the years.
The first, and I think very interesting, task is to build the final version of the contract. You take the initial version and all the amendments and apply them one by one, and you can get the version of the contract at a particular date. So you can ask what the version of the contract was on 1st January 2018,
and you get it by applying all the changes which were made before that date. You get the final version and you compare it to the 76 requirements, and each requirement can be split into something like 10 questions which check if the contract is ready. The output of this tool is something like traffic lights: something which tells you it’s quite okay,
it’s not okay and should be changed, or it is completely in contradiction with the regulation and must be altered. And this is the background for the negotiation team, which starts negotiations with the counterparty, which is quite a complicated process. We finished this project last month, and
the people from the bank, the legal team, say that the work for every contract was reduced from something like 14 days to something like two days using this tool. And it’s reliable and it works. And this is something which is, I think, remarkable for the industry. One general takeaway is that you can really rely on the AI for things which are crucial and quite complex, because legal texts are really difficult to understand for most humans, and still we are able to get this level of precision and make it very useful.
So this is something which encouraged me at the beginning. I think one of the most interesting things about AI is that at the beginning you never know if the technology is mature enough to get to the finish of the problem. It’s like when Columbus set out on his ship and sailed westwards: he wasn’t really sure if there was India or America out there, or if the world was flat and he would just fall over the edge.
So this is our situation in every project. Until now, the technology has always shown that the potential is there.
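As a rough sketch of the workflow Petr describes, not the bank's actual system: rebuild the contract as it stood on a given date by applying amendments in order, then check it against each requirement's sub-questions and roll the results up into a traffic light. The function names, the injected `merge` and `ask` callables, and the traffic-light thresholds below are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass
class Amendment:
    effective: date   # date the amendment takes effect
    text: str         # amendment wording

def contract_as_at(base: str, amendments: list[Amendment], as_at: date,
                   merge: Callable[[str, str], str]) -> str:
    """Rebuild the contract as it stood on `as_at` by applying earlier amendments in order."""
    current = base
    for amendment in sorted(amendments, key=lambda a: a.effective):
        if amendment.effective <= as_at:
            current = merge(current, amendment.text)   # merge step, LLM-assisted in practice
    return current

def traffic_light(contract: str, questions: list[str],
                  ask: Callable[[str, str], bool]) -> str:
    """Check one requirement: ask its ~10 yes/no questions and roll them up into a traffic light."""
    passed = sum(1 for q in questions if ask(contract, q))
    ratio = passed / len(questions)
    if ratio == 1.0:
        return "green"    # requirement satisfied
    if ratio >= 0.5:
        return "amber"    # partially satisfied, should be changed
    return "red"          # in contradiction, must be renegotiated
```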
Shane: So I just want to unpack that a little bit. In terms of AI, I talk about three core patterns: Ask AI, Assisted AI, and Automated AI. The way I describe it is fairly similar to what I think you’ve said. So I talk about Ask AI, which is a chat interface, but the core pattern is a person asks a question, they get a response.
They are probably going to ask a second follow up question, get another response. They will go back and forward until they get something at the end that they’re happy with or not. And then they carry on and do the next action, the next task themselves. So they chat with the AI to get some information or some help, and then they carry on doing the work.
Assisted AI for me is this idea that we’re doing the work and we’re using AI technologies to watch the work we’re doing, and it’s making recommendations. So it’s watching our space, it’s suggesting some things we could do differently based on what it sees. But again, the human’s making that choice of yes, I’m going to accept that, or no, I’m not.
Then the third one is Automated AI, which sounds to me like an agent. So that is the machine doing all the work. There is no human involved; the work just gets done by the AI. Did I get that right? Are those the three core patterns you see being used at the moment, and do you treat them as a bit of a maturity model, where we started off with Ask AI, then we moved to Assisted, and now we’re finally into automation through agents?
Did I get that right?
Petr: Yeah, I think these are just different words for the same meaning, and we do all of them. And I think all of them are important, because it’s not only about AI but also about users, and the users must get some trust. What you call Ask AI, the dialogue with the AI, the first stage, is important for people just to play with it. As developers of AI who started to play with GPT three years ago and have plenty of hours of experience, we sometimes underestimate that most people didn’t do that. They haven’t tried it, or they tried and they stopped because they didn’t see any practical usage for their work.
So if we just give them a few tools to have in the chat, they can spend some time and get some trust. For example, one of the most successful applications we do in banks is really simple, even a little bit stupid, but I don’t want to underestimate it. We call it dashboards.
It’s just a webpage with plenty of buttons, and they automate some very simple tasks like summarize this email. And the only difference between this and plain GPT is that the prompts are a little bit altered and pre-scripted for the particular task.
So the user doesn’t have to write a prompt from the beginning; you start with some prepared prompt and maybe a little bit of automation, like copy-pasting from the office or bank systems and other things. People really love it, because it automates some tedious parts of their job.
And they also have the feeling of full control, because they see the prompt, see the input, see the output. And if the AI makes a mistake, they even like it a little bit more, because they can say something like, oh, I’m still necessary in my chair. And by the way, this is something I see at every conference when I speak about AI: the most popular part is showing people that AI makes some mistakes.
Shane: In the data world, the use case that most companies or vendors jump straight to is what we call text to SQL. So this idea of asking a natural language question in text and, effectively, the LLM writing the SQL to query the data and give back an answer. And, as you can imagine, when you have a lovely curated data set that vendors use for demos, it’s amazing.
As soon as you put it in the real world against actual real data that has no real semantic meaning, and the data is typically a mess, the answers aren’t so accurate. And so as humans we go, oh, it’s not accurate, that number is wrong, I can prove that the answer it gave me for total orders is incorrect.
And there’s a gentleman called Juan Sequeda from data.world who’s done a lot of work on reinforcement models for LLMs in the data space. And one of the questions he always asked was, why do we trust a human when we don’t know how accurate they are? We ask a data analyst to go give me an answer.
We don’t demand to see the SQL. We don’t demand to see the data they worked with. We trust that the human’s done the job, and often the job is not accurate. Yet as soon as we see a machine do it, we expect a hundred percent accuracy, and we get upset when it doesn’t happen. So I want to talk about that idea of hallucination, and especially in that use case you just talked about.
But before we do that, I just want to get you to give me your definition. When you talk about AI, are you talking specifically about LLMs or are you talking about something more broad?
Petr: I studied AI something like 14 years ago. There were no LLMs at the time, and it was already called AI. At the beginning, for me, AI was every sort of cognitive automation done by machines, so including not only neural networks but also search algorithms, like playing chess and that sort of stuff.
But at the moment I’m most of the time speaking with clients, and for clients AI means LLMs; that’s their definition. So I have accepted it into my vocabulary, but as a sort of expert in this field I of course know the broader scale. For me it’s a sort of satisfaction, because for the whole history of AI, which started at the end of the 1930s, artificial intelligence was always a moving target.
If it didn’t work, it was artificial intelligence. And when you made some algorithm which worked, it was renamed to something else, like search algorithms or optimization or other things; that was the serious stuff, the stuff which works. And AI was again pushed to the places we cannot tackle.
And that kept repeating for the whole century. And currently we are at the first moment in the history of mankind where the general public has accepted that AI works. That’s great. That’s an achievement.
Shane: Yeah, I think Siri and Alexa were an attempt at AI, because it was the machine listening to our voice and trying to do some tasks for us, but for some reason it didn’t stick. It’s amazing how it’s the simplicity of the ChatGPT interface that, for a lot of us (again, my mother doesn’t use it), was the kind of switch flip.
But I’ll take your point: we used to do data mining, we used to do stats, we used to do it a long time ago. It was different, but it was still a form of analytics. And I always remember the consulting companies coming out with their maturity models, the steps of descriptive, predictive, those models. And funny enough, I haven’t seen anybody apply that maturity model yet to LLMs, but I can guarantee I’m gonna see it. So let’s go back to that use case, this idea of using an LLM to take quite complex regulations and then distilling that down into a series of tests, effectively, that can be applied against another textual contract, and then a scorecard result to say here are the regulations you pass, here are the regulations you fail.
Given LLMs have a habit of hallucinating, how have you dealt with that? How have you dealt with reinforcement and reliability around the LLM on both sides? Because you’re gonna want it to be as accurate as possible interpreting the regulations in a repeatable way, and then also as accurate as possible interpreting the core contracts.
How did you deal with that from a process point of view?
Petr: Yeah, reliability is basically our number one topic. I think the only obstacle to seeing AI in everything is making it reliable enough. And there are only two things that can be done at the moment. The first is very precise decomposition of the problem,
and the second is testing, testing and testing. What I mean by decomposition is that you must really understand the problem and decompose the whole problem, the whole story, the whole data flow, into the most basic tasks, and only let the AI work on basic stuff, not let it do anything which is complex, and give it all the necessary information and all the necessary instructions and be very specific.
The problem people encounter at the very beginning with AI is that they start with the problems that are difficult for them and want the AI to deal with those. I call it the over-the-fence approach: you are walking down the street and you have a fence on your left side and you have a problem in your hands.
What you do is take the problem and throw it over the fence. Then some mysterious things happen on the other side, and then you expect to get the proper answer back in your hands, packed in a nice package, and it never works this way. As a designer you always must go deep down into the problem, understand it very precisely and decompose it into small pieces.
That’s not different from any software application approach. So this is the first thing you must do to tackle hallucination. The problem is that you never know if your decomposition is complete, and if the data context really describes everything. If anything is missing, then there is space for hallucinations;
they will eventually come. So it’s an iterative process, and in the iterative process you must test in every phase and after every change. And that’s probably the biggest challenge for us at the moment: to really properly test AI solutions. We started to build some tools for it, because the whole problem of testing is that nobody really wants to do it.
Everybody who develops software, or has a team of developers, knows that developers don’t want to test their own solutions. They want to build it. And the typical approach of a software developer is: here it is, it works, prove me wrong. And you have to have a bunch of testers who prove them wrong again and again, so they improve it.
And with LLMs it is a little bit more difficult, because the software developer says, okay, my work is done, and it’s the stupid AI which makes the mistakes. And you must prove him wrong by a lot of testing. Especially if you have legal documents, it means that somebody has to read hundreds of pages of legal documents, searching for a sentence which shouldn’t be there, or a statement which is misleading, or something which is not exactly as it should be. And this is tedious and it costs a lot of time. In our statistics, about 50 percent of the time on every project is dedicated to testing.
Shane: I think it’s that problem we have in most domains; I call it the big blob of code. In the data world, people like to write large blobs of code with many lines that start at the beginning, do all the work in that one bit of code, and then give you the final answer, rather than decomposing it down into small chunks of moving parts and daisy-chaining those parts together.
Because when you break it down, it’s easier to see, effectively, an input, small change, output, input, small change, output, and you can go through and test each one of those and figure out when something breaks. So if I think about that for an LLM, one of the anti-patterns is the idea of large prompts: having a very large prompt with a whole set of instructions in it as a single prompt, rather than decomposing the prompt down into smaller chunks. But is that true? Does it actually make a difference?
Is it better to design systems where you have a series of smaller prompts and then maybe bring them together at runtime? Or is the main way of designing it right now a single prompt statement with all the moving parts within that single prompt?
Petr: It’s a double-edged blade, because the advantage of a large prompt is that all the information is there. From the perspective of transforming the data, you have all the information in and you can make the transformation in one step. You save tokens, you save time, which is important if the application is time-critical.
For example, we are building chat agents for hotels, and if you get your WhatsApp answer within a minute it’s okay; if it is two minutes, it’s too long. So decomposition can be tricky, especially if it is iterative. So this is something to the advantage of bigger prompts, but of course it’s really difficult to test.
So the ideal, from the point of view of reliability, is to decompose it into small prompts and make a test set for every step in the chain. So we have the chain of transformations, and every step has a clear contract: clear input, clear output, clear data context. And you can make a test for each of these and make really sure that it works reliably.
So this is the right way, but as people who write texts usually say, it takes time to make it short. So this is the same case: if you want to make a clear design, it costs you a lot of thinking and designing time.
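A minimal sketch of what that decomposed chain might look like in code, assuming a generic `llm(prompt) -> str` call and a semantic comparison helper; the names and structure are illustrative rather than the team's actual framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    prompt_template: str            # one small, single-purpose prompt
    run: Callable[[str], str]       # the step's contract: text in, text out

def build_step(name: str, prompt_template: str, llm: Callable[[str], str]) -> Step:
    """Wrap one small prompt as a step with a clear input/output contract."""
    def run(payload: str) -> str:
        return llm(prompt_template.format(input=payload))
    return Step(name, prompt_template, run)

def run_chain(steps: list[Step], payload: str) -> str:
    """The whole task is just the chain of small transformations."""
    for step in steps:
        payload = step.run(payload)
    return payload

def test_step(step: Step, cases: list[tuple[str, str]],
              semantically_equal: Callable[[str, str], bool]) -> float:
    """Unit-test one step against 20-50 input/expected-output pairs; returns the pass rate."""
    passed = sum(1 for given, expected in cases
                 if semantically_equal(step.run(given), expected))
    return passed / len(cases)
```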
Shane: Again, in the data world, because I’m just mapping it to the domain I work in primarily: we have people that create these big blobs of code, and we talk to them about creating what we call configs, small chunks of instruction sets that are well formed, like you said, that basically have an input contract and an output contract and as little change in the middle as possible, and storing those as metadata or config or a set of instructions, and then at runtime hydrating them together into a single prompt, a single query.
That makes sense, because when we’re transforming data it is often cheaper for us to run a large set of transformations on that data. Because of the way the databases charge us, if we read it once, do a whole lot of bad things to it and write it back, we get charged a lot less than if we’re constantly reading and writing a million rows.
But that’s an execution problem, that’s what you’re doing to optimize the runtime; the way we build the system is still in small chunks, and we just hydrate it all back together into that large query at the end. So again, for LLMs, I’m assuming that’s a good pattern: write small prompt statements that are stored in config metadata somewhere, but at runtime you’re effectively grabbing them, combining them, hydrating them, and then running them as a single prompt set, because you get that speed of response and all of that.
Is that true? Is that a good pattern?
Petr: In general, yes, but there is still a complication: you have a risk that, if you have a chain of small transformations and each step only gets the data you think is necessary for the next transformation, you miss some sort of context. Because the repeating pattern in this way of designing is that you decompose it into the chain,
you think it’s perfect, and then you see problems: aha, this step doesn’t get some of this information. So you start to add some information and some context and some data, and it can simply happen that at the end all the steps basically get all the input data from the previous steps,
because they need it. At that moment you have failed in your decomposition and you must throw it away and try again. And this process has its cost. If you are lucky with a big prompt, then a lot of designing time can be spared, at the price that the resulting application
can be difficult to test and to prove that it works reliably. So it is rather double-edged.
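As a sketch of the hydration pattern Shane describes, small prompt fragments can be kept as config and assembled into one prompt at runtime; the fragment keys and wording below are made up for illustration.

```python
# Hypothetical prompt fragments, e.g. loaded from YAML or a metadata store and versioned separately.
PROMPT_FRAGMENTS = {
    "role":        "You are a contract analyst for a bank.",
    "context":     "Work only from the contract text provided below.",
    "task":        "Answer the compliance question with yes or no, then one sentence of evidence.",
    "constraints": "If the contract does not address the question, answer 'not covered'.",
}

def hydrate_prompt(fragment_keys: list[str], contract_text: str, question: str) -> str:
    """Assemble small, individually testable fragments into one runtime prompt."""
    header = "\n".join(PROMPT_FRAGMENTS[key] for key in fragment_keys)
    return f"{header}\n\nCONTRACT:\n{contract_text}\n\nQUESTION:\n{question}"

# Usage: one big prompt at runtime, built from pieces that are stored and tested separately.
prompt = hydrate_prompt(["role", "context", "task", "constraints"],
                        contract_text="<contract text here>",
                        question="Can the bank audit the provider annually?")
```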
Shane: That’s one of the differences between LLMs and data in my head, is if I have a set of data and I have a set of SQL queries and I run them, I know that SQL query is going to read the data. It has nothing else that it can do, so the only input to its contract is that data. But as we’ve been working with LLMs, there’s this hint that sometimes the LLM will ignore the input.
So let’s say I’ve got a 50-page PDF document, and I’m bringing that in as the core input to the LLM, and I’m giving it a bunch of prompts that say read it, do this, do that. Then I ask a subsequent question of the LLM after that first prompt: so, PDF in, scanned it, here’s my prompt, it gives me an answer.
I ask a secondary question. Often you get the impression that it’s reading the previous prompt and the previous response as much as it’s reading the PDF document, and it’s also going out to, effectively, the big dictionary that it’s built as part of the core foundation model. So this theory that when we have data and a SQL statement there’s only two things that talk to each other:
as soon as we use an LLM, it’s not true anymore, is it? Because there are the input documents we give it, all the RAGs; there are the prompts that we’ve done now and previously in that session, and also in other sessions if it’s holding that anywhere in its engine; and there’s the foundation model for the LLM. And they’re all things it can use to actually give us a response.
Did I get that right?
Petr: Yeah, if LLMs teach us anything, it is that context is everything. For every decision, for every part of the process, the whole information context is the most important thing. And people especially struggle to forget, or to imagine not knowing, something they already know.
There is a test for children where you tell them some information and then ask what another child, who doesn’t have that information, would answer to a question. And until about seven years of age, the child thinks that everybody knows everything, that the second child will know what the first child knows.
And we still have a bit of this in our adulthood. If we read a PDF and make a digest, we think that the digest is complete and can be used for the next steps, but we cannot un-see the PDF, because we still have the whole context of the document in our head from reading it. The AI doesn’t work like this.
If it doesn’t get the whole document, it has only the digest, and in the digest probably something is missing from the contextual point of view. So it’s very easy to make this sort of mistake. And again we are back to testing: we can see these problems arising by properly testing every step in very different situations.
Shane: So let’s talk about that, because it sounds like what you’re doing is decomposing the problem down from one big problem into a bunch of small steps and then effectively writing unit tests against each of the small steps. And then I’m assuming you’re running some kind of full system test with all the moving parts after you’ve run the unit tests; it’s not quite a regression test.
Is that what you’re doing?
Petr: Yeah, that was my dream as a designer: to decompose the problem into parts with a well-defined contract, and for every part to write a set of unit tests, which means something like 20, 30, 50 input-output couples, and to check that the output of the part is correct
and semantically similar to the expected output in the test. This is the normal way software applications are designed: you decompose and test the various parts, and if everything works you can say you have 100 percent test coverage and you expect it to work reliably.
The problem is that with LLMs we don’t have an equals operator for texts. You cannot write something like assert equals: three equals three, okay; three equals seven, not okay. This is the problem. And last year, in the summer, I thought it’s simple, we’ll just write something using LLMs.
So let’s write an equals operator that I can use in the test cases. I gave students this task for two weeks, write me the equals operator for text, and didn’t specify it further, because at the beginning I expected it would be easy, and it wasn’t. It cost us about a year of working on the truth itself: what does it really mean that something means the same as something else, and what does it really mean that something contradicts something, to get to the point where we have a quite good library for comparing text. And this library can be used as the key part of the testing process. To show you some examples of how difficult text comparison is, you have two simple sentences, something like John is homeless and John owns a Ferrari.
From the mathematical point of view, as we were brought up, it’s completely non-contradictory. You can imagine John driving his Ferrari under the bridge, getting out, lying down on a bench, covering himself with a newspaper and going to sleep, and in the morning sitting in the Ferrari and going to work. It’s completely possible.
But it’s extremely improbable. In the practical sense, people who have a Ferrari are rich, people who are homeless are poor, and this is contradictory. So if you are building this application you must decide what the answer of the model should be for situations like this. And there are a lot of practical situations with this pattern, not so extreme, but still happening.
This is just one of the many problems you must tackle when you are building this equals operator for text.
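The evalmy.ai library Petr mentions has its own API; purely to illustrate the idea, a home-grown equals operator for text is often built as an LLM-as-judge call that returns a structured verdict. Everything below, including the judge prompt and the injected `call_llm` function, is a hypothetical sketch, not that library.

```python
import json
from typing import Callable

JUDGE_PROMPT = """Compare the EXPECTED and ACTUAL answers.
Return JSON with the fields:
  "contradiction": true or false (do they assert practically incompatible facts?)
  "missing": list of facts in EXPECTED that are absent from ACTUAL
  "extra": list of facts in ACTUAL that are absent from EXPECTED
EXPECTED: {expected}
ACTUAL: {actual}"""

def semantic_equals(expected: str, actual: str,
                    call_llm: Callable[[str], str]) -> bool:
    """A crude 'equals operator' for text: no contradiction, nothing missing, nothing extra."""
    raw = call_llm(JUDGE_PROMPT.format(expected=expected, actual=actual))
    verdict = json.loads(raw)
    return (not verdict["contradiction"]
            and not verdict["missing"]
            and not verdict["extra"])

# The 'John is homeless' vs 'John owns a Ferrari' case shows why the judge prompt has to say
# whether practically incompatible statements count as contradictions.
```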
Shane: So those unit tests, those equals operators for text: are they widely available in the market now? Or is this one of the areas where people are innovating and experimenting to build their own, because the market really hasn’t caught up with this idea of actually having test suites or test capability against these LLM patterns?
Petr: It’s an interesting question, a big question. Because we saw how useful it is to have these tests, we made a public page and a public service that says, for a few cents you can use our equals operator in your applications. And what we expected was that a lot of people would use it, but we are struggling a little bit with the marketing.
I don’t think the market is at the point where everybody has really got to “I have to test the parts”. I think we are still in the early stage of designing applications, and people are just happy that it sometimes does something useful, and also hoping that the technology will improve and the quality of the answers will improve as well.
But as I know from other companies that are at the point where they really have to deliver reliable results, they are either building their own solutions or finding something on the market which can help them. That’s something you cannot avoid if you really want to stand behind your results and say: it really works.
Shane: It’s like you said, we’re still in the early phase of the market. I was reading an article today about how Accenture is forecasting more revenue than OpenAI in the gen AI space. And the context was CXOs and large organizations want to do the AI thing, so they’re paying Accenture on average a million euros to get some people to come in and play with something and show some possible value, but never deliver anything into production.
And that just shows how early the market is, that they can get away with building the equivalent of PowerPoint in the gen AI space, because either the CXOs don’t understand what can be delivered in a shorter timeframe with a hell of a lot less money, or actually they don’t want the risk that it gets delivered, they actually have to use it, and then bad things happen.
It’s a really interesting market right now, where in theory people don’t want it to actually be tested, because it may prove that what’s been built is highly inaccurate. So we’ll see how that goes. If we keep going down this idea of testing, one of the things we find is because we run Google Gemini,
we run that as a private service for what we do, and there’s a bunch of models within Gemini, so we can use Gemini Flash, Gemini Pro. And because we’re using a cloud service, effectively Google will deploy new variations of the model in the back end of the service. So that brings a whole lot of risk, because now you have this concept of regression testing, where you’re not testing the changes you’ve done, you’re testing the model in the back end.
If I liken it to data, it’s like the database deciding to implement the where clause slightly differently next week, and we have to prove that every query we run is still valid, because one of the foundational pieces we rely on has just changed. Is that what you find? That if you’re using a model that is a service, not one that you’ve built, that incremental change that keeps happening causes major regression problems?
Petr: Yes. When I said that in the development phase we need approximately half of the time for testing, in the production phase it’s three quarters; it’s even more difficult, a bigger ratio. Not only because of changes in the foundational model, but also because of new data.
If you have even a simple application like a chatbot, the data context will change over time, and the whole process, for example loading the data from some sources, can be unreliable. So it can happen that overall your application is based on a thousand pages of documents,
and in the next release some of the documents don’t come and the coverage is now only two thirds, for example, and you must detect that. Otherwise your artificial intelligence starts to hallucinate. And if the application is critical for people in this industry, that could be quite a disaster, or at least harmful. For example, one of our applications is for bank operators,
and they ask the chatbot questions about bank documents and internal regulations. If something is missing and the application starts to hallucinate, they will be telling clients these hallucinations, which is problematic. So yes, regression testing is the same, and in regression testing it is even more important to have really uniform coverage over the input data and the context of the expected tasks.
So this is another topic: how to make a really uniform, representative set of tests. If you have only 50 questions for the 1,000 pages of input documents, you must take quite a lot of care to have, for example, one question for every topic or every part of the document, to have it really reliable.
Shane: OK, so it’s a volume game. The more tests you write to test a specific unit test case, in theory the earlier the warning you’ll get when something’s out of kilter: either the input data is not what you expect, or the foundational model has changed slightly, or something else has changed. So let’s take the regulations. In my head there’ll be a whole lot of numbered clauses, 70.1, 70.1a, that kind of thing. And so in my head I’m going, you want a unit test for each one of those clauses: I ask a standard question against that clause, and I get back a very similar response.
Like you said, it can’t be an exact equals, this text, this number of letters or this number of words, but the sentiment or the context of the response has to be very similar to what I got the last five times. That’s where I was going, and then what you’re saying is there’s an extra step to say, actually, make sure that clause exists.
So if the clause doesn’t exist then you’re going to get a potential hallucination that has nothing to do with the model; it’s about the data not being there. But is that what you do? So you just go, one of the tests is, does this clause exist? And that’s a yes or no, that is a black and white thing.
And then if I run these series of tests, which are effectively these series of questions against that one clause, do I get similar responses every time? Is that what you’re doing?
Petr: Yes, the test is built like a unit test, so you have the input, and you have the expected output, which is something that was generated by AI and then validated by an expert, so you have an answer you really know is true. And then you have the GPT or AI answer, and you are comparing the expected answer to the AI answer and asking: are they semantically similar?
Are there any contradictions? Is there something missing, and is there something extra, some hallucination? And this can be described as metrics. We have a C3 metric which has three parts: C1 is contradiction, C2 is correctness, and C3 is completeness. Completeness means that everything which is in the expected answer is also in the AI answer, and correctness means that everything in the AI answer is also in the expected answer.
These two are basically precision and recall, standard metrics from machine learning, and they can be combined into the F1 metric, which is the harmonic mean of the two. So you can make this partial metric based on the facts. And the contradiction is a sort of harder metric, which can be used to really prevent some strong problems. If
one answer says that you can cancel the contract and the other one says you cannot cancel the contract, it’s a strong contradiction, something like a hard no. While the F1 is about a lot of partial facts, elementary statements, that are compared between one answer and the other. And this is also a very tricky part, because when I say elementary statement, that is something which must be precisely defined.
If I have a text and the text says that John lives in Canada and he’s 30 years old, then there are two elementary statements: that John lives in Canada and that John is 30. This is simple, but if you have more complex cases, it’s quite difficult to say what is elementary and how to split them, and you must have it at least somewhat consistent over multiple non-deterministic runs of the AI.
So this is where statistics comes in: you must have more than one test question, you have to have 40 or 50 to get some reliable statistics about how it is working.
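Petr's C3 metric maps onto precision and recall over elementary statements. A minimal sketch of the scoring arithmetic, assuming the statements have already been extracted and matched (which is the hard part he describes):

```python
def c3_scores(expected_facts: set[str], answer_facts: set[str],
              contradictions: int) -> dict[str, float]:
    """Completeness = recall, correctness = precision, combined into F1; contradictions kept separate."""
    matched = expected_facts & answer_facts
    completeness = len(matched) / len(expected_facts) if expected_facts else 1.0   # recall
    correctness = len(matched) / len(answer_facts) if answer_facts else 1.0        # precision
    f1 = (2 * correctness * completeness / (correctness + completeness)
          if correctness + completeness else 0.0)
    return {"c1_contradictions": float(contradictions),   # hard failure signal, reported separately
            "c2_correctness": correctness,
            "c3_completeness": completeness,
            "f1": f1}

# Example: the expected answer has 4 elementary statements, the AI answer reproduces 3 and adds 1 extra.
print(c3_scores({"a", "b", "c", "d"}, {"a", "b", "c", "x"}, contradictions=0))
# correctness = 0.75, completeness = 0.75, f1 = 0.75
```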
Shane: So that’s almost the old data mining patterns where, if you’re looking at a churn model, you’d go and identify some people who had churned, you’d create your model, and then you’d effectively test whether the model was going to predict that they’re going to churn with the historical data and then say, yes, we’ve now got something to aim for.
So that idea of a human looking at that regulation and knowing what is true is critical: you’re producing a baseline, or a known state, or something like that, that you can then test against.
Petr: It depends on the domain, of course. If it is legal text, you really need to check the test set with the expert; you can’t do it on your own. On the other hand, if the application has some basic domain, for example a cooking application, then probably you can make the check on your own.
Shane: There was a use case in New Zealand of a supermarket having LLM generated recipes on their website. So you put in a couple of ingredients and go, what can I cook? And then inadvertently it started combining some ingredients in a way that were dangerous and poisonous.
With a lot of the predictive models we used to do, we used to say a predictive model to send some marketing out to a prospect can have a lower level of accuracy than a predictive model that is predicting the efficacy of a drug we’re about to give a patient,
because the blast radius, the impact of getting it wrong, is so much different. I’m not sure with LLMs that’s quite true anymore, because of that recipe example, or, what was the other example? Was it an airline company that used an LLM-generated chatbot, where somebody managed to prompt-engineer it into giving them a flight for a dollar, and then went
to the Commerce Commission and enforced it, because the bot on the website was giving them that offer and it was valid. So I think we’re going to start seeing a whole lot of use cases where people who aren’t testing their LLMs properly cause negative consequences for the organizations that pay them to do it.
But just on that, again, back in the old data mining days, we used to grab a data set, create a percentage sample, percentage test, percentage holdout, so that we could create the model, run it again, and then test it against a holdout set of data to see whether it had overtrained or not. Is that part of what you do with the LLM? I suppose when you talk about a subject matter expert looking at the clauses in the regulation policies, that effectively is potentially creating a holdout data set? No, it’s not really. So that idea of sample, test and holdout doesn’t really apply to the LLM space, does it?
Petr: I’m not sure I really understand the question.
Shane: In the old data mining days, we’d say here’s a million transactions or a million rows of customer behavior. We want to create a churn model. So we have a bunch of customers who we know left. So we take those million rows and then we’d basically sample them. So we’d have a third of it that we developed the core model on.
We’d then take that churn model and apply it to the second third to test it.
Petr: I don’t think it’s the same situation, because in the train-test split we are relying on the fact that we can test thousands and hundreds of thousands of transactions in one run, during the night for example, and that’s something which can be done quite cheaply.
On the other hand, in the case of LLMs we are paying for every test. It’s not much per test, but you cannot have 10,000 questions for your thousand-page PDF, so you cannot cover it completely, sentence by sentence. You must make some representative set of questions, like 40, 50, 100, not more, because otherwise the tests would run forever and you would pay too much.
So this is a decision you must make as a designer. And the way we do it is that we have a tool that reads the document, and from every section and every part it prepares something like five questions and the proper answers, and then we ask the expert to take the most important questions and check the answers.
He only reads the questions, to decide which of them are really important, and then he checks the answers only for those questions, which saves time. And then we have the representative set. I agree with you that it would be even better to have a secondary holdout set, because when you are developing the application it can happen that on the first set we have quite nice performance and on a second set we would see a drop. But at the moment we don’t see that drop, so we don’t have it in our production way of building applications.
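A sketch of the question-generation step Petr describes: propose a handful of candidate question-and-answer pairs per document section, then let the domain expert keep only the important ones. The data shapes and the injected `generate_qa` call are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    section: str
    question: str
    expected_answer: str
    approved: bool = False   # set once the domain expert has validated the answer

def propose_cases(sections: dict[str, str],
                  generate_qa: Callable[[str], list[tuple[str, str]]],
                  per_section: int = 5) -> list[TestCase]:
    """Generate roughly five candidate question/answer pairs for every section of the document."""
    cases = []
    for name, text in sections.items():
        for question, answer in generate_qa(text)[:per_section]:
            cases.append(TestCase(name, question, answer))
    return cases

def expert_review(cases: list[TestCase], important_questions: set[str]) -> list[TestCase]:
    """The expert reads only the questions, keeps the important ones, then validates those answers."""
    return [case for case in cases if case.question in important_questions]
```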
Shane: That’s an interesting technique. As we’ve been talking, I came back to this core pattern and I’m wondering if it’s true. Because we know that LLMs are non-deterministic, which in my understanding means we can ask it the same question five times and not guarantee we’ll get the same answer.
Although some people are talking about how they’ve built deterministic models, and I don’t understand that, so we won’t go there. But effectively what you’re doing is running a non-deterministic engine to get the answer, and writing deterministic tests. Is that what you’re doing? Because the tests are deterministic, because what you’ve said is this human looks at that text and says, here’s the answer,
that is deterministic. That is the answer. There’s no inference in there. And then when you’re running everything else that’s non deterministic, you’re comparing it against that ground truth, against that deterministic statement to see how closely it fits. Is that what you’re doing?
Petr: Yes. At the beginning we thought that it would be more non-deterministic. In the design of the tests we made them run multiple times and average the result, or take the minimum and say, okay, the minimum quality of the answer is 0.8, so this is our takeaway from the test.
But as time went on we saw that the non-determinism is not very strong. And if you have 50 questions, then the averaging effect of these 50 questions will counterbalance the non-deterministic quality of each of them. So you can have a situation where, for example, one particular question will be correct 80 percent of the time and
have some mistake 20 percent of the time. Another question has a similar pattern, and another one has a similar pattern. So if you have 50 questions and average the results, you will on average get approximately the same number. The non-determinism will only manifest itself if you look at one particular test question.
But if you have 50 of them, it’s already okay. So it’s not a strong non-determinism, and the errors are uncorrelated, so the central limit theorem will save you.
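A toy illustration of that averaging argument, not their test harness: if each of 50 questions independently passes with some probability, the suite-level score barely moves between runs even though individual questions flip.

```python
import random

def run_suite(pass_probabilities: list[float]) -> float:
    """Simulate one non-deterministic test run; each question passes with its own probability."""
    return sum(random.random() < p for p in pass_probabilities) / len(pass_probabilities)

random.seed(0)
probabilities = [0.8] * 50                       # e.g. every question is right 80% of the time
suite_scores = [run_suite(probabilities) for _ in range(10)]
print(suite_scores)                              # scores cluster tightly around 0.8

# Individual questions flip between runs, but with 50 uncorrelated questions the standard error
# of the suite average is roughly sqrt(0.8 * 0.2 / 50), about 0.06, so the average is stable.
```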
Shane: You talk about 50 percent of the development effort being testing, and 75 percent when you get to production. I’m used to 80 percent of any work we do being data: you want to create an analytical model, you want to create a visualization or a dashboard, that’s great, but 80 percent of our time is always spent
making the data fit for purpose. So given you’re already spending 50 percent on testing, how much of the work for an LLM, for an AI agent use case, is data preparation versus testing, and then what’s left? Is it really just prompt engineering?
Petr: This is a great thing, and what I really love about LLMs is that you don’t need to spend so much time on data preparation. For most of my career I did data science, and all the data science projects went something like: okay, we will deliver you the results eventually, but first we have to understand the data, and the data understanding will cost us three weeks, one month, two months, depending on the complexity.
And it’s quite a lot of money. And the customer’s answer many times was, oh, we don’t think we should have to pay for that, because we don’t have any assurance of the result and you will spend so much time on data cleaning and data preparation.
With LLMs the great thing is that you can take a few examples, start from scratch and make a working prototype, make something which does something, and show it. It costs you one or two days to make this very simple prototype, and you show it to the customer and he sees it and says, okay, this works,
I believe in this idea, let’s go deeper. So the start of the project is much cheaper and much, much easier. This is probably the thing I love most about LLM projects from the project management perspective.
Shane: What I’m seeing a lot in the market now is this move to agents, this move to chunks of things that do a specific task rather than a chatbot that does many things. It seems like a natural replacement for RPA; there was a whole move a while ago for Robotic Process Automation.
So this idea that, typically, you’ve got SAP, there are 15 horrible screens some human has to type into, you can’t afford to build integration via API because you’ve got to pay the SAP tax. So you find a cheaper way of doing the integration, which is some form of screen recording that pretends it’s a human and takes an Excel input and clicks on the screen and fills in those fields.
There seems to be a kind of buzz in the marketplace that LLM agents are going to replace that. Obviously not in the same way, but they’re going to take these micro pieces of work that are human-heavy and completely inefficient, and the agents are going to replace them. Do you think that for the next 12 months, let’s look out to the end of 2025,
the sweet spot for LLMs is still that agent sitting between a whole lot of information and a human doing the task? You talked about going from 14 days to two days: the agent doing the heavy lifting of looking at volumes of information to inform a human what to do next.
Is that where you still see the sweet spot? Or do you see the full automation of tasks using gen AI and LLMs coming a lot faster than the next 12 months?
Petr: You have named two main, as you call them, sweet spots. I think automation is number one. I like automation tasks because they have a very clear definition and the business case is easy to calculate, so this is something we try to do a lot. The most impactful is
working with unstructured data like PDFs and texts and images and scans and all the other things. So processing of CVs, processing of contracts, processing of other things. At the beginning of a lot of work, especially in banking and other industries, you start from plenty of documents and you are processing them
into some structured form and doing some checks: that your mortgage contract has the proper clauses, or that the CV has the proper technologies that you need for your project, et cetera. It’s not very difficult, but there are plenty of people who are doing this all the time, and I think this is one of the biggest places where LLMs will be very handy.
I don’t think they will be handy in the robotic processing of structured data, because when the data is structured, most of the things can be described explicitly in code, and you don’t need to understand semantic complexity like you do with unstructured data.
So I think mostly at the beginning, and then at the end, when you are formulating the answers for humans. This is the second part you mentioned: LLMs can overcome the semantic gap between the machine and the human. For example, as you mentioned, the SQL assistants, the applications where you can ask questions of a database and it will generate SQL, and you can either check the SQL and ask the database on your own, or you can just consume the results.
And it doesn’t have to be a SQL database, it can be any other data source. They are just gates into the world of knowledge bases for humans. So these two examples, at the beginning of the process and at the end of the process, are the proper places for LLMs.
Shane: When we look at software engineering, a while ago, the idea of test driven development came into the software engineering domain. So that idea of instead of writing a specification, then getting a developer to code the specification, and then getting a tester to read the specification and check the code and go, they don’t match.
And nobody knowing whether the specification changed or the code was wrong, because often the specification was a guess, and as we worked on it we learnt some more things, so the code changed because the requirements changed. So test driven development came in with the idea that I can write a test and then I can write the code, and when the tests pass, I’ve done my job.
In the data domain we never really adopted that. Test driven development for data has never really been widely adopted, and there are some challenges, to a degree. For example, if I’m going to write a query of how many customers I have, where do I find the expert that tells me what the assertion is,
that tells me there are actually 42 customers or 1,042, and that given that set of data, that actually is the answer. Often that’s hard; often what we do is write the query, go a million and 42 customers, and our subject matter expert goes, sounds about right, let’s go. So there’s a bunch of reasons why we’ve never really adopted test driven development in data, and it looks like we’re not developing test-driven approaches for LLMs at the moment either,
so everybody’s just throwing a bunch of unstructured text documents, videos, audios into the magic LLM bucket, doing a bit of prompt engineering, maybe some RAG reinforcement, maybe a couple of checks, and then going, it’s not deterministic, it’s good enough. Is that what you were saying? That very few people are actually taking a test-driven approach to create this idea of reliability within non-deterministic AI agents?
Petr: You have hit a painful place, because I work in a software engineering company and I was the first data scientist, the one who started the data branch 10 years ago. And for the whole time we were fighting this battle, because the software development world is really strictly process-oriented: having all the phases in the right order, having GitHub pipelines and the process of merging code, version control, configuration management and plenty of other stuff, all very precise, to get the software to be really reliable at the end of the day.
And the data people are, as you call it, a little bit more statistically oriented: they just put the data in one big pile, run some statistical tests on it, build models, then throw them away and build another one. We spent quite a lot of time making some processes in the data domain as well, to be able to deliver reliable results.
It works, in the sense that if you want to have really reliable machine learning pipelines you have to adopt these processes as well. It’s difficult. It’s called machine learning operations, and it has an extra layer of complexity, because you don’t only deal with versions of code and versions of models, but you also work with versions of data, and you have the problems of: we are training on this data, testing on this data, predicting on this timescale, et cetera.
It’s complex, it’s difficult, but if you want to build a model which really works on a daily basis and makes predictions which can be used, the process is necessary. The disadvantage of this approach is that you are slightly drifting away from the way the data people on the customer side are thinking.
So you are more like an IT guy and less like a data guy, and they start not to understand you very well. You must have the proper level of process, not to lose them, but still be able to deliver the results.
Shane: What you’re saying is, when we’re data people, we have to be somewhere between DevOps (software engineering, automation, everything, rigour) and Excel: completely ad hoc, don’t trust any formula. And you’re saying that within the LLM space we’re going to have that same problem,
because if we have to engineer everything in the DevOps approach, we put in so many robust and reliable processes that we lose the value of the speed that we get from an LLM. So it’s a time to market versus accuracy trade off decision is what you’re saying.
Petr: I think at the level of assistants, the applications that people are using to help themselves in their work, you are more in the data domain. But in the space of agents, when you want them to make reliable decisions without a necessary check by a human, you are more in the domain of software development, because you really must rely on the decision of the machine.
So it must be reliable. And one extra idea: I think the high level of precisely specified processes, together with the price of developers, which is high everywhere around the globe, is probably the reason why the replacement of software engineers is one of the first things we will see in the market.
Because if you have a proper global GitHub development team, which has all the processes set correctly, from specification to development, integration, merging, testing and so on, then it is rather easy to add some agents to the team and just put them into the process in the right place.
So I think it’s just a matter of a year or two before, in GitHub, you can click to add to your team three or four AI agent developers who will just follow the same process. Because the process is so well defined, that will be the least of the problems. While in other domains, where everything is in people’s heads and not specified or written anywhere,
the task is much more difficult.
Shane: So that’s an interesting one, isn’t it? That
with software engineering, because they’ve moved to DevOps, where things become immutable, they get well factored, they’re proven, they’re repeatable, they’re visible, it’s the perfect thing for LLMs and gen AI to target, to take over, to replace, because it is so proven.
Whereas with everything in Excel that’s in the finance person’s head, and those magic formulas that nobody understands, it’s really hard to build anything that potentially produces the same numbers, because nobody knows how to test it. So intriguing, and you’ve actually given me a great visualization. So if we go back to this idea of Ask AI, Assisted AI and Automated AI, I can draw a continuum where Ask AI is on the left end, Assisted is in the middle and Automated is on the right.
And then I can map it to what you just said. So on the left, ask AI is effectively Excel. Assisted AI is ad hoc ops, that data behavior, product behavior where we’re semi automated, but there’s still a whole lot of creativity. And automated AI is DevOps, where it’s fully repeatable
which means, ChatGPT really is the new Excel because everybody uses Excel when they want to go crunch some numbers themselves. And from what I’ve seen, everybody goes and uses ChatGPT when they want to gen AI or LLM themselves, so actually, ChatGPT is the new Excel, which is why it’s been so successful.
Petr: It’s a really good metaphor, good for a t-shirt, I think.
Shane: So just before we close out, because it’s been a great conversation. Is there anything else you want to cover before we close the session out?
Petr: Probably one thing, and that is building the trust of the customers and the users. That’s probably, I think, the most important thing. And what we do is what I call the switchboard strategy. You take your automating application, you decompose it, and for everything you have a switch which has three settings: fully automated, partially automated and fully assisted.
At the beginning, all the switches are in the fully assisted regime and you are using it, and when some part is working well, you can switch it to the semi-automated or automated regime. The goal is, at the end, to have the whole thing working in the fully automated regime. And the user is in control, so he can switch to more automated or less automated, depending on his decision.
Semi-automated can be something like: all the contracts will be processed automatically, but if they are over a million of the currency, then I want to check. And this, I think, is the key to building the trust of the users gradually. I don’t think it is possible to come to a customer and say, from now on we will switch to AI and it will do the job from tomorrow.
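A sketch of the switchboard idea: each decomposed component carries a switch with three settings, plus rules (like the contract-value threshold) that force a human check. The component names, modes and threshold below are illustrative assumptions, not the actual product.

```python
from enum import Enum

class Mode(Enum):
    FULLY_ASSISTED = "fully_assisted"      # AI drafts, the human does the work
    SEMI_AUTOMATED = "semi_automated"      # AI decides, the human checks when a rule fires
    FULLY_AUTOMATED = "fully_automated"    # AI decides on its own

# One switch per decomposed component; everything starts in the fully assisted regime.
switchboard = {
    "extract_clauses": Mode.FULLY_AUTOMATED,
    "check_requirements": Mode.SEMI_AUTOMATED,
    "draft_negotiation_points": Mode.FULLY_ASSISTED,
}

def needs_human(component: str, contract_value: float, threshold: float = 1_000_000) -> bool:
    """Decide whether a human must review this step for this contract."""
    mode = switchboard[component]
    if mode is Mode.FULLY_ASSISTED:
        return True
    if mode is Mode.SEMI_AUTOMATED:
        return contract_value > threshold   # e.g. contracts over a million get checked
    return False

# Usage: the user flips switches as trust grows, and the rules stay visible and testable.
print(needs_human("check_requirements", contract_value=2_500_000))   # True, over the threshold
```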
Shane: So what you’re saying is there’s going to be a new market segment, because we all love to build small pieces of software when something comes out to solve a specific problem. And so you’ve just talked about agent orchestration, we’re going to see some workflow tools that come out and actually orchestrate the agents.
But the level that the agent is going to operate at, and the approval processes, and the if-then-else statements (so if contracts are under a million do this, otherwise do that): we’re going to see orchestration and workflow engines for agents. It’s got to turn up.
Petr: And also, my final word: try evalmy.ai for the testing. There’s the library for automated comparison, the equals operator for text. I think it’s a very good tool and we will be happy to get feedback from anybody.
Shane: And on that, so if people want to get hold of you, if they want to chat to you, what’s the best place to find you on the interweb?
Petr: Well, LinkedIn, and also there is a contact on the evalmy.ai page.
Shane: OK, and I’ll put the links in the show notes for anybody that wants to get hold of you.
Excellent, that’s been great. I’ve learned a lot more about reliability engineering and AI agents. I suppose we had the term SRE, right, software reliability engineer, coming out of the big FAANG companies.
So maybe we’re going to get an RAE or, you know, a reliability agent engineer, something like that. Someone will come up with three letters that, as always, confuse us.
Petr: I am sure that will happen. Or something similar.
Shane: All right. Look, it’s been good chatting and I hope everybody has a simply magical day.