Agile and Analytics – Shaun McGirr
Podcast Transcript
PODCAST INTRO: Welcome to the “AgileData” podcast where we talk about the merging of Agile and data ways of working in a simply magical way.
Shane Gibson: Welcome to the AgileData podcast. I’m Shane Gibson and today I’m talking to Shaun McGirr. How’s it going, Shaun?
Shaun McGirr: Good. Thanks, Shane, good to see you again.
Shane Gibson: It’s been a long time. So I think today we want to talk about Agile and analytics, and that will riff into teams and people and process and all the good stuff. But before we do that, how about you give a bit of background about yourself for people that don’t know you?
Shaun McGirr: Sure. Today, I work at Dataiku, leading a team of AI evangelists. Before that, I had a long career doing all kinds of things with data on many different technologies, and leading data science teams: being the first data scientist in a place and realizing on the second day of that job that they didn’t need data science yet. By that time, luckily, I had worked for you at Optimal BI in Wellington and had learned lots of other useful things about all the rest of the data stack and the data practices. I got my start in data working for Statistics New Zealand, as a holiday job, doing admin things. One of the tasks someone gave me one day involved printed-out lists of numbers: they asked me to tell them what numbers were in ‘List A’ but not in ‘List B’. At the time, I didn’t know what a merge or a join or really anything was, and I didn’t want to rock the boat upfront. So I looked through and struck things off with a ruler and a pen; I guess this was 2001. And then I realized those printed-out numbers were probably in a file: can you email me the files? I put them side by side on the screen and copied the data from there to there. And that was how I realized that if you stay curious and interested in solving these things that we call data problems, people just give you more interesting work to do.
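[For readers who want to see the task Shaun describes as code: finding the values in List A but not in List B is what SQL folks call an anti-join. A minimal sketch in Python, with invented sample lists:]

```python
def anti_join(list_a, list_b):
    """Return the items of list_a that do not appear in list_b."""
    b_set = set(list_b)  # set gives O(1) membership checks
    return [x for x in list_a if x not in b_set]

# The manual ruler-and-pen version of this took an afternoon;
# the code version takes milliseconds.
print(anti_join([101, 102, 103, 104], [102, 104, 105]))  # -> [101, 103]
```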
Shane Gibson: And as you said, we worked together many years ago in sunny Wellington, I think from memory that was before the term data scientist actually existed, wasn’t it?
Shaun McGirr: It was absolutely right on the cusp. I had come there towards the end of my PhD program, and had thought, I’ll just finish that off in 6 or 12 months, because that’s easy. It took longer. In the meantime, I fished around for ways to apply the data, econometrics and statistical modeling skills I’d gained as part of my PhD. At the time, I think we called it predictive analytics, advanced analytics, things like that. I Googled some things and Optimal BI came up, and I sent you an email and said, “Hey, I know how to do these things. Do you do those things?” And then I think we started working together, maybe early 2014. And it was just when that term data science was really starting to take off, and like all terms like that, it has a usefulness.
Shane Gibson: Yeah, so let’s talk about that one. What do you reckon about data scientist as a term? Is the vision being realized, and we’re moving on? Are we still in a world where that term is still finding its feet, or did it crash and burn?
Shaun McGirr: I think the term is not going away. It’s here to stay, like many others. I remember lots of predictions over recent years about how there would be increased specialization, so people would become a marketing data scientist or geology data scientist or finance data scientist; that still doesn’t seem to have happened. And the reason I still like and respect and maybe use the term is that I still believe, God forbid, in that Venn diagram: the value of a person who can live in multiple worlds at once, who knows about statistics and the usefulness of statistics in understanding how machines have learned patterns from data, and who has the ability to make those machines do some things at some scale. And most importantly, and the one I always lean on the most and am most thankful my training gave me, is the domain expertise. Do you even know anything about what you’re talking about before you look at some data or write some code? The great lesson for me has been that any domain expertise, from anyone I’ve hired or worked with, in anything, helps them gain domain expertise in other things. So the term is used and abused. Employers claim they can’t find enough, but then everyone knows people with great backgrounds and skills who can’t get hired. So it’s all messed up; it’s not helpful. But I still believe in the value of a person who can do those three things in that crazy Venn diagram.
Shane Gibson: Yeah. And for me, we started out with three, but I’ve formed it into four sets of skills in data science. So definitely with you in terms of the stats background: the ability to experiment, treat it as a science project, prove the hypothesis, and use statistical ways of doing that. Definitely around the domain, understanding the business and why it’s important. And the next big section is actionable insight; I think we’ve done a podcast about this. What are we going to do with it? What value is it going to have for the organization? How’s it going to help us make a better decision, or increase the value, or reduce the risk, or whatever? When we talked about the data scientist unicorn in the early days, the equivalent of a full stack engineer, that person could do all four things. There was a whole thing around business facilitation, plain English language, being able to talk to C-levels in a way that they understood; I think that was a key skill of those unicorn data scientists. And then the fourth one for me was engineering. They could write code, they could do a push, a pull request, a Git merge, and they engineered the code so it was reusable. And what I’ve seen, when you talk about data science specialization, is that it’s partly true, because now we’re starting to see specialization again in the engineering domain. So we now have data engineer, analytics engineer, ML engineer, QA engineer, DataOps engineer; we’re going into hyper-specialization. But I’m with you, we don’t see the specialization for the data scientists, because we’re specializing on that fourth part of the Venn diagram, the engineering part. We’re not really specializing on domain: you don’t see a supply chain data scientist, or a marketing attribution data scientist. It hasn’t happened, which is interesting.
But what I do see, from an Agile point of view, is that we got good at combining the data people and the visualization people and bringing them together. Often one team did the data work and handed it over to the vis team, who did a pretty dashboard. And then the dashboard went to the user, and the user went, well, that’s not the right number, and it’s not the right way, and we fought and blamed everybody else. So we started merging those people together, creating teams or squads that work together. But the data scientists tended to always sit outside. It was like the old days: we had ETL developers, because I am that old, and the analytics people used to sit in another team.
Shaun McGirr: They floated somewhere, didn’t they?
Shane Gibson: Yeah, somewhere secret. The ETL team used to hand over data in the warehouse to the analytics team and see what happened. Do you still see that specialization, where data scientists sit outside the core data team, in the organizations that you work with?
Shaun McGirr: Yes. The number of places they sit is bewildering. In my career, I’ve been that kind of person and sat in many places, most recently aligned to strategy. And it’s really interesting. The most charitable interpretation I can give is that one of the best things you can give data scientists to do, if they do those four things, or if the team of them covers those four things you said (because there’s no one person who does all four), is to work on very complex questions and solve those questions and problems. Otherwise, how do they justify their investment? And to do that, they’re going to need to look in all the corners and under all the rocks of the data. Having been that person, having received that data from that ETL team, or data engineers, or whoever, I’ve always been suspicious about who made the decisions about what data was not in this dataset in particular. So the most charitable interpretation is, if you get your data scientists working on those hardest things, part of their value is actually questioning all of the other decisions that everyone else has made about data. And that doesn’t make them very good friends of anyone else in a data organization.
Shane Gibson: Yeah. For me, when I’m working with new data and analytics teams, I get them to focus on handoffs. Even when they’re not pipelining, where they’re handing across teams, and they have the skills within the team to do all the work that needs to be done, they still hand off across people. So one of the things I focus on is: how do you hand off? How do you describe what you need the next person to do in a way that they can understand? And we know it’s not a 100-page requirements document; I just need five columns, and one of them is called ‘H’. So some of the stuff I’m seeing now is this term feature engineer, which really, to me, seems like a person who writes code that creates the feature flags and a bunch of columns to enable a data scientist to run the algorithms or the machine learning models. But what I struggle with, again, is that now we have a handoff problem. How does the person who needs data for the model explain the features they need in a way that somebody else can write them, and prove that those features are actually the ones they want, that there’s no bias, and that the rules used to create those features don’t have effects on the model that they haven’t seen? It becomes easier for them to just create their own code. Because they are exploring the data, they see the features that may be important, that may influence the models; they create those features, they put them in the model, and they see what the impact is. That closed loop happens amongst themselves. It’s faster, and in theory safer for them, but it’s not scalable, because now we have a person trying to boil the ocean. So what do you see? Do you see feature engineering being devolved outside of the data science role?
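[To make the handoff concrete for readers: feature engineering here means turning raw records into the columns a model consumes. A minimal sketch in Python, with invented data and column names; the point of the discussion is that the aggregation rules chosen in code like this silently shape the model downstream:]

```python
from collections import defaultdict

# Hypothetical raw rows a feature engineer might receive:
# (customer_id, order_value).
orders = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0), (2, 8.0)]

def build_features(rows):
    """Aggregate per-customer features a data scientist might request:
    order count and average order value."""
    totals = defaultdict(lambda: [0, 0.0])  # customer_id -> [count, sum]
    for customer_id, value in rows:
        totals[customer_id][0] += 1
        totals[customer_id][1] += value
    return {cid: {"order_count": n, "avg_order_value": s / n}
            for cid, (n, s) in totals.items()}

print(build_features(orders))
```

Even in a sketch this small there are hidden decisions (do refunds count? are nulls dropped?) that the data scientist never sees if someone else writes it, which is the handoff problem being described.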
Shaun McGirr: This reminds me of a conversation I had with a Facebook data scientist, maybe three or four years ago at a meetup group, when they gave a presentation about whatever crazy cool stuff they work on. Afterwards, I asked them one of my boring day-to-day enterprise data scientist questions: how do you get the data that you need? Because you have even more data than I am trying to go after, and I can get people to give it to me, and you surely have all the data. But how do you know what to look for? And how do you make sure you’ve got the right stuff? And the guy looked at me like I was an alien. He said, we have an engineering team, and we just Slack them what we’re after, and 15 minutes later they point us to an S3 bucket where that data is sitting. So many of these new jobs and new ideas are generated out of companies with effectively unlimited resources to do this work, which is not the reality of 99% of everyone working in data. So that’s one end of the spectrum: gargantuan companies with infinite resources. At the other end is the scrappy individual who has to make it all work themselves, otherwise none of it will work. There’s another interesting one, which is the company called Stitch Fix, who have a recommendation-engine-driven clothing subscription service, a very impressive data team, and a chief algorithms officer. And they believe the opposite of either of those two positions, which is that a data scientist should be able to acquire all the data that they need, experiment with models, productionize and maintain all those models. And of course, that means their 200 data scientists are supported by 100 platform engineers who build all the things that make that possible. So I’ve not seen feature engineer as a job. It strikes me as an extraordinarily bad idea, for traceability and for productivity.
And also, I think, if whoever is pushing for it is a data scientist who thinks they’re getting away with avoiding some hard work: if you’ve got other people choosing what data to put in the model, and machine learning is increasingly automated, you’re pretty soon going to be out of a job.
Shane Gibson: Yeah. When we’re working with teams, I often get them to experiment with pair programming. And it’s a really interesting experience every time, because for a seasoned developer, that seasoned engineer, typically they’ll come back and go, just let me do it. Staring at a screen with somebody else? Just let me get my headphones on, let me bash the keyboard, and I’ll get it done for you. But when they’re open to experimenting with it, and they do it for a while, typically what we get back is: actually, having somebody sit next to me made me think a little bit more about what I was doing and why, because we were talking; they asked questions that stopped me going down rabbit holes. Normally I’d go down a little sideline for three or four hours, and often the person I’m working with goes, maybe we don’t need to go there, or we could just do this. So they get some value back by having a peer. And we all know that when we use that approach, the quality of the code that comes out is much better, for various reasons. Do you see that happening in the data science world and the analytics world, where a data scientist and an engineer will actually sit together and work on the same code to create those features?
Shaun McGirr: It’s an extraordinarily good idea. And it’s one that organizations are just engineered to be allergic to, because of this idea that you could get more value faster by having what, to the bean counter, looks like two people doing one job. This is not how any organization understands how it creates value. Personally, when I’ve been able to achieve that, it is easier actually sitting physically side by side; it is tougher over a shared screen, because then you’ve also got all the other distractions of having five monitors and notifications. But the times when I have worked side by side with another data scientist, or an analyst, a visualization type, or an engineer type, all those benefits have become clear to me. One thing that I always wanted to do when I was leading teams, and never got a chance to do, but I’ve seen a few customers push it this far in Dataiku, is to have those cross-functional teams working on a problem. One of the things that makes creating the time for pair programming difficult is that all of these data projects are such uncertain bets, and because we have to promise so many things to so many stakeholders, we spread ourselves thin. We’re all working on too many things at once, and so we can’t even find the time to do all this work together. So what I always wanted was to reach a point where the thing to build was so well defined that you would take an engineer and a data scientist and two analysts, make them a team, and say: solve this problem, and you have no other job while you’re trying to solve it. In that context, that version of Agile would be revolutionary, if we could ever get data teams to that level of clarity on what they’re trying to build, and that level of insulation from BAU and multiple priorities and waiting on data.
Shane Gibson: Yeah, exactly. One of the things I’m really passionate about is that Agile is a bunch of patterns, and those patterns may or may not have value to the way you work. While I like a lot of the Scrum stuff and the flow stuff, to me they’re patterns. So what I encourage teams to do is find they’ve got a problem, then find a pattern from the Agile world that may solve it, experiment with it, see if it fixes the problem, and then go on to the next one. That’s great, but often we need to start somewhere. So I will often encourage a team to start off with a more Scrum-centric way of working. I articulate that, in a rather shallow way admittedly, using those terms, as: we are doing small batches. We have a team that is dedicated to doing one thing, one data product, for a period of time, and we’re trying to reduce the cycle time for that batch down to two or three weeks. Ideally, we want to move to more flow-based ways of working with data, because it’s more natural for the problems we have to solve. If we think about that from a data point of view, we’ve started to see traction of Scrum being used in data teams: to copy the Spotify term, a squad of people that are cross-functional and T-skilled, who can collect the data, combine it, consume it, present it. But we don’t tend to see that in the data science world; we tend to see the data science teams sit outside that team. And one of the reasons, I think, comes back to that comment you made about knowing what you need to build. When we have a team building a data product or an information product, we’re getting better at understanding what it looks like, what the output is: we look at the dashboard, and it’s got these KPIs. We still haven’t always figured out what value we’re going to get from it, though ideally we would have, but we know what it might be. And we’re getting good at reducing the scope. We’re saying, “Well, look, it doesn’t have to boil the ocean”.
It’s not all the data; it’s just a bit about a customer and a product and an order. So we can reduce our scope down and get it within that batch. But from an analytics point of view, we’re working on a hypothesis. It’s a bigger problem; there is no one set of requirements. There’s a theory that I can grab all this data, and if I use some kind of model, it can recommend to the users what they should do next, what movie they should watch next. And that’s all hypotheses: we don’t know whether the data supports it, we don’t know whether we can create a model that does that, and we don’t know whether that model actually has value, whether the user would find it useful or annoying. So we’ve got a whole lot of hypothesis testing. And for me, that’s why having T-skilled teams in analytics, and micro-batching the work to be done, is a lot harder. What’s your theory on that?
Shaun McGirr: Yeah, completely with you. Every time I’ve tried to Scrumify data science work, you end up with a whole bunch of tickets, or whatever your construct is: explore data, fix data, get more data, explain data. It’s so easy to generate a series of discrete steps that still lead absolutely nowhere, because the actual steps taken to test that hypothesis would typically be much simpler. I think the deficiency is in having organizations form hypotheses to begin with, because a hypothesis is doubt. A hypothesis is uncertain. A hypothesis could be wrong. A hypothesis can’t be promised to be true. So if you have data scientists and you set them out on these ambitious problems to solve, you have to accept that in many or most cases the answer will be: “There is no solution that’s useful and feasible and implementable”. A big part of the work of data science, of any science actually, is to disconfirm, to disprove hypotheses. And investing in being wrong is difficult for everyone. Over time, what I’ve settled on is that you either need to split data scientists into two groups, one who actually prefer the world where they test hypotheses and are mostly wrong, and a different group who take the ones that have been somewhat validated and turn those into products; or you have to have the same individuals be able to switch between those two kinds of work. Depending on the team size and the people that you have, one of those will be the better way to do it. The way I visualize that today is: you need a small cog, turning very rapidly, to very quickly invalidate bad ideas. Everything that survives has to go on to a much larger, slower-turning cog, still powered by the same little cog, but turning much slower. That’s when the ideas that survived need to be implemented, and that’s when it’s mostly politics: getting other people to do stuff, enlisting engineers, making things robust.
So that view of the world is a little bit inconsistent with the view that others have, which is that every data science experiment you start should be built to production grade from the very beginning. At times I’ve believed that, but I just don’t think it solves the problem you outlined: that you don’t even know what is worth building out of data science in the first place.
Shane Gibson: Yeah. In the Agile pattern world, we have this idea of a research spike, for things with a high level of uncertainty. There’s no point going and actually getting the team to work on something until we have more clarity on what it might look like, what it might take, how long it might be, some sense that this is worth investing in. So I definitely agree with that idea of bringing research spikes into analytics, where we have a hypothesis and it’s a high-risk hypothesis, so we need to do some early work up front to see whether it’s got legs. We also say fail fast, but nobody believes that, because when you fail, everybody jumps on you. Most of those research spikes should fail, because you’re starting to explore things with a high level of uncertainty. And then once it looks like there’s some possibility that this thing could have value, that it could be done, it goes into a second round, which is, in my view, different people, because of the different mindset. Now I’m saying: we know this is possible, but it’s still hard. How do we do it? And I’d probably add a third one, which is just turning up in the market, a reinvigoration of things that happened 10 or 20 years ago, as most things are. It’s this idea of analytical models that have been done time and time again in other places, so the patterns are known. If I take the concept of a recommendation engine, that’s a pretty well-known pattern now, though the data is always different. We did an experiment in our startup: we’ve got a data catalog, a bunch of tiles where you can see the tables that are there and the data that exists. And I wanted to see how easy it would be to create a recommendation engine under the covers, based on who viewed the data, or who used the data, and what you look at. It would recommend things, effectively copying the model from Netflix, but for data.
So we did a research spike on that; we call them spikeys. And we thought this model was in the third cog: it was a proven model with lots of patterns. We’re on the Google stack, so there was a product called [inaudible 00:27:07] that had a recommendation model for shopping carts, and it was about as out-of-the-box as a model is ever going to get. So we did a wee spikey on that, and it killed us, because that model was so tailored, the implementation of it was so tailored, for a user sticking items in a shopping cart.
Shaun McGirr: Let me guess, ecommerce website with product description text? Maybe images with colors in them.
Shane Gibson: Yeah. But think about it: for me to apply that model, or for my team, because I didn’t do it, it was like: we’ve got a tile. That’s just a product. And we know somebody clicked on it or viewed it or queried it, so that’s just some kind of feedback that it was valuable. All the model said was: there’s a thing, there’s some feedback on how often that thing gets used, and therefore you look like somebody who would use that thing, so I’m going to recommend it. But it didn’t work for us. The key thing for us, as part of a spikey, is we time-box it [inaudible 00:28:03]; we say we have this amount of time to prove it has value or it doesn’t, and we know right now it doesn’t have value for us, so we didn’t do it. So for me, even with those reusable ones, you’ve still got to do the first little cog: if I take that third-cog model and apply it, is it going to work? What’s the risk level? What’s the value? And then, again, are we going to invest in it? I think we’re going to see more and more of those third cogs, and then we’re going to see people who specialize in taking those known patterns and applying them. And that will become, because we’ve seen this before, 20 years ago, industry models, or domain models, or those kinds of things, and people will start to apply those.
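[For readers: the catalog recommender Shane describes, “people who viewed this tile also viewed”, can be sketched as simple co-occurrence counting. The view logs and tile names below are invented for illustration:]

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical view logs: which user viewed which catalog tile.
views = {
    "alice": {"sales_table", "customer_table"},
    "bob": {"sales_table", "customer_table", "orders_table"},
    "carol": {"customer_table", "orders_table"},
}

# Count how often each ordered pair of tiles is viewed by the same user.
co_views = defaultdict(int)
for tiles in views.values():
    for a, b in permutations(tiles, 2):
        co_views[(a, b)] += 1

def recommend(tile, top_n=2):
    """Tiles most often co-viewed with `tile`, most frequent first."""
    scores = {b: n for (a, b), n in co_views.items() if a == tile}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("sales_table"))  # -> ['customer_table', 'orders_table']
```

This is the generic pattern; the point of the spikey story is that an off-the-shelf shopping-cart implementation of it was too tailored to its original domain to transfer cheaply.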
Shaun McGirr: We talked about that five years ago, didn’t we, Shane?
Shane Gibson: Yeah, I think we talk about it on a regular basis.
Shaun McGirr: One day, someone’s gonna define the meta churn model: the customer churn model, the customer lifetime model. I think you’re right that the patterns on how to do it are clear, and the underlying mathematical algorithms are clear. We have the computation now to do it all easily. What’s different across the cases is: what does a zero mean? What does a one mean? What does a 999 mean? What does bad data mean? And I think once we can use AI to solve that problem, then we’ll be cooking.
Shane Gibson: Yeah. So one of the things I’d always recommend people do is, when they’re talking to their analytics team or their data scientists, be very clear about which of those three cogs you’re asking them to turn.
Shaun McGirr: But who is this person talking to the data scientists, who is supposed to know which cog they’re asking the data scientists to turn?
Shane Gibson: Well, let’s go into it, because you used the words Chief Algorithms Officer, and I was like, oh my God, I’ve heard this one before; it’s starting to come out. It almost set me off on my rant, which is about the CIO, the Chief Information Officer. The key there was information. It’s not the CIT, it’s not the Chief Infrastructure Technology Officer; it’s about information. But because that role never delivered, for whatever reason, we then came up with Chief Data Officer, and now we’re going into Chief Algorithms Officer. So since we’re talking about roles: what the hell does a Chief Algorithms Officer do?
Shaun McGirr: That’s a great question. There aren’t that many of them. I think Stitch Fix, who I mentioned, was one of the first to have one. And one of the good reasons to have one at a company like Stitch Fix is that its whole business model is built around algorithms that select which clothes to stock. They build a bunch of clever human-in-the-loop stuff, so a human stylist would select clothes for people, but their recommendations would be generated by algorithms. So they put algorithms everywhere, and that’s what runs their business. And if that’s what runs their business, that sounds like a perfectly reasonable job title to have. But if that’s not what runs your business, if what runs your business is machines and manufacturing, or money, or anything a little older than the 2010s we were mentioning before, then I don’t know if you need a Chief Algorithms Officer. And that does raise the point: if you and I both agree that there are two or three cogs, whose job is it to know which kind of work fits in which cog? I don’t think people out in “the business”, which is not a term that we should all love, the people who need problems solved, should or could know which of the three cogs needs spinning for which kinds of problems. I don’t think data scientists are particularly good at knowing which one either; they like to go down rabbit holes. So something I’m seeing emerge is the analytics translator, the data product manager: people who can actually straddle the world of what is actually going to solve some real problem, but who know enough about the work that they can help the right people get on the right cogs and spin them.
Shane Gibson: Yeah, there’s a term out there that came out of Weiner from data.world, who talks about the knowledge scientist, which is that translator. And I love the idea, hate the term. I think what we’ll end up seeing is analytics product managers, or maybe algorithm product managers. Or algorithm product owners: they will be the person that sits between the team that’s doing the work and the executives that are funding it, to describe the tradeoffs and make that value call. We see that with data products and a data product owner. They say: I can invest the team’s time in ‘Product A’, or I can invest the team’s time in ‘Product B’. We’ve done some initial work, we kind of understand potentially how big each is and how much effort it might take, and we understand what it might be used for, so we can put some form of value on it. And I’m gonna make the tradeoff decision, because somebody’s got to. And when we talk about those three cogs, I think we’re gonna start seeing some form of algorithm product owner that goes: oh, this is actually cog one. This is a hypothesis with a high level of risk. We need to time-box it; they’ll make the call that it’s worth a go at cog one, and this is when you run out of time. Or they’ll go cog two: “Okay, we have a little more surety, this is good. Let’s go, let’s invest in it. It’s going to be a bit rocky”. Or eventually cog three: it’s a proven pattern, just grab this, put it in, get the value, and let’s move on to the next one. I mean, we have to come up with better terms than cog one, cog two, cog three, because it sounds like Dr. Seuss to me, but I’m sure somebody out there will make up better words for classifying them.
Shaun McGirr: I think the missing link there, for me, is that product word, which you and I discussed a lot, five-ish or more years ago. It’s great to have seen a little bit of maturity come to that term in data in general, and through that it will, and it should, eventually impact data science and this world of algorithms as well.
Shane Gibson: Yeah, we might almost see the term analytics as a product, though the problem is it starts off with the words “as a”. So it’s interesting with that Chief Algorithms Officer, though: I’m assuming that person’s sitting at the top table in their organization.
Shaun McGirr: In that organization, yes.
Shane Gibson: So for me, that’s the key. Whether you call them a CTO or CMO or CEO or CIO, whatever, the person who’s helping with data and algorithms is sitting at the top table. So they have a voice, they have money and people to make things happen: either they own the team, or they can control what’s done by the team that somebody else owns. So they have the ability to execute, and they are empowered to make the value call on behalf of their other C-level peers. Obviously they’ve got to talk to them and all that kind of stuff, but the CFO typically makes the decision about when finance needs a new finance system, or more finance people, or a new portal for procurement, or whatever. The person that’s got the C-level role for data or analytics needs the power, the mandate, to do that as well. And I think that’s one of the things we look at in a large organization to see how data-driven they are: who’s at the top table for data and analytics?
Shaun McGirr: It’s a great proxy: who’s at the top table, who’s got that mandate. And then, given that that’s almost never true, how far away from the top table is the person with the credible view of data? I see a range of people in my work. I see people who shouldn’t be that person, who have almost no team and almost no influence; they’re some kind of cheerleader. And then I see people who do run a whole organization, but they’re not high enough up the organization, and often that’s just because data is not actually that important to that organization. I think it’s still rare to see a Chief Data Officer, Algorithms Officer or Information Officer who is truly as influential and accountable and trusted as those other C-level execs.
Shane Gibson: And it's quite interesting if you put the 2010s cool-kids lens on it, for companies that were started in that era and are data-driven. Just thinking about it, I wonder how many of them are that way because one of the cofounders is a data person and they're at the top table by default. So they're driving the vision of the company to say data is more important than most other things, and that's why they behave that way. Whereas for companies before their time, the hierarchical fiefdoms have already been created, and therefore there is nobody at the top table, because that's not how the organization was created; that role didn't exist 20 years ago.
Shaun McGirr: Yeah, I agree completely. And I talk about it often with large enterprises and government departments, who keep saying that they want to be data-driven. Even CEOs say that, but they leave the pesky details of how to other people. And those people don't have a mandate. And then someone starts a data literacy program, and a bunch of people learn how to read data, but they have nowhere to apply it, no time to apply it. And even if they did, what would change about the organization? So unless you've got a shining example at the top, who is constantly changing behavior and the direction of the company on the basis of data, in a way that meets all those quality and rigor standards that we like, it's really hard to see where it comes from at an organizational level.
Shane Gibson: Yeah, and we see the same with Agile, outside of the data space. When we work with teams at the bottom and they start adopting a new Agile way of working, most times they are successful, depending on what you describe as success. But then we see an organization try to do an Agile transformation where the goal is "become Agile", not "change the way we're working, and Agile patterns might help us be a little bit better". The CEO is not coming from that background, so they don't meet with their peers on a daily basis to check in on how they're going, they don't plan together and all that kind of stuff. They're not Agile, so the organization never will be. And so I suppose it kind of comes back, with data, to that Tom Cruise movie, "Jerry Maguire" I think it was: show me the money. If you're in an organization where somebody at the C level is yelling at you "show me the data", then you've got a chance of being data-driven. If they're not doing it, then it's hard, because the organizational culture is not about data.
Shaun McGirr: Just on your point of bottom-up versus top-down Agile, what's your experience with SAFe?
Shane Gibson: The way I describe it is, I am un-SAFe. So while I try to keep an open mindset, as all Agile people should, I have a closed mindset when it comes to SAFe. I see it as a prescriptive methodology that doesn't help you describe a way of working; it gives you a structure that looks like your organization with slightly different labels and names on the doors.
Shaun McGirr: I've had limited experience, but it does come up sometimes in my work, where people ask us at Dataiku to explain how our product, which is just a data science and machine learning platform, will fit into a SAFe organizational architecture, and it can be quite challenging.
Shane Gibson: Yeah, as we know, and we probably should talk about that on another podcast. My prediction is this year and next year you're going to be asked how it fits into data mesh just as much as SAFe. But to go on to a different subject: you've talked a lot about what happens when the first data scientist turns up at a company. One of the stories you tell is that the data scientist turns up and there's no data, so should the data scientist leave the company until data turns up? But on some of the other podcasts you do, you also talk about the scenario that I love, which is the data scientist turns up and starts hassling everybody in the organization to send them that spreadsheet. They start crowdsourcing the data that does exist. They're not putting it in a warehouse, they're not going and doing automated data collection, they're just getting somebody to send it to them, and number-eight-wiring it for a while. And one of the things I struggle with is that chicken-and-egg scenario. If we wait for the data, then we don't know what the hypothesis is, we don't know what that small cog is we're going after. But if we have somebody come in and there's no data, then they have to have that engineering capability. They have to be able to grab whatever data is there and make it usable before they can go on and prove that small cog. So I don't understand why the model isn't: you hire a data scientist and an engineer and they sit together, a pod of two, a squad of two. That's de facto what it should be. You're going out and harvesting data that's never been harvested before, in a semi-ad-hoc way, but you're trying to automate as much as makes sense on day one.
You're not doing a Sprint Zero where you're building an analytics platform for 12 months. You're grabbing a bunch of data in a couple of days or weeks and building something that is semi-reusable, not really, but if it has value you'll do more work on it and make it more automated. What's your view on what's happening out there? Is the pod of two something that's going to happen? Are we going to start saying you can't have a data scientist until there's a squad that does everything? Or are we going to stay with data scientists don't turn up until the data warehouse or modern data platform is in place?
Shaun McGirr: Within Dataiku, we talk to companies of all sizes about whether they have an ambition to apply data science at scale, and to have lots of different kinds of people at different skill levels do that. One of the most common objections to that kind of scaling or democratization of data science is: well, we first have to get our data lake perfect, and then we will think about that. So there's one common response, which very much fits the pattern of get all the data, get it all right somehow. I've never seen perfect data, but it's a very nice dream to have, and then we'll think about interesting things to do with it. So that's one end. At the other end, it's to hell with any reusability or structure, we're just going to let a whole bunch of individuals loose. The truth is always somewhere in between. What I see in the teams that I work with, advise, and ultimately try to convince is that there's still a huge amount of functional separation in those data teams. Maybe it's because I'm in the UK now and the teams are bigger, so there's more refuge in just talking to your data engineering colleagues. But it's just a tiny handful of our customers and prospects I've met who are doing something as radical as having an engineer and a data scientist work together in that way. Almost everyone is still running: I have a Data Engineering Lead, a Data Platform Lead, a Machine Learning Engineering Lead, a Data Science Lead, and some other people who might help those people talk together. There still seems to be a preference, at least in this market, for different teams doing different work, and sometimes using radically different tools to do it. And I understand where that comes from, but it can't be the future.
Shane Gibson: So it's interesting. Scaling is hard; I don't think that, I know it. One team of five people can rock it, but as soon as you start to scale, we have to specialize. Our natural reaction is often to specialize based on the process; we become a factory. One person puts the diode together, then the next team puts the diode onto the circuit board, and the next one solders it. That factorization is a known process, and I don't actually have a problem with it, because it's a flow- and lane-based process. From an Agile point of view, though, we focus on different things. We focus on how that diode is handed off to the team that puts it on the circuit board. So in our case, we focus on how the engineering team hands off the data to the analytics team, and then on how we deal with the uncertainty. Because in a factory we know where the diode is and we know where it goes on the board; you're not randomly choosing a new place.
Shaun McGirr: Whereas we’re making a million of them a day.
Shane Gibson: Yeah. And that's not what we tend to do with data and analytics at the moment. So the key thing for me is: you focus on how you're scaling, how you're specializing a bunch of those teams, and then you focus on the friction between them and reducing it. But what often happens with large organizations is that we're retrofitting. We try to take these new ways of working and retrofit them onto a bunch of people that already exist. So we start off with a scaling problem. We've got 20 people in our data team, and 15 in our visualization team, and 5 BAs in the BA team, and 16 data scientists somewhere, and now we're trying to retrofit a new way of working onto them. And that's hard. So my answer is: treat it as cog one, it's a hypothesis. Grab a couple of people from each of those teams, put them together, and tell them the actual goal is to find a new way of working. And to do that they have to build some analytical products. The analytical products themselves may work or fail; that doesn't matter. It's the way of working that they're experimenting with. That's the hypothesis. And then once it works, ask: how do we apply this again and again? How do we scale it? That's what I recommend people do, with data teams as much as analytics teams; for some reason we treat them as two different things. It's amazing how quickly 15 minutes goes. So before we close it out, anything else top of your mind around Agile and analytics?
Shaun McGirr: It's funny. One of the presentations I gave at Optimal, maybe six or seven years ago now, I think had that title, "Agile Analytics". I wheeled it out recently when someone on Slack at work asked whether anyone had recommendations for how to organize analytics teams, data teams, and data science teams according to Agile principles. The thing that our prospects and customers really want advice on is what they call operating models and organizational change: how do they become more data-driven, more AI-driven? Very few of them ask us about what we've been discussing today, organizing the work of the teams and how they actually do it, which is fascinating, because how people do their work is to me the most interesting question. I think the technology has come a long way in 5 or 10 years; a lot of things that were technologically quite difficult in terms of handovers and other things, that friction is reducing. So I think the teams that do crack those new ways of working are going to have some kind of edge and are going to deliver more value. But it is interesting that this discussion we've had is not so much one that I have in the industry with our customers. I think that might be because when you're working for a software vendor, you're talking to the kind of data leader who's making a procurement decision, and they're looking for some piece of a technological, architectural puzzle. How you do the work day to day, I don't know how much thought people are giving that.
Shane Gibson: Yeah, though I don't agree it's only software vendors that cause this problem or have this symptom. I'll give two examples to back that up. One is, I'm still side-hustling consulting with big companies to bootstrap our startup. So I'll do either an engagement where I go in and help the teams, or an engagement where I create a blueprint: not a 500-page strategy that nobody ever uses, but a smaller document that has the components you need to get started, and one of those is the operating model. And to understand that, I typically get the team that's currently doing some work to explain to me the data supply chain. What are the steps you take from the beginning to the end? Then we have a conversation about how much of that needs to be centralized or decentralized, based on the size of the organization and the way you want to work. So the operating model is really, really important. I'm doing a side hustle at the moment where we're evaluating and implementing a massive modern data platform. And on day one, I said the first thing we need to do is understand the operating model and the supply chain, and we haven't done that. So now we're putting in tools and technologies without understanding who's doing the work and how they're going to do it. That's high risk; that's example one. The second one is, if you look at all the stuff that's published, even from the 2010s cool kids: they publish their platforms, they publish their architectures, they publish their code, but they rarely publish the way they work and their supply chain. Spotify did it and we abused it. The market called it the Spotify model, and McKinsey went and made it their bread and butter.
And so Spotify are no longer sharing their way of working with us, which is gutting, because they invested a lot of money in cog one and hypothesis-testing ways of working, hypothesis-testing the organizational structure of their data and analytics teams and their supply chain, and they were sharing it with us. And we did bad things with it, so they no longer share. But I'm with you. We don't talk about that; we talk about technology. And it's not just software vendors, it's everybody.
Shaun McGirr: Yeah, it's interesting. Something you taught me a long time ago was that whatever you're talking about in this data space, you can boil the work to be done down into three steps: get the data from somewhere, do some stuff to it, and put the result somewhere that matters. And people forget that, and they go off and buy some things, and they've never even looked at that supply chain, which is just a set of operations: get data from somewhere, do some operation on it, put it somewhere. Even if it's an algorithm or some kind of fancy AI, it still fits that pattern. And there's both the supply chain end to end of those operations, and then within each of those operations there's more getting of data, doing stuff, and putting it somewhere. A little bit of analysis of how people currently do that, and how they might want to do it, goes a long way.
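[Editor's note: the three-step pattern Shaun describes can be sketched as plain functions. Everything below, the function names, the in-memory source and sink, and the trivial "doubling" transform, is illustrative only, not taken from any real platform or from Dataiku's product.]

```python
# A minimal sketch of "get the data from somewhere, do some stuff to it,
# put the result somewhere that matters".

def get_data(source):
    """Get the data from somewhere (here, just an in-memory iterable)."""
    return list(source)

def do_stuff(rows):
    """Do some operation on it. Even a fancy algorithm still fits this slot;
    here we drop missing values and double the rest."""
    return [r * 2 for r in rows if r is not None]

def put_somewhere(rows, sink):
    """Put the result somewhere that matters (here, another list)."""
    sink.extend(rows)
    return sink

# The supply chain end to end is just these operations composed.
source = [1, None, 2, 3]
sink = []
put_somewhere(do_stuff(get_data(source)), sink)
print(sink)  # -> [2, 4, 6]
```

The point of the sketch is the shape, not the code: each of the three slots can itself expand into more getting, doing, and putting, which is the nesting Shaun mentions.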
Shane Gibson: Yeah. And nowadays I think I'd say there are five steps. Those three are the factory; that's doing all the work. But I'd say there's a step before and a step after now. The step before is prioritization: you've got a bunch of constraints in terms of the people, so how do you decide what work should be done now and what you're going to hold off on doing? And the last one, at the end, is proving the value was delivered and the action was taken. So how do you know that investment decision you made was worth it? Those two things need to come into that workflow now. And I think we should call it a day. So thanks for your time, Shaun. It's been awesome talking to you as always, and I really enjoyed talking to somebody that's got hands-on experience working with analytics and with teams, and somebody that's had an Agile mindset from day one as far as I'm concerned. So it's been great talking to you.
Shaun McGirr: Likewise, and we'll have that data mesh discussion one day.
Shane Gibson: We will indeed.
PODCAST OUTRO: Data magicians, that was another AgileData podcast. If you'd like to learn more about applying an Agile way of working to your data and analytics, head over to agiledata.io.