The patterns of Activity Schema with Ahmed Elsamadisi
In this episode of the AgileData Podcast, we are joined by Ahmed Elsamadisi to discuss the patterns of activity schema.
The activity schema is a data modeling pattern for modeling and transforming data.
They discuss:
- Ahmed’s Background in Data: Ahmed started in self-driving cars before moving into AI for missile defense and then to WeWork, where he built the data team and encountered various data modeling challenges.
- The Centrality of Data Modeling: Ahmed stresses that data modeling is key to managing scalability in data systems. He observes a common trend in the industry of constantly revising data models, which often don’t scale effectively.
- Frequent Data Stack Changes: At WeWork, Ahmed experienced multiple data stack changes, indicating a common challenge in the industry of finding an optimal data stack setup.
- Introduction to Activity Schema: Ahmed introduces the concept of Activity Schema, a data modeling paradigm designed for scalability and simplicity in data handling.
- Evolution of Data Queries and Systems: The podcast discusses how the nature of data queries has evolved, requiring more sophisticated and nuanced handling of data from multiple systems.
- The Limitations of Traditional Data Models: Ahmed points out the limitations of traditional data models like star schema and data vault in handling modern, complex data queries.
- Simplicity of the Activity Schema: The Activity Schema simplifies data modeling by focusing on user activities as the central element, allowing more intuitive and scalable data handling.
- Handling Complex Data Relationships: The podcast explores how Activity Schema allows for more flexible and efficient handling of complex data relationships and temporal data queries.
- Real-World Applications and Examples: Ahmed shares real-world examples from his experience, highlighting the practical benefits and applications of the Activity Schema in various business contexts.
- Future of Data Modeling and AI: The conversation touches upon the future of data modeling, the role of AI in data analytics, and how new approaches like Activity Schema are shaping the industry.
Listen on your favourite Podcast Platform
| Apple Podcast | Spotify | YouTube | Amazon Audible | TuneIn | iHeartRadio | PlayerFM | Listen Notes | Podchaser | Deezer | Podcast Addict |
Recommended Books
Podcast Transcript
Read along you will
Shane: Welcome to the AgileData Podcast. I’m Shane Gibson.
Ahmed: And I’m Ahmed Elsamadisi.
Shane: Hey, Ahmed. Thanks for coming on the show. Today we are gonna go into another one around data modeling, this time activity schema. And I’m mega excited because I’ve read a bit about activity schema and had a little bit of a play with it, but no way have I been in the weeds.
I’m looking forward to understanding what the pattern is, when we should use it, more importantly when we shouldn’t, and diving into it. But before we do that, why don’t we get a bit of background for the audience about you, how you got into this lovely world of data.
Ahmed: Awesome. So I actually started my career in self-driving cars in 2010, before it was cool, and got to spend a lot of time focusing on AI and decision making. I moved my way into AI for missile defense with the US government, and then eventually joined the startup WeWork, built out their data team and watched it scale.
So I got to see data used in all sorts of different ways, from mission critical applications to business decisions. And what I found was that having implemented the system at WeWork, we went through maybe four or five data stack implementations. We kept running into issues and couldn’t find any good solution.
So what we realized was that everything was always coming down to data modeling. And data modeling is always the problem for why data scales terribly. And I think a lot of people nowadays are beginning to know that, with dbt publishing an article about how their data modeling didn’t scale and so many things of that nature.
So I’m excited to show this paradigm that we’ve been using, that a lot of our customers use, and it scales really nicely, and hopefully more of the world can experience it.
Shane: We know that bringing back data modeling is kind of one of the themes this year, make data modeling cool again. For me, you’ve come from working on self-driving cars and working on rockets to data modeling. How cool is that, right?
You come from being a rocket scientist to a data modeler, that’s a progression.
Before we jump into activity schema though, and we tend to go all over the place on the podcast and that’s all pretty cool, but we always come back to the core theme, that idea that at WeWork you went through five data stack changes, talk me through that.
Was that because there was new cooler technology out? Was it because the first four didn’t work? Was it cuz you were just iterating? Cuz that’s a fairly expensive change, depending what you mean by changing your data stack.
Ahmed: Yeah, so I feel like a lot of people actually go through the same flows. We had a Postgres database and a cron scheduler that would schedule our queries and materialize them. And we had Chartio for visualization and we’re like, this seems great. And then we just had so many things scheduled, answering questions was really hard, and the company was like 200 people then.
So it wasn’t really that many questions, and we’re like, aha, I’m gonna upgrade my database. I’m gonna switch from Postgres to Redshift, and from cron to Luigi, which was popular at that time. So we used Luigi and then we had Chartio. And still more and more data models, and we still can’t answer questions, and we’re like, ah, I know what it is.
It’s not Chartio. We need Tableau. So we got Chartio and Tableau, and we’re like, that’s gonna enable us to answer more questions. It didn’t. So we’re like, okay, it’s probably this. Instead of Luigi, I switched to Airflow. So we had Redshift and Airflow, and instead of Chartio, we switched to Looker.
We had Looker and Tableau and we’re like, that was gonna help. And then we’re like, people are still yelling at us and numbers are not matching. So we added a data catalog and we’re like, that’s gonna solve the problem. We added a dictionary, we added some lineage. Still the numbers don’t add up.
Sometimes people have so many questions, just like the backlog, and we’re like, aha, it must be Snowflake. So let’s switch from Redshift to Snowflake. And then we’re like, fuck Airflow. Let’s build our own custom scheduler that does our own things. So we built something called WeModule and Data Manager. Oh my God.
And then we had Looker, Tableau. And then we had another tool, somebody bought Wave Analytics, and it’s just so many tools, like it was just so many tools, and we had a 45 person team and we still had the same problems. People would have questions we couldn’t answer. Oh, we also bought Heap and Amplitude to help, and still people had so many questions that couldn’t be answered.
Everyone was complaining. A lot of the dashboards didn’t add up and it was just a shit show. So we’re like, why is that the case? So we actually went on a tour, and this is an interesting thing, I went on a tour. I talked to a bunch of big companies, the Airbnbs and Netflixes and Spotifys, and I was like, how do you guys deal with this?
And it turns out every company is in the same boat. Everyone is just refactoring the same thing, and every time you refactor it seems great for the first couple of months, you’re like, oh my god, look what I just threw away. And then you refactor again, or you buy a new tool, and you just move over all the stuff that’s already working.
So every tool appears incredible for the first couple of months cause it doesn’t have all the shit yet. So that was the exciting part. And all these companies were just like, yeah, that’s the job. The job of data engineering is to manage the shit show. That’s it. There’s no solution.
You just try to minimize it, and all these tools are helping you manage this growing chaos problem. And what I wanted to do is ask, is there a system that’s designed for this? Why does the chaos of data continue to grow? Understanding what causes it, and then understanding what part of this whole data stack is the real problem. Cause as you dive into it, you realize that with Snowflake and Redshift and BigQuery, the slight differences don’t really matter. People love talking about it, it’s a millisecond faster, I don’t care, I have these things being materialized. And then people talk about the query editor and whether it syncs to GitHub or embeds Jinja.
It’s okay, but it doesn’t really matter. It’s still compiling down to SQL. So I’ve used all these different tools that haven’t really changed our approach. I use dashboards in Looker, whether it’s purple or Chartio or Tableau, whatever their color scheme is, it doesn’t make the solution better. So we’re just constantly using new tools and using those tools as an excuse to refactor, but this fundamental approach doesn’t work, at least from what I’ve seen. I just couldn’t find a single person who was like, yep, a year into my stack and it’s great. Everyone gets a year in and they’re like, I wanna refactor and blow this thing up because it sucks.
So that’s where my whole life became, okay, I need to really dive into this problem.
Shane: Yeah, it’s really interesting, isn’t it? It’s interesting as data people and as technologists, we love to refactor our stack, we love to refactor the tools. We love to invent stuff, not invented here, but we hardly ever refactor our way of working, we sit back and we watch our data flows,
we watch the data move and we go, oh, there’s a bottleneck over there. We’re gonna go and, we’re gonna go and refactor that bottleneck. But we hardly ever sit back and watch the way our team works. We don’t , treat humans as part of the system and go, where’s the bottleneck?
Where’s the bleed? I was lucky enough to work with a guy called Sean McGurk many years ago in my consulting company, and he now works for DataIQ. Before that, he was running an analytics team over in the UK, and he told me this story.
He went to a conference in the UK, a meetup, stand up, whatever. On the stage was one of the FAANG companies, can’t remember which one, Airbnb, Amazon, one of those big companies. And he said to them, how do you deal with the data problem?
And the person on the stage said, what do you mean? And he goes, when you wanna find some data, you’ve got so much of it, how do you find it? And the data scientist said, that’s easy. I just go into Slack, I log a call, and 15 minutes later the data team have got me the data. And so they had hundreds of people
that served the data to the person that needed it. And they scaled with bodies, not with ways of working and not with systems. So for me, that’s fine if that’s your business model and you can afford it, but for most of us, we can’t. And it’s not overly efficient. And if you think about those five stacks, at what stage did you bring in data modeling?
Ahmed: First of all, the story you had is such an incredible highlighting of the problem, which is even the greatest, most advanced tech companies that sell you the software to do all these things, they just throw people at the problem.
It’s like, whatever, I’ll just have people go deal with it. So that’s just kind of an interesting insight. But the question I think is always gonna be why. Why is it hard to find data? I have a thousand tables, so which table do I use?
Why do you have a thousand tables? The idea of activity schema, which I’ll bring it back to, is what if you had less? If you only had one table, you would say, where is the data? In that one table. Problem solved. Cool. If your data was standardized in its structure, if you were like, what data is in the table?
The structure is consistent, so you never have to ask, what does this column mean? Once you have one table, then data lineage goes away, the data dictionary goes away. Cuz the table should be self-explanatory. It’s consistent. Instead of going and messaging someone to ask, where do I find data?
It’s I only have one place to look. So having the concept of one table is really nice because it makes things so much simpler for processing, for understanding, for finding, for the way that humans actually deal with data. Now the question is, why doesn’t everyone just build one table?
That seems like a simple solution. And I think to understand that, we have to go back to why we build multiple tables, and more interestingly, why we build them the way that we currently do. That was like my obsession. So the reason why you build multiple tables is you have different questions. So you’re like, okay, cool. You have to answer the questions and therefore you need data. But you might think, why don’t I have one table for every concept that I have? Cause the company doesn’t have many concepts. And then all I have to do is combine those tables and I’m probably solved. And I think therein lies the problem. Combining those concepts into tables that are used to answer questions, that’s what like 95% of your tables are. They’re combinations, and with the way people ask questions, there are constantly new questions that relate data differently.
And because you’re constantly asking questions that are unique, you’re constantly relating data in a unique way, which means you constantly have to build a new table to represent that data. So the follow-up question is, okay, then doesn’t SQL allow me to do that? And this is where the data world splits off.
So the star schema was designed in a world where you had one data system. When it was designed, every company had one Postgres. Data storage was very expensive. Compute was also expensive. You would take your data from one database and build aggregate tables so you could actually visualize X and Y, because you’re working within one system.
You had foreign keys. For anyone who’s listening to this, a foreign key is just a way to join data with itself. And the questions people were asking were, how many comments are on my ticket? And therefore you joined comments and tickets, aggregated, and it makes perfect sense. In today’s age, every single tool has embedded analytics.
No one’s asking your data team how many comments are on a ticket; Zendesk or the ticket tool will give you that. They’re asking more behavioral questions, which is, okay, when people submit a ticket, are they more likely to churn? Or what part of the app resulted in people submitting the most tickets?
Shit. Now I have two separate systems and those systems don’t talk to each other. So you need to write some sort of complex query to combine the data in a way that answers that question. And this is the separation. When people say, I know SQL, it’s like, sure, and every single BI tool can generate SQL.
But the SQL to combine those two sources is very complicated, very nuanced, and it’s never predefined. And that’s what your data team is spending all the time doing, combining that. And because it’s so hard to combine and requires so much SQL and code, you end up saving that table. And that’s why you end up going to a person who knows exactly that table and can tell you, here’s the table to use.
Does that make sense at a high level?
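For readers who want to picture the single table Ahmed is describing, here is a minimal sketch of what one activity stream might look like, loosely based on the published Activity Schema spec. The column names and types are illustrative assumptions, not the exact spec.

```sql
-- A rough sketch of a single activity stream table (illustrative only,
-- loosely based on the open Activity Schema spec).
CREATE TABLE customer_stream (
    activity_id           VARCHAR,   -- unique id for this activity record
    ts                    TIMESTAMP, -- when the activity happened
    customer              VARCHAR,   -- global identifier, e.g. an email address
    anonymous_customer_id VARCHAR,   -- pre-identification id, e.g. a cookie
    activity              VARCHAR,   -- e.g. 'completed_order', 'submitted_ticket'
    feature_json          VARIANT,   -- a few activity-specific attributes (VARIANT is Snowflake-flavoured JSON)
    revenue_impact        FLOAT,     -- optional revenue tied to the activity
    link                  VARCHAR    -- optional URL back to the source record
);
```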
Shane: Yeah, look, there’s so many patterns in there that I just wanna unpack, so where do I start? Where do I start? Okay, so let’s go to the star schema one, because it really intrigues me that the US is the king of star schemas. And actually we see Europe has been more Data Vault modeling.
And what’s intrigued me is, when dbt started out and they did that awful thing, because I’m hyper-focused on terminology. Language is important. If you see a cat and you call it a duck and you keep calling it a duck, it’s still a fricking cat, it’s never a duck, no matter what you do.
So the fact that they came out and they had this term model, which was a blob of code, it’s done so much damage in the marketplace, in the data world. And then when we start moving towards self-service democratization, we know that we lose rigor and that’s okay,
we’re getting a whole lot of new people who don’t know how to do it in the way that we were trained, and therefore we gotta help them apply this balance of self-service and rigor, without making it awful for them. So what we see in the dbt world is they’ve now realized that 5,000 blobs of code is probably a bad thing.
We could have told you that, but that’s okay. You have speed to market on your first hundred and that’s good. So what do we do? We go, oh, you need to model. And for some reason everybody said dimensional modeling is the way to go. But if we look at dimensional models and star schemas,
what were they there for? They were there because our technology was so constrained, we couldn’t just dump all the data in one database and ask a question. We had to make choices about how to optimize it. And the star schema was a brilliant way of optimizing it. People keep telling me, oh, analysts really love star schemas and understand them.
And I call bullshit. Yes, we can teach them that, a dim and a fact and the grain of the fact, and you join back to the dim, and then you’ve gotta do all these horrible joins. But actually, in my view, if you ask an analyst what they want, it’s one big table, just gimme a table of the data. And they go, oh, but then, if it’s customer orders, you can’t count customer. You go count distinct customer and you get the answer.
Oh, they won’t know that. And then I argue, so you’re telling me that actually you can teach them how to join facts and SCD type 2 dimensions and get the right answer, but somehow a count distinct when you’ve got a table with multiple orders for a customer is more challenging?
Yeah. Cool. Bullshit again. The next thing is, we know the first question we always get is a simple one. The first question we ever get is how many, and we know that’s not actually the question. That is an exploratory process where somebody says, give me the answer to that, and then I’ll know what the next question to answer is,
and it’s typically how many or how much. Yeah, how many orders do we have? Or what was the total order value? Then we have some kind of breakdown. Where did it happen? Which store, which product type, which order values above a hundred. So we get these wheres and these bys, and then we’re gonna get the how longs. Oh, how long was it between them placing the order and us shipping?
How long was it between them placing the order and paying? And then the exceptions. Oh, how many people placed an order and paid and we didn’t ship? Actually, how many people placed the order and we shipped and they didn’t pay? Because that should never happen. But, oh, it happens in the data.
And then we get to the last one, which is why. Why did they place the order and not pay for it? And so that story of answering those questions is repeatable, we see it time and time again, because the person doesn’t actually know what the real question they want to ask is yet. They’re using data to explore the problem, get to that question, and then they want to answer it.
So yeah, I reinforce the message you’re giving, that it’s been a problem for a long time.
Ahmed: I wanna add two more things that happen in between and after the why, which is key. Part of the why is they have hypotheses, and the stakeholder’s hypothesis doesn’t care about the order. So they say, did that person receive an email campaign? What was the last campaign they received?
Did they come to our website? Maybe they submitted a ticket in between when it got shipped, and I wanna know what happened there. We can all agree that’s the flow. Now when you look at how we answer those questions, we treat every step like a production system. So we’re gonna write this whole query to do it, and now you wanna add another concept to that query?
Oh my God, that is super complicated, because now you’re combining multiple tables and multiple systems. That query gets really long, and because of the way that SQL joins work, if you’re not careful you can duplicate rows, you can drop rows, and that can cascade and blow up your whole thing.
So I think that process of how we answer it should be the way that we’re talking about it. I ask a question, I answer it, I ask a question, I answer it, and it should feel like that, a matter of minutes. But today we have to go through a whole fucking cycle for what, 20 questions that we know we’re gonna ask in that order.
It would take a data team two weeks per question, in a whole cycle of materializing tables, and now you have to maintain 50 tables and that shit show. So that’s one thing. The second thing that I wanna touch upon, that you also mentioned, is humans and words. Words are so important. We used to have this joke, which is people say, okay, how many sales did I get? It turns out that if you ask the sales team how many sales they get, they care about the timestamp being when the customer signed the contract. If you ask the entire executive team, they care about when the customer started their reservation. If you ask the finance team, they care about when the customer made their first payment. These things are different. So just asking how many sales do we get is really unhelpful. And what we’ve done is we’ve expected so much information to exist in a column, in the three words of a column name. So we see this table named sales and web data. And you’re like, cool. And it has a column, total sales.
But you don’t know what that total sales actually means, because that total sales can be based on first attribution, based on last touch, based on when the sale happened. There could be so many things, and that language issue is such a fundamental issue. So that’s the first language issue.
The second language issue is that the person who’s asking the question is asking it exactly the way you did: did they call, okay, how long did it take them to get to the next order. They’re describing it in a very human-centric way. They’re talking about building blocks, or actions that the customer is doing, and relating those actions. The data engineer or data person that has to answer it has to take that question, expressed in actions and time and relationships in English, and convert it into SQL, which is tables and how do I join them. That was the first insight that led to activity schema. I said, what if we don’t have to translate it?
What if we can actually take a structure that represents the way people talk about data, and what if we can actually relate data in the same way that people think about relating data? Those two aspects are what made the activity schema what it is. The first concept is that you break your world into activities: they completed the order, the order got shipped.
They got a call, they submitted a ticket, they did this thing. The second thing is that you relate data in the activity schema world using what’s called temporal joins. In between the orders, did they call us? Give me the last ticket before they called us. Notice how in those worlds, you never predefine any joins.
You never have to worry about which system it came from. And it’s super analogous to the way that humans are asking questions. And by creating a structure where you can combine data that way and ask and answer questions that way, it just mitigates that cycle, so you don’t have to worry about the foreign keys existing.
You don’t have to worry about weird joins, you don’t have to worry about duplicating rows, you don’t have to worry about dropping rows. And then when you wanna add something, because activities can always be appended, you can add many temporal joins without ever changing the number of rows in the data.
Once they go, okay, I added the in between. Did they get a shipment? Okay, did it get returned? Okay, wait, then before they submitted the order, did they submit a ticket? I can always add as many more dimensions as people ask follow-up questions, and slice and dice that data without ever having to worry about it.
Going back to the original definitions of models, as long as the concept is defined as an activity, you can just use it and be happy.
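To make the idea of a temporal join concrete, here is a hand-written approximation of a "last before" relationship against the single stream sketched earlier. It is illustrative only; the activity names and table are assumptions, and tools built on the activity schema are meant to generate this sort of join for you.

```sql
-- "Last before": for every call, find the last ticket the customer
-- submitted before that call. Assumes the customer_stream sketch above.
SELECT
    c.customer,
    c.ts       AS called_at,
    MAX(t.ts)  AS last_ticket_before_call
FROM customer_stream c
LEFT JOIN customer_stream t
       ON  t.customer = c.customer
       AND t.activity = 'submitted_ticket'
       AND t.ts < c.ts
WHERE c.activity = 'called_us'
GROUP BY c.customer, c.ts;
```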
Shane: So let’s unpack that a little bit. I’m gonna play it back to you the way I understand it, tell me when I’m wrong. So you use the term activity. I could use the term event, or I could use the term core business process, or in Data Vault terms I could use the term link, or in dimensional modeling
it’s like a fact. Again, the example I always use: customer orders product, customer pays for order, or do they pay for product? Good question. Store ships order, or do they ship a product? Good question. Customer returns product, or do they return the order?
So those are four core business processes or four events that we’ll see in the data, relationships and concepts. And so for activity schema, the term we use is activity, and we say we saw that activity happen. From my podcast with Hans, we talk about Peter the fly.
If I was sitting on a wall and I actually watched what happened with the humans, I saw a customer and I see them order that product. And so what we do is we rack and stack those activities into a single table. So now we’ve only got one place to go. You wanna ask a question, that’s the table for you, one table to rule them all.
And then the second thing is, we know that often the question, as we talked about, is the why, and what’s the time between these activities happening is actually the hard question. And why is it hard? Normally in a dimensional world, we’ve gotta go away and we’ve gotta grab multiple fact tables at multiple grains, slam them together, and then be able to do the time series, temporal query:
when did that happen? When did that happen? What’s the difference in time between those two? Or in Data Vault, we’ve gotta grab the links, we’ve gotta slam them together, and we’ve gotta say I’ve got multiple events now, or activities, and then the same kind of thing. So with activity schema, by putting the activities in a single table and having some core metadata, effectively around business effective dates and when that thing happened, we can now just write a windowing function.
It’s the same windowing function to say, take this activity, customer ordered product, and take this activity, customer made payment, and tell me the time between, because we know the structure of the table. And that query becomes repeatable because it’s exactly the same pattern of query. Did I get that right?
Ahmed: So 95%. In the activity schema world, instead of thinking about it as window functions, we like to call them temporal joins, and we found there’s only 10 of them. So why are we not using window functions? Because as you add data in and you wanna add columns, you wanna be able to pull more nuanced relationships.
So the temporal joins that we have are first ever, last ever. So gimme everyone’s orders and gimme the first web visit they came from, and so I can see the ad source, easy. Gimme the last order they have, so I can know if they’re an active customer or not. So that’s first ever, last ever.
We also have first before, last before, most common. When they had this order, gimme the last time before, and you could add, within 30 minutes, did they submit a ticket? Then there’s in between. Okay, now in between orders, did they email us? So not ever or after, but in between, interweaved with the specific orders.
So by having these very specific ways of actually doing it, you end up creating a more consistent, standardized way of relating data, which is these 10 temporal joins. And you can add as many more dimensions relating the data to the original starting point, what we call the cohort activity. So you start from an order and you can add as many things: last before this, first after this, in between this, aggregate all the orders in between, aggregate all ever.
So I can have the LTV, all those sorts of things. Add the next 30 days so I can get LTV for the next 30 days. All that becomes really easy because you’re always stacking that data or appending that data to it. It just creates a more standard way of relating the data, and you’re always trying to describe that customer journey.
I care about when this happens, and when this happens in between, or when this happens, or this happens after, that sort of thing.
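As a second illustration, an "aggregate in between" relationship might be hand-written roughly like this, again against the assumed customer_stream sketch; the activity names are made up for the example.

```sql
-- "Aggregate in between": for each order, count the emails the customer
-- opened before their next order.
WITH orders AS (
    SELECT
        customer,
        ts AS order_at,
        LEAD(ts) OVER (PARTITION BY customer ORDER BY ts) AS next_order_at
    FROM customer_stream
    WHERE activity = 'completed_order'
)
SELECT
    o.customer,
    o.order_at,
    COUNT(e.ts) AS emails_opened_before_next_order
FROM orders o
LEFT JOIN customer_stream e
       ON  e.customer = o.customer
       AND e.activity = 'opened_email'
       AND e.ts > o.order_at
       AND (o.next_order_at IS NULL OR e.ts < o.next_order_at)
GROUP BY o.customer, o.order_at;
```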
Shane: Okay, so if I don’t use the word window function, let me give you another pattern that we had years ago. We used to have OLAP cubes, back in the days when, again, it was constraint based; our relational databases couldn’t handle high-volume aggregate queries.
So we then had another database which was multi-dimensional, so ROLAP, MOLAP, a whole set of different techniques. And the goal of that was to be able to get aggregate queries faster than traditional relational databases. And we typically had analytical functions. So what you’re saying is with the activity schema, we get 10 patterns, 10 query patterns, 10 analytical functions that are time-based,
so our ability to ask questions over time for those activities or events, and the relationship between them gives us the answer. Cool. So did I get that right?
Ahmed: Yep. Perfect.
Shane: Excellent. Okay. Activities or events, we rack and stack ’em into a table. So we know that there’s one place to go to see all the events that relate to something.
We have these 10 query patterns, these 10 analytical functions, these 10 temporal things that say we can ask those 10 questions and they will always give us the right answer. And they’re the top 10 that we always use. And they’re quite complex if you have to write them by hand. And then the last thing we wanna talk about is concepts,
and the way I describe it is customer orders product. I have a concept of customer, a concept of order, a concept of product. Customer pays for order: I now have a concept of a customer, a concept of a payment, a concept of an order. So you could think of concepts as a dim in star schema world, maybe.
Typically it’s something you want to count. It’s something you wanna manage. You manage your customers, you manage your suppliers, you manage your orders, you manage your employees. So in activity schema, where do the concepts live? Where are the things that bind that activity together to say these things had a relationship at that point in time?
Ahmed: So that’s another uniqueness about activity schema: you actually relate everything to the core actor or the entity that you care about. So you might have multiple entities and multiple activity streams, based on your business model. We are a multi-sided platform; we have a person stream and a company stream, because a company does things like make payments and stuff, but a person might log into a portal.
So for the e-com example that you use, one of the things you said is you already described it in the language, right? So the customer might complete an order, we might ship an order. We also might have another activity called ship product. We might have another activity called purchase product. The customer is doing these behavioral things that we call activities, and that’s where everything lives.
So everything you will tie back, as long as it’s related to the actor: a customer opens an email, a customer has their order delivered, whatever you wanna do. If someone is doing it, you keep it as activities. And the reason why we do that is explicitness. Instead of having concepts, where you might have payments and payment,
it’s like payments exist as an object. We don’t like to think about objects in the activity schema. We like to think about customer behavior. So a customer makes a payment, a customer has a payment get canceled, a customer may submit a billing inquiry, and as long as you keep it that way, it matches very closely to how people are asking and answering questions.
And that’s what makes it really simple: if someone has a new concept that they’re asking questions about, you can make it an activity. It doesn’t really matter. Activities can overlap a little bit, it’s okay. You might have, like you said, signed contract, started subscription, made invoice, and you can always ask questions using those words.
You can also have some sort of metadata that says, this payment ties to this specific invoice. But from the customer’s perspective, you can say, gimme every time the customer had an invoice and gimme the first payment after they had that invoice. Where you need the metadata, you can do an additional join on the same id, but for most things you don’t need to worry about that because it ends up being much less common.
Shane: All right, let me just replay that back to you to make sure I got the pattern and to reinforce it. And then I’m gonna go into some of the complex ones where I think activity schema has got some problems. Not problems, more contexts where it doesn’t fit, to be clear here.
Every modeling technique we have works really well in certain contexts, and there are other contexts where we pay the cost. So yeah, with Data Vault, it’s the complexity of the tables that we create. Cuz like you said, we create a shit ton of them. And if you have to query them in their raw form as an analyst, you’re screwed.
We have to do some extra work to solve that problem. But let’s go back to the core of it. So what we’re saying is that the activity schema is bound by a core concept, so we might say we have an activity schema. Do you call ’em tables, or do you
Ahmed: We call, yeah, we call ’em activity streams. So your activity schema, you have multiple streams.
Shane: Okay. But effectively in my head, it’s the table, so we have
Ahmed: a one table. It’s your single table. Yeah.
Shane: Cool. Alright. And so we have one that’s bound by the core concept of customer. So every activity that relates, or every event that relates, to a customer is in that table. Now what’s interesting about that is we have to shift our language slightly when we talk about who does what,
so we go customer orders product, customer pays for order. That’s good because we start off with the word customer; it’s an activity that customer is doing. So we go, naturally, yeah, customer did that. But when I go to store ships product, it’s normally store ships product to customer,
so in natural language, customer is at the end. But in an activity schema, what we say is customer has product shipped to them from the store, cuz we are binding it back to that core concept. So our language is the same language, but it’s slightly changed: we always put customer at the front. And if we can do that, customer returns product, then we have a boundary for that activity,
that activity schema.
Ahmed: So just to add one more note on that, it’s really critical and it makes asking and answering the right question really easy. Can I give one example that I like to give? I had a customer, they have an online platform, and they were trying to understand, when they launch a class, how many people sign up, so they could work out the best time to launch a class.
They had done this analysis and they were like, great, we have a table. It’s classes get launched, and then they get whatever the average number of signups is. And they say, based on when the class gets launched, what’s the average number of signups? Makes sense. In the activity schema world, it doesn’t let you do that.
It’s because, what do you mean, when a class gets launched? But that’s important, because when a class gets launched cannot impact the people signing up. Because if it gets launched at 6:00 AM and everyone gets an email at 8:00 AM, you care about when people actually see the class.
Even though you’re saying, based on when I launched the class, how does it impact things, what you’re actually saying is, when a customer first sees the class, how does it impact the likelihood of them signing up for it? So it’s based on when customers see classes, and that’s where having that core entity makes asking those questions really nice.
We saw a lot of bad decisions where people had this whole story, and that’s why this customer ended up buying Narrator. After we did this analysis, we realized that, oh, it turns out when we launched a class had no impact, even though it looked like it. It’s because it’s actually based on when we sent the email,
cause that’s when everyone actually sees the class. In the activity world, that would’ve never happened. They would never have been able to make that mistake, because you only tie things together through the customer, and therefore it’s when the customer sees the class, not when a database updates a thing.
And that’s a really important part that has led to a lot better analysis in the activity schema world.
Shane: If you come back to Lawrence Corr, around who does what with BEAM, it’s the same thing, we go who does what. So we’d see employee launches class. There’s no customer in there, there’s no customer in that who does what. So then it might be, system notifies customer that course has been launched,
something somewhere like that. Okay, so we bind the activity schema table to a core concept. So if I use again my customer orders product, and we look at an organization that actually manufactures the product before they send it out, we’re gonna see another activity schema table,
which is bound to the concept of product, because we’re gonna see maybe products being manufactured, products put in the warehouse, products shipped out to customers. So we’re gonna see those activities because, we know, Peter the fly, we are watching that product and we’re watching what happens to fulfill the order, all the activities or steps that happen to that product before that process is finished.
Did I get that right? So we end up with a bunch of core activity schemas, and what we know is actually, if we look at an organization that’s modeled well over time, there are probably somewhere between seven and 20 core concepts in that organization, 20 being a really complex organization.
And we also know that actually customer is always the primary concept. If you’re an organization that deals with people, you might call it customer, stakeholder, citizen, whatever, but a person or an organization, and we get some complexities there, is always the driver of most of the events or the activities.
Is that right?
Ahmed: Yep. Exactly. So in Narrator world, you would have a product stream and a customer stream. They might have some similar activities, like a customer might have their product manufactured, and a product stream might have started manufacturing. And the reason why that’s really important is, when you’re asking questions about your operational excellence, you might say, how long does it take for a product to be manufactured, or how many times has a product moved in a warehouse?
And based on those questions, based on the entity you’re trying to change, you will go to that stream. So if you’re trying to change the manufacturing processes, you would look at the product stream. If you’re trying to impact the customer, you would go to the customer stream.
It’s okay for activities to be in both streams. Cause a customer’s product also might get started manufacturing, a product might complete manufacturing, and so many things that happen.
Shane: Okay, so there’s something subtle in there that I missed. We all know that often we get given a question where there’s no data for it. Why are our highest value customers buying that product at 2:00 AM in the morning? We just don’t have data for that. We don’t know. Good question.
Let’s go get some data. So that example that you used, that product one, so customer has product manufactured for them, cause we’re binding this one to the customer activities. Often that’s not available, that’s not data we store. We don’t actually say when we’re manufacturing a product, unless it’s a product on demand for that customer.
We typically manufacture a bunch of products, we put them in a warehouse, and then we pick them and send them. Amazon doesn’t say this book was manufactured for Shane Gibson. It’s in the warehouse and they’re picking it. But if we can actually say that product was manufactured for a customer,
that’s the way our core business processes work, then we inject it both into the product activity schema and into the customer activity schema. Because why not? There’s no cost. And now the ability to query that is much easier, cause we don’t have to cross join across those two activity
schemas. Which is where I got to, my next question is actually joining those activity schemas
Ahmed: you never,
Shane: Must be a nightmare. Because whenever we do any joins of any data that’s at a different grain. Okay. So you don’t
Ahmed: You don’t. And you never need to, because you shouldn’t ever. So, you see this a lot, let’s go through some examples, right? Can you come up with an example where you might think you might want to join two streams?
Shane: Let’s do the standard one, which is I have a prospect coming into the website and I dunno who they are, but I have some anonymous prospect id cause it’s Google. Then they come in and then they sign up for a form. And so we convert ’em to somebody we know via an email address.
And then we’ll go out and we’ll ping them, we have a marketing campaign. So now I combine that to my customer, and then they come in and buy something, and then we ship it. And then we do fraud detection on them.
Ahmed: So far everything you said is still related to a person. And in Narrator there is a concept of a customer and what’s called an anonymous customer id, which is, if you don’t know who they are, you have your global identifier, which let’s say is an email, and then there might be a credit card number, there might be a first name, last name, zip code.
There might be a cookie, there might be a prospect id. So it’s still the same person. But taking your example one step further, let’s say they then sign up as a company, and now they’re a company or an account, and an account pays an invoice, and you wanna know, from that person who first signed up, did they pay, how long did it take for that first person to pay? In that situation, which we do a lot, you might have the activity on the person’s stream, Ahmed signed up, and Ahmed might also have, Ahmed’s company pays bill. So you might have an activity called company pays bill. The same goes the other way with the account stream or the company stream: we might have employee received email or employee submitted lead. So we can have those concepts from the perspective of the entity you’re trying to change, which makes it super easy to query, because when you’re asking questions, again, your questions are often, I want to know, for that person who comes in, how long till their company makes a payment? Well, great.
I have company makes payment and submits a lead, and it’s in the same stream. Super easy, everything is nice and clean. And yes, when I’m building those activities, I might in my backend database have company, and company has user, and I might join on the company user, and I’ll just add an activity for every person in that company at the time that they made a payment.
It’s okay, because from the perspective of that person’s individual stream, it makes sense. And that’s why we build this thing on top of a warehouse. You mentioned activity versus event; I use the distinct word so people don’t mistake it for the way that you naturally treat CDPs. Because you have the database, you can combine, you can create that company activity, you can put that on every single person you care about.
You can do that and it makes it really easy to answer questions.
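To illustrate the anonymous customer id idea, here is one way the stitching might be sketched by hand over the assumed customer_stream table: early page views carry only a cookie, later activities carry the email, and a small mapping lets you resolve them to the same person. This is illustrative, not how any particular product implements it.

```sql
-- Illustrative identity stitching: map each anonymous id (e.g. a cookie) to
-- the customer identifier seen later on the same stream, then resolve every
-- activity row to a single person.
WITH identity_map AS (
    SELECT anonymous_customer_id, MIN(customer) AS customer
    FROM customer_stream
    WHERE customer IS NOT NULL
      AND anonymous_customer_id IS NOT NULL
    GROUP BY anonymous_customer_id
)
SELECT
    COALESCE(s.customer, m.customer) AS customer,
    s.activity,
    s.ts
FROM customer_stream s
LEFT JOIN identity_map m
       ON m.anonymous_customer_id = s.anonymous_customer_id;
```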
Shane: Okay. So that’s key though, is that where we see an event come through, using other modeling techniques we’d ask the who does what, where we say in that event, in that core business process, what concepts are involved. So customer orders product: we have a concept of customer, a concept of order, a concept of product.
We bind that together because it makes sense, cuz we know we’re gonna ask that question. With an activity schema, what we do is we may end up with a customer activity schema, an order activity schema and a product activity schema. And we would push that one activity to all three of those schemas, because they’re bound to it.
And now we can ask the questions between them. Okay, I get that. Reading it, I hadn’t picked that up.
Ahmed: Yeah. And just to make sure, it’s really rare that you actually have an order activity stream or a product stream. It’s only when we work with Costco-like companies, who have a manufacturing process, where they’re asking questions about the manufacturing and they’re asking questions about the person. For a typical e-com, it’s usually just that person’s stream, because that’s all you really care about.
Everything you’re asking is from the perspective of a person. And you wanna know, like gimme the outliers of when people take the longest to get their product delivered. It’s all relative to the person, actually we don’t give a shit about the products. You care about the person who’s waiting for a long time.
Shane: Yeah, but what we know is when we look at organizations, we focus on administration events. We focus on accounts clerk entered invoice, accounts manager approved invoice, somebody released invoice. We focus on those sub-processes because that’s what we see people doing.
And if I, and again, lemme just play this one back to you. So I’ll use a different example, insurance company,
Ahmed: of them.
Shane: a customer applies for cover. And then we know there’s an actuarial process, actually going through. But effectively when we think about it as a core business process, we always bind it to the policy,
customer applies for cover, we create some form of policy, we then get the actuarial team to go and look and say, are we gonna approve that policy or not. And what we don’t tend to do is tag that back with the customer name; we tend to divorce them from that activity, even though it relates to them.
So all we are saying is actually, customer has policy reviewed by actuarial team, and so
Ahmed: gets updated
Shane: yeah. Yeah. And we know that policy relates to
Ahmed: Yeah, and then you can answer really complex questions really easily. So for example, as a policy gets updated, you might want to know every time the customer is meeting with the sales team: the customer starts an opportunity, a customer has a meeting, a customer might go into discovery, a customer might have another meeting.
And I wanna know, when they had that meeting, what was the policy revenue, for example. This is really easy to do in activity schema, because you say, when a customer had that meeting, what was their last before updated policy, and I can then see the revenue there. So even though the policy might be going through a separate experience, and a customer is going through a separate sales experience, because I have customer started opportunity, customer went into discovery, customer had a meeting, customer had their policy updated,
I can always say the last before policy updated and
Shane: And then again, we’d probably see that the type of industry you’re in would drive what your core concept is. So we are doing some work at the moment in the utility market, and it’s not about customer at all, it’s about asset, the physical device. And so I can infer there that actually the majority of their activities will be bound to that one schema.
I might have some secondary schemas, but actually everything we talk about is around these assets, this bit of equipment. We don’t talk about customers at all, cuz for that part of the business, it’s irrelevant.
Ahmed: We had buildings and different things, and you might have building gets acquired, or building goes for review, or building opens, and all sorts of stuff.
Shane: As we’re building out more and more customers on our platform, we’re actually looking at the metadata.
Cause we’re really intrigued by this question. Based on our consulting experience, we have this gut feel that, from a who does what, from a concept point of view, most organizations when well modeled have somewhere between three to seven, maybe a maximum of 20, core concepts if they’re a really complex business with multiple businesses embedded within them,
multiple unique businesses under the same umbrella. So from your point of view, how many activity schemas does an organization typically have? If they have a core one, which is customer, then do they normally have 3, 5, 20? What’s the rule of thumb for how many others you’ve got?
Ahmed: I think most companies have one. I think probably 20 to 30% have two. I think the maximum I’ve ever seen across all the customers has been four.
Shane: Okay.
Ahmed: It’s really rare that you have more than four, unless you have multiple product lines. I would say every product line usually has something they’re trying to change.
And it’s really rare that you have so many things you wanna change. Even WeWork, which had many product lines, there were people, there were companies, there were buildings, and that was most of it. And then we had an internal stream, which is like an employee gets hired, an employee gets a computer, and so on, for all the people analytics.
Could you have other things? Maybe. It’s just less common. And the point is not to really overthink it; the whole point of activity schema is you don’t have to decide those things now. You could always add another activity, that’s the whole point of it being on a warehouse. If one activity never gets used, you can just delete it. You could always add another stream.
What I always like to say is, start with the questions you’re asking, see what the core entity is, and you’ll just keep going. And then if you keep feeling like you’re constantly trying to add new data, add an activity, it’s fine. And if you’ve created an activity that is never used, just remove it, and it’s fine.
And because everything is on top of the warehouse, we’re just really remodeling the data in the warehouse. You can always add and remove and change your mind; your data’s not lost, it’s just captured in the warehouse, so you can just go in and answer questions.
Shane: Okay. So let’s just replay that back. What we see with any modeling technique is there’s always a bunch of anti-patterns.
So within Data Vault we have the multi-active satellite, where in the satellite there’s more than one active record; that is an anti-pattern that causes you a shit ton of problems.
Star schemas, if you have multi-grain facts, you’re dead, so don’t do it. So what you are saying is, if you’re using activity schema and your organization has 20 activity streams, maybe have a look at what you’re doing, cuz you’re probably taking an old modeling technique and just trying to replace it with activity schema.
Ahmed: Exactly. Yeah.
Shane: The second pattern in there is that this is not one data modeling pattern to rule them all, right? Because what you are talking about is the data’s in a warehouse and now you’re applying activity schema to that data. So you are not saying that you get your data out of your system of capture and it goes straight into an activity schema,
and that’s the only data modeling technique you use. Is that right? You’re saying that you may use other modeling techniques if they have value, but the activity schema is the core data modeling technique you use to ask any question. Is that right?
Ahmed: Exactly. So we like to be in the warehouse. You can add dimensional joins onto the activity schema, of course, if that makes sense for you. It’s exactly that, it’s just the core, and it’s designed to ask and answer questions. You talked about anti-patterns. We say if you build more than two streams, you should ask yourself why.
If you have more than 30 activities, you should stop and say why. Most companies don’t have that many. And then three, if you find yourself taking more than 10 minutes to answer a question, you should stop and say why. And if you’re spending more than 15 minutes to define an activity, you should stop and ask why.
Literally, these are alerts that we have in our product, where if the user is opening the transformation editor and they’re spending more than 15 minutes to define an activity, we will put up an alert. This query should usually be simple. It should be a really light mapping, not really complicated.
And then everything else happens with the activities, which makes it really easy so you don’t have to worry about it.
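As a sense of how light that mapping can be, here is a sketch of a single activity definition built from a hypothetical production orders table. The table, columns, and the JSON construction are made-up, Snowflake-flavoured illustrations, not anyone’s actual implementation.

```sql
-- Illustrative activity definition: map a hypothetical orders table into the
-- activity stream shape sketched earlier.
SELECT
    o.order_id          AS activity_id,
    o.created_at        AS ts,
    o.customer_email    AS customer,
    NULL                AS anonymous_customer_id,
    'completed_order'   AS activity,
    OBJECT_CONSTRUCT(
        'order_status', o.order_status,
        'total_items',  o.total_items
    )                   AS feature_json,   -- Snowflake-flavoured JSON construction
    o.total_amount      AS revenue_impact,
    o.order_url         AS link
FROM shop.orders AS o;
```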
Shane: That’s good, because again, when I looked at it, I thought for me Data Vault as the middle combining layer, where we combine data from multiple source systems into one place around those concepts, made sense, with activity schema then on top, where I’m effectively making it incredibly easy to query that beast of a model.
But I’m still gonna think about it a bit more cause I’m not sure why we need the complexity in the middle. I need to go
Ahmed: Most of our customers don’t do that. Most customers go production systems, build activities on top of them, and then go and answer questions. And it becomes really nifty, cuz each activity is often one concept. If an activity has more than one concept, then you’re like, whoa,
that should be two separate activities, stop doing that. So it should be like a trivial mapping that you put on top; you’re just restructuring your data into the model.
Shane: Okay. So again, if we think about core concepts and who does whats and those business processes, like I said, there are always three to seven concepts involved in a relationship when we watch it happen. And again, anti-pattern: you get to 20 things in that relationship, and you’re modeling wrong.
So if we think about that, one of the things that I struggled with was, yes, I hold the core concept of customer or person against that activity stream, but I naturally want to add the other concepts in. I wanna do group bys, I want to report by those things.
So I naturally want to bind it to, there was a product and there was an order against that activity event. And the activity schema was limited to, I think, three features. And then I remember you came out with something, I think it was, that said you’ve gone into some larger customers and that didn’t scale.
Cuz we know that, once we get these modeling techniques, we then get all the edge cases, where it’s okay, nine times outta 10 it works. But if you happen to be doing that one thing as a company, then we do it. And so talk me through that, these secondary concepts that are bound to that activity.
Ahmed: Yeah. So the way we designed it was we said, okay, most activities have three features, so we’ll just maintain three features. And then we said, if you wanted more than three features, you would just do a normal dimensional join to add additional features, just traditional. This was designed seven years ago; back then JSON and unstructured data wasn’t really common.
People didn’t put that in their warehouse, so I was like, okay, three features for most things. It’s really fast to query, and then a join when you need additional detail is fine. The honest answer is people got really obsessed with the three features and they’re like, what if I want more? And I’m like, you don’t really need more.
And you have dimensional tables. I can tell you, across our companies, only 4% of activities had a dimensional join. Most people didn’t need more than three. So I was like, okay, you still have your dimensional joins to use and we support it, but you don’t really need it.
But people were really obsessed. They were like, great, let’s just make it a JSON blob so you can add as many features as you want. Part of that is, you talked about anti-patterns, we do a lot of checking for this and recommendations, and our whole system is designed to help people avoid it.
For an order, people are like, okay, I wanna know the ad source of the order and I wanna know the attribution of the order. And you’re like, no. Those are derived; that’s shoving things into features. An order happens: keep what is related to the order only. The ad source where we acquired the customer? Great,
add a first ever started session and grab the ad source from that. Don’t shove it into the order’s features. People have a tendency to create hundreds of features, and you don’t need that. You only need the features that are unique to the order, because you can always borrow features from other activities.
So if you want to know how many people often do they’ll have I don’t know, there’ll be a lot of things that you might think you need but you don’t you wanna know, when a customer perks the product, they’re I wanna know how many inventory we had in that product.
Or you might, that might make more sense to put it as when the customer views the product. So you might say, gimme last you. So we allow you to now have a J S O blob where you can add as many features as you want. We try to recommend that you don’t put more than three or five. We still support dimensional joints, so you can still do that which is helpful if you , have a table that you just wanna link to.
But we try to avoid. still give you the flexibility as many as you want with the jsun now and the structured data, but we try to recommend that you don’t shove like a million features.
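To make that structure concrete, here is a minimal sketch of what an activity stream table could look like, with the activity-specific features packed into a single JSON column. The table name, column names and types are illustrative assumptions for this discussion, not the exact ActivitySchema specification.

```sql
-- Hypothetical activity stream table (names and types are assumptions for illustration).
-- One row per activity a customer performed; activity-specific attributes live in feature_json.
CREATE TABLE customer_stream (
    activity_id    VARCHAR,    -- unique id for this activity row
    ts             TIMESTAMP,  -- when the activity happened
    customer       VARCHAR,    -- the single entity the stream is modeled around
    activity       VARCHAR,    -- e.g. 'completed_order', 'viewed_page', 'started_session'
    feature_json   VARIANT,    -- flat JSON object of features unique to this activity,
                               -- e.g. {"order_value": 120.0, "tax_amount": 9.6}
    revenue_impact FLOAT,      -- optional monetary impact of the activity
    link           VARCHAR     -- optional URL back to the source record
);
```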
Shane: Okay, so let’s unpack that, cuz there were two patterns in there. The first pattern was: I’ve got somebody coming through to the website doing a page view, and I want to attribute where they came from, and our natural reaction is to store the attribution of which ad they came from against that event.
But what you are saying is it’s an event in itself:
customer viewed ad, and then customer came through and viewed webpage, and now we know they viewed the ad at this time and they viewed the webpage at this time, and therefore those activities are correlated.
So again, we de-normalized the attribute back to being a core activity itself. That’s the first pattern. And if you’ve been trained in any other type of modeling technique, the way your brain works is: they’re just attributes of something, they’re not core things.
So you’ve gotta untrain yourself and go, that attribute’s actually an event, it’s an activity. Okay. And then the second thing, which was my next question, is the way we talk about core concepts, customers, orders and products, and we talk about detail. The customer has a name.
The order has a value, the product has a SKU, and that detail is important. So what you are saying is, in the original pattern you would have three features, and to get the detail of the customer you would join out and grab that detail from another table that describes the customer name. To make it more flexible, because we hate constraints,
whenever we see a constraint we think we are losing something, and we focus on, oh, we can’t do that. What we do now is we effectively land those concepts as a JSON object. So the pattern technically is: parse the JSON object, get that core concept, and then bounce out as a join to the detail table to say, okay, that thing has a whole lot of attributes or detail about it that relates to that concept.
So yes, we can ask that question out of that activity schema really easily, but there’s still a little bit of complexity, cuz every modeling technique has complexity. And this one is: when you want the details, you’ve then gotta go out and get the details from something.
Okay? That makes sense.
Ahmed: Yeah, you have to open the JSON blob and get the details from the JSON blob. So the activity schema is 90% of the stuff you go to, but there’s also what we call our customer dimension, which just has the name and the age and all the stuff that you put on the customer.
You might have an aggregate dimension, which is like, if you’re spending money, I spent $30,000 on this campaign on this day. And those can be joined on top of the schema when you need them. But most of the stuff you’re doing is in the schema.
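As a hedged sketch of what “borrowing” a feature from another activity looks like in practice, the query below pulls each customer’s ad source from their first-ever started_session activity and attaches it to their completed_order rows, rather than storing ad source as an order feature. The table, activity and feature names are assumptions carried over from the earlier sketch, not Narrator’s actual implementation.

```sql
-- Borrow the ad source from the customer's first-ever started_session activity
-- instead of duplicating it onto every order (names are illustrative).
WITH first_session AS (
    SELECT
        customer,
        feature_json:ad_source::STRING AS first_ad_source,
        ROW_NUMBER() OVER (PARTITION BY customer ORDER BY ts) AS rn
    FROM customer_stream
    WHERE activity = 'started_session'
)
SELECT
    o.customer,
    o.ts                              AS order_at,
    o.feature_json:order_value::FLOAT AS order_value,
    s.first_ad_source
FROM customer_stream o
LEFT JOIN first_session s
    ON s.customer = o.customer
   AND s.rn = 1
WHERE o.activity = 'completed_order';
```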
Shane: Cool. So one of the things I love about Data Vault is because there’s so few objects, there’s so little code for 90% of the cases. We talk about create hub, create sat, create link, where we create those tables, and load hub, load sat, load link. There’s only really six bits of code.
And the benefit of that is we harden that code, we use that code every day, that code becomes immutable, eventually we stop playing with it. And what I can feel about activity schema is that it’s the same: when we branch out to those detail tables, those what you call dimensions, we know it’s parse the JSON field,
find that feature, look that feature table up. Because the naming should be the same, if I make the name of that table the same as the name of the feature, now I can automate it.
Ahmed: Oh, it’s embedded in JSON, so it’s just unpack the JSON. We actually just literally store a JSON blob called feature_json on that table. So in that row you just unpack it, yeah.
Shane: So if my features are named the same as the dimensional table that holds the detail, I now have just one bit of code, which is: unpack that JSON, go look up that table, get the detail, if I was writing it myself.
Ahmed: Oh, the details would be part of the JSON, so you don’t have to look them up. All the details are in the JSON blob. The customer dimension, when you join on customer, that’s like a standard dimension. And then on the activity, the order details would just be in the feature JSON on that row.
Shane: So order value would be in the JSON.
Ahmed: Yeah.
Shane: Ah, okay.
Ahmed: So you don’t have to look it up. You just literally unpack it. It’s right there.
Shane: Okay, but you’re doing it as a nested JSON object, so you’re effectively embedding detail tables for each of those features in that JSON object?
Ahmed: It’s single-level. We never make that JSON have an array. So within that order, if you’re used to Snowflake, you would just query the order activity and say feature_json:order_value and you’ll get the order value, or feature_json:tax_amount and that gives you the tax amount.
And it’s like that, so it’s really easy to use. We don’t let you nest it any deeper; it just makes it unnecessarily complex.
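In Snowflake-style SQL, the extraction Ahmed describes might look something like this, assuming the same hypothetical customer_stream table and feature names used in the earlier sketches.

```sql
-- Order details live in feature_json on the same activity row,
-- so no lookup to a separate detail table is needed.
SELECT
    customer,
    ts,
    feature_json:order_value::FLOAT AS order_value,
    feature_json:tax_amount::FLOAT  AS tax_amount
FROM customer_stream
WHERE activity = 'completed_order';
```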
Shane: So if I play it back: effectively you are taking the dimensional attributes and you are binding them in as a JSON string against the activity in the activity table, so I’m not going out to another table. And then, how do we deal with change? Say the customer changes their name.
Ahmed: Yeah, great question. And now we’re in the realm of the paradigm and how you work around some of these things. There are two ways of doing it that we support, depending on whether it matters to track the change or not. So remember, the customer has an attribute table, and you might just update that.
That table’s often a materialized view, and it might just have the updated name. But let’s say the customer changes something that’s important for us to track. Then you might have that as an activity, like customer updated their location.
In the Narrator world, which is activity schema with the edge cases, we do support what’s called a slowly changing dimension, where you can say, okay, I have this other table, customer credit card, with an updated-at, and Narrator can take that, do a last-before join for you in the background, and bring that in to make it look like a dimension.
So there are those additional cases, but yeah, in activities you just do it.
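A hedged sketch of what that last-before join could look like if written by hand: for each order, pick the credit-card record most recently updated at or before the order’s timestamp. The customer_credit_card table and its columns are assumptions for illustration; Narrator’s actual background implementation may differ.

```sql
-- "Last before" join: attach the latest credit-card record as of each order.
-- QUALIFY keeps only the most recently updated matching record per order row.
SELECT
    o.activity_id,
    o.customer,
    o.ts AS order_at,
    cc.card_type
FROM customer_stream o
LEFT JOIN customer_credit_card cc
    ON cc.customer = o.customer
   AND cc.updated_at <= o.ts
WHERE o.activity = 'completed_order'
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY o.activity_id
    ORDER BY cc.updated_at DESC
) = 1;
```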
Shane: As I said, with every data modeling technique there’s the core things it does really well, and then there’s the edge cases that always bite us in the bum, and the key is understanding those and dealing with them. Okay, this has been awesome. We’re pretty much running out of time, so I’ve got a couple of quickfire questions.
Woo. So I haven’t done this before, so this is gonna be fun. There must be a standard set, or a number of events, that you see on a regular basis for specific use cases, specific industries.
Ahmed: Yeah,
Shane: Cool. Do you publish those openly?
Ahmed: Yeah. So we’re beginning to do that now for e-comm, and we have that as what we call packages. And next year we’re gonna actually open source it. So you’ll see it in our docs site, docs.ai. We give you the common activities per industry. We’ve also realized that
people just like starters. So if you’re an e-comm company, you come into Narrator and you have Shopify, it’ll spin up seven activities for you and you can go in and edit them. We also do that for spinning up dashboards and analyses that are just common things you do. So the idea of Narrator is you get started,
and we would spin up those activities, those dashboards and analyses, and we call that a package. So that’s an e-comm package. And starting next year, Narrator is gonna actually allow all our current customers to upload their own packages or their own analyses or their own dashboards. So in a world where everything’s on top of a standard structure, people can actually begin sharing their work.
Cause even though you’re two different e-comm companies, you still have completed orders and shipped product. We can talk about it and we can do analyses on it. But when we go to implement those analyses, we often have to deal with everyone’s unique data. In this world, everything is the same. So even though you have unique definitions for what an order is, everything else for the analysis is the same.
So we’re actually allowing customers to begin sharing those things, and buying and selling them across companies, and just making that really easy.
Shane: Excellent. And then we all know that we start off with one data engineer if we’re a small company. And the standard joke I have is I’ve never seen a one-person data team. So we always go to three to five, create a team, and then we see some success and some value, but we get this massive backlog.
So we then scale and we go to three teams, and then we have to figure out how to break the teams up. Are we gonna do pipelining, where it’s an awful, wonderful process of requirements handed over to build, handed over to QA? Or are we gonna domain-bound them? So we go, you are the finance team and you are the marketing team.
So activity schema, if that’s the core modeling technique we are using, is gonna drive a lot of our scaling patterns.
Ahmed: Yeah, you said you never see one or two person data teams; I actually see a lot of one and two person data teams. In fact, our biggest leads come from companies who just laid off half their data team. That’s usually when they get interested in the activity schema or Narrator.
And usually, because again your definitions are so simple and most of the time you’re just reassembling those definitions, we often see question-answering speed of 10 minutes. A single data analyst, or a team of two, can handle a hundred questions a month, which is unheard of.
Just by having a simpler data modeling paradigm, so many things go away. And so we see so many smaller teams doing a lot more work. When Narrator had a consultancy arm, we used to have one data analyst per 10 companies. That was the ratio we were running, and you can see that value in just the repetitiveness and the reusability of the tool.
Shane: If we were gonna scale the teams, or we were gonna distribute them out into the business because we wanted to go to a more decentralized, distributed model, what we’d do is we’d do it around the activity, the core concept of the activity. So we’d have a team that looked after customer activities, and we might have a team that looked after manufacturing activities.
And then the key handoffs would be when the manufacturing activities were defined and they had customer in the sentence; then they’d notify the customer activity team: hey, here’s another event that you need. Can we push it for you, or do you wanna go push it yourself? So that’s the shared language, where your activity actually has one of the core concepts for another activity team.
You just let them know that you’ve got this thing that’s useful for them. Okay?
Ahmed: Yeah. Also, because it’s SQL and these are actual tables in your warehouse, it’s very common to just literally write a SQL query that wraps around the other stream and brings the data in. That’s really normal too.
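As an illustrative sketch of that handoff, a query like the one below could wrap another team’s stream and copy the relevant activities into your own. The manufacturing_stream table and the activity name are hypothetical, and the columns assume the earlier customer_stream sketch.

```sql
-- Bring another team's shipped_product activities into the customer stream
-- by wrapping their stream in plain SQL (table and activity names are illustrative).
INSERT INTO customer_stream
    (activity_id, ts, customer, activity, feature_json, revenue_impact, link)
SELECT
    activity_id, ts, customer, activity, feature_json, revenue_impact, link
FROM manufacturing_stream
WHERE activity = 'shipped_product';
```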
Shane: And so one of the ones that I’m always intrigued by is this pattern of data mesh, which I was really excited about in the beginning cause I thought it was actually gonna change the way we work. For me, everybody’s just buzz-washing it now.
But if we were gonna mesh this, if we were gonna push this back to the software engineering team to do the data work without us, this would be helpful, wouldn’t it? Because effectively, once you’ve taught that software engineering team what the activity schema is, they’re gonna actually populate the activity schema table out of the software that they build,
because they know they’re building an e-commerce website that relates to customer and they know what the events are. So we should be able to teach them to use this technique and have them do the work without us, shouldn’t we?
Ahmed: The answer is yes, and. In theory that sounds great. In practice, those people are really bad at defining activities. That’s the reason why we shifted away from people streaming events in. We always get that question: why don’t you just build a CDP with all this extra power? It’s a super powerful CDP. And the reason is that software engineers have the bad practice of tracking everything.
They just wanna log it all, and then you end up creating so many garbage activities that are so nuanced that no one uses them. So I like to actually think about it backwards: if you have a single data analyst answering your questions, then as you get questions, start defining activities only as questions arise, and then you’ll see what activities you actually need.
Cause so much of that data can exist in your database already. Most people are asking about primary database things, so you already have that data, and you can build activities and they end up reflecting the concept. An example: WeWork didn’t have, in our entire database systems, the concept of an office or a desk.
And that’s what we sell. But the WeWork engineering team thought about it as reservables and reservations and so on. The whole point is to make activities resemble, as closely as possible, the way that customers are asking questions, not how the data is represented.
So we are very much for: don’t start from what you built and say, let me model it. Start from what people are asking and build models that represent that language. Because you’re going from the perspective of the stakeholder, whoever you imagine sitting on your shoulder observing everything.
That observing person doesn’t give a shit about what’s in your database and what your engineering team is calling it. So I would lean toward avoiding that. If you have an org where the software engineers are super tied to the business and they really understand it, then it makes sense. We do have CTOs of small companies that use Narrator every day; they like the flexibility of building business objects so they don’t have to query, building more stakeholder-representative activities and then answering questions that way.
Shane: And last one: ChatGPT. Cuz we can’t do any podcast now without it.
Ahmed: Yeah.
Shane: So we’re seeing that everybody’s runway’s gone unless you’re a generative AI product, and we see every incumbent and every new startup doing text-to-SQL.
Ahmed: Yeah.
Shane: and so one of the interesting things for me is, whenever you watch those demos, you sit there and you go, that is some lovely curated data,
Ahmed: yeah, exactly.
Shane: and we know that joins are difficult,
Ahmed: Yep.
Shane: It’s gonna be interesting to see how any of the generative AI products deal with that.
But again, one of the benefits of the activity schema, I could guess, is we’ve got one big table, so in theory ChatGPT will love it.
Ahmed: We do, and there are currently three open source projects on top that use ChatGPT with the activity schema and make that work really nicely. Narrator’s building our own version, which allows you to ask a lot more complex questions.
Barr Moses, founder and CEO of Monte Carlo, recently wrote an article about how the only way to get generative AI to ever become very popular in data is to have it on top of a standard schema. And she references Narrator and activity schema,
cause that’s currently the only real standard. Everyone else has modeling paradigms, and they let you build whatever tables you want. You have to think about it, and it’s up to the engineer or the data engineer who builds it to represent it. Activity schema came in like, no: there are rules, there’s one way of structuring the data, and you’re gonna follow it.
And because of that rigid structure, and the fact that you’re only using activities and temporal joins, it has opened the world for a lot of tools like Narrator to exist, but also for ChatGPT to be able to work, because it’s just such a controlled space. And the language that the human is asking questions in is the same language that the activity schema uses.
So you’re using words like in between, and first, and last, and made payment, and when the customer called us. So I believe the activity schema is the only way that ChatGPT will ever work on top of data. And I think a lot of people will share that belief: this is a standard structure, and the more people that abide by that standard structure, the more we can share, the more we can reuse, the more we can have ChatGPT on top of it.
That’s what I believe in for the world, and that’s why I started it. My first ever raise was about sharing and reusing analytics and your work, and it has to be on top of a standard structure that still supports every company customizing their definitions. And it does that really nicely.
It gives you control over definitions, but with reuse.
Shane: I think I agree with you partially, and then I’m going to disagree with you partially. I agree that the more you get highly opinionated products, the faster and more reusable they become. And then you always gotta watch out for the downside, which is, by being highly opinionated, you’re removing some choice, and that has some consequences.
But that’s okay, cuz you get the benefit that it’s faster and it’s more bulletproof and all that kind of stuff. And so for me, with ChatGPT, your highly opinionated structure is the schema itself. I think there’s another pattern, which we’ll see a lot of, which is that there’s config metadata
which holds the opinion, and the data structures themselves are a little bit more flexible. But it’s the same outcome, because we now have an opinionator that holds the context, the semantics, and then hints all that to ChatGPT so it doesn’t have to worry. So I think there are two implementation approaches to the same pattern, which is: standardize things, and therefore the engine knows what to ask in a repeatable way every time.
Ahmed: Yeah, a hundred percent. And semantic layers, I’m a very big fan of them. The thing that I would always say in this context is that if you’re a company that’s answering a series of questions that you can define a semantic layer on top of, you should do that.
Like, you should definitely do that. I think the thing we notice, especially with fast-moving startups, is that the questions are always unique. They’re just off the top of the head. So sitting and trying to imagine the semantic layer ends up with way too many entries. We talk about Airbnb and its data dictionary and their metrics layer.
They have 10,000. At that point, you can’t differentiate column 972 and column 322 because the words seem so similar, so at that point the semantic layer is garbage. The activity schema is really good for open-ended, fuzzy, repetitive questions. Once you build the activity schema and you start answering questions, some of those questions you just wanna fucking put in a table and reuse.
We let you do that in Narrator: materialize it, put it somewhere. The whole point is, when you have a new question, or you wanna add a dimension, or you wanna slice by something, Narrator makes that really easy. The activity schema lends itself really nicely to that chaos of building with more and more questions.
So if you have a clean structure that you want to represent and model, go ahead; I think that’s where some of these paradigms help you reuse. If you have a lot of new questions and there’s fuzziness and there’s dynamism and you’re dealing with a chaotic situation, use the activity schema. So yes, if you’re Nike and you have quarterly metrics that you’re doing, or you have CFO reports that need to be very specific, don’t use activity schema for that. Those are CFO reports; they need to be done in a specific way. But if you’re a machine learning team, or you’re a data analyst, or you’re at a startup, or your company is growing really fast and you have hundreds of data requests, that’s when the activity schema really lends itself well, because that speed is really important.
You need that flexibility, and speed will just carry you through to victory. Versus the world of a CFO report, where it’s formatting and organizing and it’s very specific: go build it, use a SQL query, put it in a semantic layer, model it the normal way,
depending on which modeling approach you use. But yeah, for fast questions, iterating, answering questions, there’s really nothing better than the activity schema.
Shane: It comes back to: there are patterns that are brilliant in a certain context, and there are patterns that are harder in other contexts. And there are patterns where actually it’s an anti-pattern if that’s your context; for most people, it doesn’t work. I still say have a go, cuz you might be lucky,
but the idea is to experiment. And so again, one of the things you said a little bit earlier, and I didn’t reinforce it, is this idea that it’s an agile modeling technique. You start small, typically with a customer activity schema, and then add a few events, a few activities, and then you add a few more. And then if you need to break out to a second stream, do it.
And then if it doesn’t have any value, kill it. Yeah, cattle not pets, that’s what we’re after in data. Excellent. Hey, look, that’s been great. If people wanna get hold of you, how do they find you and talk to you, see what you’re writing about, find out what you’re doing? What’s the best way for people to get in touch?
Ahmed: Yeah. You can add me on LinkedIn. My email is just ahmed@narrator.ai; feel free to email me any question. If you come up with an edge case where you’re like, how does Narrator work in that case? Shoot it over, I love those edge cases. And three, check out Narrator at narratordata.com. You can experience Narrator, sign up, try it.
It’s designed to be the activity schema plus all the edge cases that we found working with hundreds of companies, and I’ll interact with every single person who sees this.
Shane: Thanks for coming on the show and I hope everybody has a simply magical day.
Ahmed: Awesome. Thank you.