DataCafé

[Bite] Why Data Science projects fail

June 21, 2021 DataCafé Season 1 Episode 17

Data Science in a commercial setting should be a no-brainer, right? Firstly, data is becoming ubiquitous, with gigabytes being generated and collected every second. And secondly, there are new and more powerful data science tools and algorithms being developed and published every week. Surely just bringing the two together will deliver success...

In this episode, we explore why so many Data Science projects fail to live up to their initial potential. In a recent Gartner report, it is anticipated that 85% of Data Science projects will fail to deliver the value they should due to "bias in data, algorithms or the teams responsible for managing them". There are many reasons why data science projects stutter even aside from the data, the algorithms and the people.
We discuss six key technical reasons why Data Science projects typically don't succeed based on our experience and one big non-technical reason!

And being 'on the air' for a year now, we'd like to give a big Thank You to all our brilliant guests and listeners - we really could not have done this without you! It's been great getting feedback and comments on episodes. Do get in touch at jeremy@datacafe.uk or jason@datacafe.uk if you would like to tell us your experiences of successful or unsuccessful data science projects and share your ideas for future episodes.

Further Reading and Resources



Some links above may require payment or login. We are not endorsing them or receiving any payment for mentioning them. They are provided as is. Often free versions of papers are available and we would encourage you to investigate.

Recording date: 18 June 2021

Intro music by Music 4 Video Library (Patreon supporter)

Thanks for joining us in the DataCafé. You can follow us on twitter @DataCafePodcast and feel free to contact us about anything you've heard here or think would be an interesting topic in the future.

Jason:

Welcome to the DataCafe. I'm Jason.

Jeremy:

And I'm Jeremy. And today we're talking about failure. Actually, have we realised we have a birthday to celebrate?

Jason:

We have a birthday? Yeah.

Jeremy:

We are one. How about that?

Jason:

One year old on the air? It's so good. I didn't realise that a year has, like, crept up on us. And what a year it's been. It's been such a pleasure.

Jeremy:

It has, it really has. I didn't know what to expect from this, but I'm blown away by the fact that it's gone a year, the fact we've done, what, this is the 17th episode, is it? And...

Jason:

I don't even know actually.

Jeremy:

17 episodes, not bad going. And I certainly didn't think we'd do that many. And it's been such fun. Loved it.

Jason:

Yeah. And a major thank you and shout out to all of our brilliant guests for that; they've been fantastic. And I know that you've engaged with them a lot. So thanks to them for their time.

Jeremy:

Yeah, they make it what it is. And it's been fantastic talking to all these really interesting people, certainly.

Jason:

And of course, our listeners. Thank you to all of our listeners tuning in.

Jeremy:

Absolutely.

Jason:

And now, what we're going to talk about today is failure.

Jeremy:

I think this is good failure, sort of, anyway.

Jason:

Yeah, I think so, insofar as I've been reading these reports about why data science projects can fail, at least motivating it from a data point of view. So the data says that data projects can fail, and up to, what is it, 85% of big data projects fail according to a Gartner report in 2017.

Jeremy:

Wow.

Jason:

Which is massive. Yeah. And I think they had a 2020 report that said CEOs say that, of the ones that land, only about eight percent generate value, so the actual success rate is really low. Gosh. And the reason why I question it is from the point of view of, yeah, just because something has failed doesn't mean you haven't got a lot of learnings that then mean you're more likely to succeed next time round. And so that's obviously not captured in these reports.

Jeremy:

That's one way. But I mean, obviously, it can go the other way, which is you've had your fingers burned, and so you never go down that route again. Right. So true. Yeah, true.

Jason:

Yeah. And so I think what we can chat about today is some of the points that we could maybe avoid, or help people avoid, to stop your own data science projects failing, potentially.

Jeremy:

Exactly. So although this is about failure, really, it's about: let's see how we can learn so that we can succeed. I think that's the upper. That's the real upper. Yes. So I want to play a game with you today, Jason, then, if you'll indulge me.

Jason:

I love games.

Jeremy:

I have, I reckon, six pretty broad reasons why data science projects are prone to failure. And it can be combinations of those reasons, of course; it can be more than one. But I have six core sort of reasons why. And I call those the micro reasons, but they're really important. But then there's a big whopper: the big reason why I think projects fail. So I want you, maybe from your own experience, or from experience you've read about, to have a crack at seeing whether you can get some of these, and we'll have a chat about that.

Jason:

Okay, so this is like data project bingo. I hit them all? Okay. Oh, yeah, we need all those same words; that should have been a one-year gift to us. Yes. Okay, so I have to just come up with something that, immediately to me, is something I would have seen as a possible failure from the exposure that I've had to projects. And number one on my list is probably engagement, right? Stakeholders need to be engaged: engaged with you as a solution provider, engaged with the problem, and engaged with the strategy of the company, because you want it to align with their overarching KPIs. And once they're on board, great, that's like major obstacle number one.

Jeremy:

Yeah, that is definitely one of the key reasons why projects fail. I think if you don't have engaged stakeholders... I've worked in several projects where stakeholders have fallen out of love with the idea of doing the project, or never really bought into the idea of doing the project in the first place. Or sometimes the person setting the project up has this great idea, 'Oh, we should talk to the data science team, they're really great,' but they're actually nothing to do with the pathway to delivering that project. And so the people who actually have to work with the tools that you develop are going, 'Sorry, what's this? Where did this come from? We didn't ask for this.' And so that's a real issue. And of course, they're just as much stakeholders; in fact, really, they are the stakeholders in this case. So yeah, stakeholder engagement, number one. That's a really good get. Well done.

Jason:

Oh, I should get a coffee. Oh, yeah. Just sitting on that victory, like, great. And get another coffee here. And number two, the data itself has to be clean, or at least in some form that's usable. Absolutely. I was at a talk this week and somebody referred to their data, I don't know what to call this, but they called it a swamp. It might have been a cloud or it might have been a lake somewhere, but they referred to it as a swamp. And I was like, wow, yeah, there could be a swamp of data where people are saying, 'Let's get some scientists in to look at this swamp.' And it's like, oh, we need to prep the data better than this. Yeah.

Jeremy:

Yeah. How it's being collected, when it was collected, and whether that collection mechanism has changed, as it almost always does over time. Who knows about the data? You know, I think organisations that do this well typically have curators, you know, who are really experienced, really savvy people who look after sometimes individual tables, and they know everything about the columns, the metadata, the provenance of that data. Yeah. And they are the go-to people for that data set, and allow you to then say, 'Oh, what can you tell me about how we're collecting that longitude and latitude? And what level of accuracy do we have? And when we changed equipment, how did that affect the output?' All of this stuff is just so important, but so often you only discover when you're well buried within the project quite how parlous a state it can be in.

Jason:

I've engaged with some data analysts who blow me away with their knowledge of the depth of the data that they have.

Jeremy:

Right, exactly. Sometimes it's community knowledge, you know, that a team has that knowledge about the data, or some stakeholders have the knowledge. But it's rarely in one place, I find. You often have to hunt high and low to find the people who do really know. So here we go, stakeholders and data: we have two good hits straight off the bat. Yeah.

Jason:

This isn't a bite episode anymore, though; we're going to be here for a while. I've a list to go through. And number three, one of the things that I think of is the actual team, the teamwork, the skills that are brought together. We've talked about the need for diversity of skills in a team before. You can bring in any amount of bias or a lack of understanding if you just have a skills gap, maybe, in your team, or just one person working in isolation.

Jeremy:

Yeah, yeah, you've hit another one here. So yeah, the science: the team who have that scientific knowledge, or don't, maybe. You know, it's such a big subject. You can have a team of 15 people, and maybe you don't have coverage of, you know, Transformers in NLP, deep learning, neural networks. It requires a huge degree of expertise in a really broad area. So I'm going to roll into this as well not just the coverage of the scientific method, but also the ways of working with that scientific method. You know, are you a team that goes, 'I want to go down this rabbit hole, and I'm not going to come out until I've extracted every last percentage of goodness'? Or are you a team that is prepared to do maybe quite shallow investigation initially, and then deepen it as necessary? Because I think a lot of the reason some projects go south is when individuals maybe decide they're going to just go after one particular problem that's fascinating them, and not actually focus on the big picture. And the science becomes sort of almost more interesting than the business problem. And that can be a bit of an issue I've seen, certainly.

Jason:

Yeah, it can suck people in.

Jeremy:

Yeah, it can. Good stuff. This is going really well. This isn't gonna be too long. So we've got science, we've got data, we've got stakeholders. What do we think?

Jason:

Well, one bit we were talking about there made me think of deployment, like getting what you've built deployed in the business, and productisation. And one of those skills gaps could be the ability to actually get a model automated, whether it's versioned and tested, and actually implemented then for your user, whoever that might be: stakeholder, customer.

Jeremy:

Yep. Yeah. Yeah, absolutely, there are many issues around this. And this is probably not one, this is probably several reasons, but if you don't have good engineering capability within or adjacent to your team, or within your project if you're working in a cross-functional team, then it's going to be a real struggle. And, you know, data science teams that have good engineers within them are always more capable, I think, and are in a much stronger position just to demo what they've done to stakeholders quicker. And then if you get demos, you get feedback; you get feedback, you get improved product; you get all that virtuous circle stuff going on, don't you? Something I'd call out particularly in this, which is a bit of a curse of the data scientist sometimes, is that you're working on a problem, you're developing a model. It's a really nice model, but it's quite computationally intensive. And I've seen this several times, whereby the data scientist pulls out the solution after maybe half an hour or an hour's worth of processing from their model. And, well, you know, guess what happens: the stakeholder comes to them and says, 'What, an hour? I was hoping that was going to take two minutes.' Or, 'Well, I've only got 10 seconds for that result to come out for it to be useful.' You know, the engineering doesn't just cover being able to deploy it; it's actually got to be done in a timely way as well, which is really hard.

Jason:

I'm just going to run this notebook on my laptop in my locker overnight. And then the results come out.

Jeremy:

Exactly. And of course, you know, this is data science, right? It tends to be models which are consuming inordinate amounts of data. So, I mean, you know, there's a big scalability issue to address. It's not really one issue. It's probably two. Yeah.

Jason:

And which can be addressed. But yeah, it's not always immediately addressed. Yeah. So, any other guesses? How many of...

Jeremy:

You've got four. You're doing really well. So it's two to go and, you know, no false guesses, but I have to say no surprises here yet.

Jason:

No surprises. Yes. Maybe over-promising. And what I mean by that is sometimes you say, yeah, we'll build you, and this could be echoing the rabbit hole, we'll build you this amazing machine learning model. Turns out they just needed something quicker, like the demo that you just mentioned. You need a demo, you need a quick cycle. This is a deliverable. And something I've heard about, or people have said, is: if you aren't embarrassed by your first demo, then you've demoed too late. That's good. You over-promised something and you end up spending too long.

Jeremy:

We're just at that stage of the project at the moment. Yeah, we want to go back to the customer really quickly. And I'm thinking, this is so simple, this is embarrassing.

Jason:

Sounds like you're right at that point. Yeah.

Jeremy:

That's a really good point. I like that one. I haven't got over-promising on my list. I guess I probably would have rolled that into stakeholder, but it could easily be ways of working. Yeah. Okay. That's good. I like it. So we alluded to it a little bit, and I've got to give you a hint. We alluded to it a little bit with the scaling issue, because if a computation is needed in a particular time, it's usually needed for a reason, right?

Jason:

Oh, is this outlining the problem from the outset?

Jeremy:

Right, right. So this is: what is it feeding into? Right? What does the downstream look like? What is the decision that we're actually influencing? I saw a nice thread on LinkedIn a couple of days ago, where a guy was going, 'Oh, you know, so often people are using their data science teams to generate insight. And, you know, what really does that achieve if you're putting your insights into a dashboard?' And he said, 'Well, why don't I tell you exactly what happens: you produce some nice graphs, and they get put into a PowerPoint deck. And then a month later, it gets presented at a meeting.' It's like, well, you're already a month behind the curve when you do that, but probably two or three, given how long it took to run the tool and then get it across to somebody. So, you know, if you want impact from a data science team, you've got to design the forward process that takes that science and show which decision is going to be influenced by it. And therefore, of course, show the impact that then comes from that.

Jason:

I like that.

Jeremy:

That's a big one for me. I definitely preach that a lot.

Jason:

I was thinking of something similar when we were talking about stakeholder engagement. Along that roadmap that you were talking about to get impact, there'll also be a bunch of, like, gate points or decision points that may have a stakeholder who needs to be brought in to get their buy-in for the value to be driven. Yeah.

Jeremy:

Absolutely. So, I mean, there is overlap: the stakeholder engagement with the decision, with the impact. These, of course, are all critical things that play together, and none of these tend to sit in isolation. So the last one: we have five now, we've got one more to get. It's a slightly fiddly one. And it depends on the industry that you come from as to whether you care about this, or rather, everyone cares about it, but whether it's uppermost in your mind, I think.

Jason:

Interpretability?

Jeremy:

Ah, no... Yeah, actually, I might have to just give that to you. Yeah. Okay. Interpretability usually means that you want to know why. And if you want to know why the tool is producing the decision, typically there's a regulator sitting not a million miles away from you saying, 'You have to show me why you've made that decision,' maybe in a medical healthcare setting, or in an ethical governance setting, or in a statutory setting. You know, there are lots of industries, finance, for instance, where there is a huge regulation framework that has to be not just carefully curated and adhered to, but has to be shown to be adhered to. You've really got to make sure that you're compliant with your regulator, with your industry, even with your company policies, of course. So all of that has to be part of writing the tool. And, you know, any data scientist will know, if they've done this for any length of time, that you fast become very, very aware of, and then expert in, the regulatory frameworks that you're operating under.

Jason:

Yeah, that's really interesting.

Jeremy:

So there's one last challenge, which I think... For right now, I just want my prize. You get your prize. Okay, so it's not a challenge anymore; it's really just a last question. So those, I reckon, are the not-so-micro reasons for why data science projects fail. But there's something which I think ties all of these together, and the companies that succeed have it, and the companies that struggle, shall we say, are ones which maybe don't have this particular property. Do you want to hazard a guess as to what that might be?

Jason:

Oh, like, I wanna say, investment from the top.

Jeremy:

Yeah, pretty much. I mean, it's basically culture, I think, because you've got to have that cultural desire to embrace what can be, I think, a really radical departure from your usual ways of working, and also feel reasonably comfortable with quite abrupt change in the way that you do things. And that can be a big struggle for some companies. I think, certainly, I've both worked for and seen at a distance companies that have really struggled with that cultural standpoint, where getting change into the company and adopting this new technology, disruptive technology, and it is highly disruptive sometimes, can be a real struggle for them. I don't know if you've had any experiences like that.

Jason:

Yeah, that point you just said about disruption's pretty key. And I've heard people talk about needing to embed an allowance for curiosity, and the psychological safety that comes with being okay with not so much failing, but experimenting, and your experiments may not deliver. But that's why you ran the experiment to see what needs to change. And any aspect of all the points we've hit on could be what needs to change.

Jeremy:

I love that, an allowance for curiosity. I think that's a really nice phrase. It ties up very nicely with how a lot of good teams execute their projects, which is to say, they don't say, 'We will produce this model, and it will do this.' They say, 'Can we produce a model which actually impacts the company with 5% improvement in whatever metric?' And, you know, it's just as okay to come out with a no to that question as with a yes, because, you know, both outcomes are likely. Right. Yeah.

Jason:

Very good. That's been really enjoyable.

Jeremy:

Well, I guess I think it's safe to say that, you know, culturally, DataCafe has ticked all the right boxes. We're still going after a year, and we have a high allowance for curiosity on this show. So...

Jason:

Yes, very much so. Looking forward to what the future holds. Thanks for joining us today at the DataCafe. You can like and review this on iTunes or your preferred podcast provider. Or if you'd like to get in touch, you can email us at jason@datacafe.uk or jeremy@datacafe.uk, or on Twitter at @DataCafePodcast. We'd love to hear your suggestions for future episodes. Can I... I can shout Bingo?

Jeremy:

You can shout Bingo. Excellent.