DataCafé

[Bite] Data Science and the Scientific Method

DataCafé Season 1 Episode 15

The scientific method consists of systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses. But what does this mean in the context of Data Science, where a wealth of unstructured data and a variety of computational models can be used to deduce an insight and inform a stakeholder's decision?

In this bite episode we discuss the importance of the scientific method for data scientists. Data science is, after all, the application of scientific techniques and processes to large data sets to obtain impact in a given application area.  So we ask how the scientific method can be harnessed efficiently and effectively when there is so much uncertainty in the design and interpretation of an experiment or model.

Recording date: 30 April 2021

Intro music by Music 4 Video Library (Patreon supporter)

Thanks for joining us in the DataCafé. You can follow us on Twitter @DataCafePodcast, and feel free to contact us about anything you've heard here or anything you think would be an interesting topic in the future.

Jason:

Welcome to the DataCafé. I'm Jason.

Jeremy:

And I'm Jeremy. And today we're talking about the scientific method.

Jason:

So I thought we'd stop in for a bite today, Jeremy, because I've had it in my mind to talk about the scientific method and what that means for data science and for data scientists. So I'd like to pick your brain about it a bit.

Jeremy:

Okay, that's interesting, because I'm not sure I come from a traditional scientist's background. I'm a mathematician by training, and then I got into a sort of hybrid area of computer science. But you are the scientist, Jason. So I'm interested, you know, in almost rebounding the question and saying: well, what does the scientific method mean to you as a physicist, as an astrophysicist?

Jason:

Yeah, yeah, I guess so. The scientific method has come up a couple of times in our jobs, and I think we've even mentioned it in the cafe a couple of times. As we go through our scientific training, we set up experiments in school and college, and fundamentally we gather our data and make our observations to prove or disprove a hypothesis. That whole enterprise, I guess, falls under the realm of the scientific method. You can draw it as a pipeline, and there are all these pictures of it online, although I don't think it really flows as straightforwardly as that. But if I look at a picture and name the main boxes in the process flow: generally, or historically, you observe something in the world and it raises a question. That question is a topic of interest and may form a hypothesis, so you decide to run an experiment. To do that you need data, so the process gathers your data, or you may already have some data. When your experiment is done you report on the conclusions, and if there's something interesting that proves what you thought was true, you tell everybody about it. Then other people repeat it, and if enough people repeat it enough times, confidence is built that your hypothesis wasn't a fluke.

Jeremy:

So there's an interesting set of concepts there which I think form the essence of a scientific method. There's the observation-based aspect of it: you're taking measurements, often you're getting data, so that fits right into data science straight away. You're constructing a hypothesis around how a phenomenon or an object works, and you're testing that hypothesis, and there are lots of associated techniques that you might have to develop around that. And then, of course, I love the bit that you said at the end: the repeatability. How important it is that what you've done is written down in such a way, traditionally in the scientific journal of course, though not necessarily these days, that another group can take your data, take your algorithm or take your approach and go: yes, we were able to do that, and we got exactly the same results, within the margin of error, I guess.

Jason:

Yeah, exactly, they replicated it in some way. And there's a big push nowadays to make sure that everybody can replicate it. So we have versioning, let's say, so you can record exactly what version of your software got you this result, and it allows things to advance but still be recorded historically. But what's on my mind is that in data science, and we've talked about the complementary team before, you bring together a lot of different scientists, and they will fundamentally have used the scientific method in their field. But in data science we have this funny world where there's now a wealth of data being gathered in various ways that may not even have been set up for a specific purpose. You could have so much data from Twitter, let's say, everything that's going on on Twitter, and it gets really difficult then to mine it in the way you would set up for a hypothesis, to set up a construct where your data is clean, or representative, or the right sample. You know what I mean?
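
A rough sketch of what that versioning point might look like in practice. This is a minimal Python illustration, not something discussed in the episode: the file name, the metric, and the hypothesis label are hypothetical, and it assumes the analysis code lives in a git repository.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone

def record_experiment(result: dict, path: str = "experiment_log.jsonl") -> None:
    """Append a result together with the code version and environment
    that produced it, so the run can be replicated later."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()  # commit of the analysis code (assumes a git repo)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": commit,
        "python_version": sys.version,
        "result": result,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage: log the metric from one experimental run.
record_experiment({"hypothesis": "new ranking beats baseline", "auc": 0.81})
```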

Jeremy:

Yeah, well, sort of, but I'm going to throw it back to you and say: it's not like the sun was set up to be mined and analysed and investigated and picked apart, was it? These things are not there for our convenience. In that sense, it's quite realistic, I think.

Jason:

Yeah. And the world is messy, right?

Jeremy:

Yeah, right. Ask any data scientist and they'll tell you lots of stories about exactly how messy the world is.

Jason:

Yeah. And to put it in perspective: when we talk about sending satellites up to observe the sun, there's so much riding on that that we set it up right. We put a lot of investment into: what is the exact hypothesis? What is the exact experiment? What is the exact data? Because it's really expensive if you get any of that wrong. Whereas in the world of Twitter, the example I gave, there isn't any of that; it's just characters being spewed out by everybody all over the world using that platform. And I just wonder, is there something in the scientific method that we need to make sure is maintained when you take some off-the-shelf model, or build a piece of software that fundamentally has assumptions in it, fundamentally makes a hypothesis, and maybe has a business decision at the end of it? There needs to be a scientific rigour that comes with forming the hypothesis, and that's what the scientific method can allow.

Jeremy:

Yeah, for me it was always about the test and learn, right? I've got a phenomenon, well, it was rarely a phenomenon, I have to say; in computer science it tended to be: I've got a white box, or for my particular brand of computer science a black box, and I'm applying it to a particular data set or a particular environment, and I want to understand how my algorithm impacted and reacted to that data, that environment. So I might be trying to construct a ranked measure of which web page is really important, or something like that. And there are loads of parameters involved in this, and some are significant and some are not. So what I would tend to do is say: right, let's tweak a parameter and see what impact it has. And then there'll be a measure at the other end of the process which looks at the output and goes: this has had a dramatic impact, you know, I've seen a significant improvement because of that tweak in the parameter. That's the assumption. From a data science perspective this transfers really well, because you've got this idea of keeping the vast body of your setup the same, changing one thing and, as much as possible, leaving everything else the same, and seeing what the impact is of that one change. And ideally being able to infer, and this is a bit of an assumption, that the change you then saw was a direct result of the change that you made. What do you get from that? Well, you get the learning: you get to create a hypothesis about your environment, or about your algorithm, or about your problem domain and the decision that's being impacted. Which then can lead to more experiments, more tweaking, and testing and learning and all of that sort of thing. So I think what we're fast getting towards is that you start to get a nice, looping, iterative process.
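
A loose Python sketch of the one-parameter-at-a-time idea Jeremy describes. The pipeline, the parameter name, and the scores are all made up for illustration; the point is simply to hold everything else fixed, tweak one parameter, and compare the output metric against a baseline.

```python
import random

def run_pipeline(damping: float, seed: int = 0) -> float:
    """Stand-in for the full setup: returns one output metric
    (say, ranking quality) for a given parameter value."""
    random.seed(seed)
    # Toy response surface: quality peaks somewhere around damping = 0.85.
    return 0.70 + 0.2 * (1 - abs(damping - 0.85)) + random.gauss(0, 0.01)

# Hold everything else fixed and vary one parameter at a time.
baseline_score = run_pipeline(damping=0.85)
for damping in [0.5, 0.7, 0.85, 0.95]:
    score = run_pipeline(damping=damping)
    print(f"damping={damping:.2f}  score={score:.3f}  delta={score - baseline_score:+.3f}")
```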

Jason:

Yeah, which is really important. Because even when you draw the scientific method, you can draw it as a loop and say: you come out of one experiment with some conclusion, and that's fine, we can replicate it, and that's proven. But we also have 10 new questions, and we want to run 10 new experiments with 10 new versions of the data, and go through that test and learn that you're talking about.

Jeremy:

And there's the problem, or for me, maybe, for many years, the excitement, right? Which is: oh wow, I started with something I thought was really straightforward, and now I've got 10 possible data sets and 10 possible questions on each of those data sets, so I've got 100 investigations to carry out. And in a scientific environment, especially an academic environment, that's great, that's just grist to the mill. That's exactly what you're expecting: you can divvy it up amongst your PhD students or graduate students, and you can have a plan for how you're going to tackle investigating and prioritising those over the next three years. Right. But that doesn't quite work in a commercial setting and in data science.

Jason:

Yeah. And this is where I think we need, at the core of our efforts, the scientific method to be understood, so that we follow the correct process of forming our hypothesis and knowing that our experiment is valid and the data is valid for the setup that we have. Going into it, we also have to know at the outset the decision, the impact, the cost, where the line gets drawn. How much do we need to verify our assumptions? How much does our hypothesis need to be proven before the impact is realised, before that decision is taken? Because your business model, depending on the context, might hinge upon it: it might be something subtle, or it might be a massive change in your business operations.

Jeremy:

I think the change I noticed when I started working in industry, a few years ago now, was how important it was to get a really salient and to-the-point hypothesis out of the box, right? To get the exact concept that you were trying to test, trying to prove, which, if you did show it to be true or not true, would enable you to access the decision, access the impact, that your data science algorithms are there for. What you didn't want was a sequence of interesting investigations which, in the end, gave you answers to some questions, but questions that didn't really help you.

Jason:

Right. I think that's where our stakeholder management is so important. Because when we set out to run one of these scientific experiments on the data set that we have, to answer a question, it's exactly what you just said: we need to bring that back to the stakeholder's version of the question. What is it that they would actually say was their hypothesis, the one that then got interpreted in the experiment we set up or the model we may have built? I think that's where it's important for us as scientists to bring them into that way of thinking, and that's possibly where I see a disconnect, an opportunity that needs to be seized upon, you know.

Jeremy:

I see that in many projects, where I have, you know, well-meaning stakeholders approaching the team, sometimes at a critical point in the project, and they show this disconnect very straightforwardly by just asking: how long is that going to take? And as a data scientist, and as a scientist, that instantly causes flags to go up and alarms to ring, because I'm going: well, it's an investigation, I can't really give you a 'how long'. Which, of course, if they're used to some kind of Gantt-chart-based project plan, instantly causes a bit of a problem. So I think there are some really interesting and nice modifications that come from that uncertainty at the heart of this data science and scientific process, which is to say: I don't know. If I knew the answer, I wouldn't have to do all this fancy stuff, and you wouldn't be paying a bunch of scientists to do this investigation. The whole point is, we don't know the answer, and we're going to have to work out what that answer is before we can really progress this to the next stage of the project.

Jason:

Yeah. And when they ask how long it will take, it raises the question: were we asking the right question in the first place? Because what you should be asking is: do you have an answer yet? Or what is it that we need to do to get confidence about it? Is it that we need more data, that we need more resources, that we need a new hypothesis to be visited, maybe with a new set of stakeholders?

Jeremy:

Yeah, and sometimes you get lucky. Sometimes you can say: look, we can get you a certain quality of answer, and you tell us how long we've got, almost. So if you've got a couple of days or a week, I can get you something. Depending on the problem, if it's an investigation where you're slowly refining your answer, getting better and better, you can say: well, I can get it to this good, and I can probably even give you an approximation of how close 'this good' is to what you want, for your statistical matching algorithm or something. But on the other hand, if it's an investigation where you don't even know whether there is an answer, is there a signal that means I can identify a cancer in a medical scan or a medical image? I just don't know whether I can see it. So answering the question of how long it will take is much harder then.
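
The "you tell us how long we've got" case is essentially anytime computation: keep refining until the time budget runs out, then report the best answer so far along with a rough sense of how good it is. A toy Python sketch of that shape, not an example from the episode (estimating pi by random sampling stands in for the real problem):

```python
import random
import time

def estimate_pi(time_budget_s: float) -> tuple[float, float]:
    """Refine the estimate until the time budget runs out, then return
    the current answer and a rough error bound."""
    inside = total = 0
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        x, y = random.random(), random.random()
        inside += (x * x + y * y) <= 1.0
        total += 1
    estimate = 4 * inside / total
    rough_error = 4 / total ** 0.5  # conservative bound; shrinks with more samples
    return estimate, rough_error

# More time buys a tighter answer; either way you get something usable.
print(estimate_pi(0.1))
print(estimate_pi(1.0))
```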

Jason:

Yeah. So I think at the core of this we need to make sure that scientists can adhere to the scientific method, but there is a fundamental outline of what the question is that needs to be agreed up front, and we need to compartmentalise how much time is allowed for pure research, or forming the hypothesis, or gathering the data. There are frameworks to do that, and we can have a whole discussion around that another time. But it's that iterative process.

Jeremy:

I know there are companies who do purely investigative epics at the beginning of their projects, where they say: this epic, it's a two-week sprint, or four weeks of sprints, whatever, and we're just going to be doing investigation. That's definitely one way of doing it. But for me, the other thing, which I didn't mention earlier, is that as a team, when you're setting these hypotheses, you should absolutely carry those hypotheses through to the way that you set yourselves tasks. I like my hypotheses to be questions, because I like them to have a yes or no answer, and for either of those answers to be a valid possibility. Whereas in traditional project management, typically you'll have a block of time associated with putting the roof on the house, and if the answer is 'well, we couldn't', that's not really acceptable. But in data science it's absolutely fine to ask: can we find a nice customer preference metric which tells me what film an individual is going to enjoy watching? And the answer may be no, not in a way that's useful to you right now. In which case you need to re-form the question, you need to think about another way of presenting the problem. There are lots of ways of rebounding that and going: okay, well, we still need to make progress, but maybe we make that progress down another arm.
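
To make the "hypotheses as yes/no questions" idea concrete, here is a minimal Python sketch. The scores, the lift threshold, and the crude confidence check are illustrative assumptions rather than a method from the episode; a real analysis would use a proper statistical test.

```python
from math import sqrt
from statistics import mean, variance

def hypothesis_is_supported(baseline: list[float], candidate: list[float],
                            min_lift: float = 0.02) -> bool:
    """Question: does the candidate beat the baseline by at least min_lift?
    'No' is a perfectly valid, useful answer to this task."""
    lift = mean(candidate) - mean(baseline)
    # Crude standard error of the difference in means.
    se = sqrt(variance(baseline) / len(baseline) + variance(candidate) / len(candidate))
    return lift - 2 * se > min_lift

# Hypothetical offline evaluation scores for two recommenders.
baseline_scores = [0.61, 0.59, 0.64, 0.60, 0.62]
candidate_scores = [0.66, 0.65, 0.69, 0.64, 0.67]
answer = hypothesis_is_supported(baseline_scores, candidate_scores)
print("Yes" if answer else "No - rework the question or the approach")
```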

Jason:

I think, to summarise, just off the back of your analogy there: it's a case of we're putting a roof on a house, but this might be a house we've never built before, with a roof in a format we've never used before. So our old timelines don't hold.

Jeremy:

Using materials that no one's ever used before. Exactly. Exactly.

Jason:

Thanks for chatting this out with me, Jeremy.

Jeremy:

No worries, Jason, that was fun!

Jason:

Thanks for joining us today at the DataCafé. You can like and review this on iTunes or your preferred podcast provider. Or, if you'd like to get in touch, you can email us, Jason at datacafe.uk or Jeremy at datacafe.uk, or find us on Twitter at datacafepodcast. We'd love to hear your suggestions for future episodes.