DataCafé

Bayesian Inference: The Foundation of Data Science

March 23, 2021 DataCafé Season 1 Episode 12

In this episode we talk about all things Bayesian. What is Bayesian inference and why is it the cornerstone of Data Science?

Bayesian statistics embodies the Data Scientist and their role in the data modelling process. A Data Scientist starts with an idea of how to capture a particular phenomenon in a mathematical model - maybe derived from talking to experts in the company. This represents the prior belief about the model. Then the model consumes data around the problem - historical data, real-time data, it doesn't matter. This data is used to update the model and the result is called the posterior.

Why is this Data Science? Because models that react to data and refine their representation of the world in response to the data they see are what the Data Scientist is all about.

We talk with Dr Joseph Walmswell, Principal Data Scientist at life sciences company Abcam, about his experience with Bayesian modelling.

Further Reading

Some links above may require payment or login. We are not endorsing them or receiving any payment for mentioning them. They are provided as is. Often free versions of papers are available and we would encourage you to investigate.

Recording date: 16 March 2021
Interview date: 26 February 2021




Thanks for joining us in the DataCafé. You can follow us on twitter @DataCafePodcast and feel free to contact us about anything you've heard here or think would be an interesting topic in the future.

Jason:

Hello, and welcome to the DataCafe. I'm Jason.

Jeremy:

And I'm Jeremy. And today we're talking about Bayesian inference.

Jason:

Bayesian inference, named after Thomas Bayes. I don't know if I mentioned it to you, Jeremy, but were you with us when we visited his grave? In London?

Jeremy:

Yes, yes. In a local London cemetery.

Jason:

Yes, Bunhill Fields.

Jeremy:

That was a real point of pilgrimage, almost, for any data scientist.

Jason:

Yeah, it's an amazing cemetery actually just to visit as a tourist. And he is the person who came up with Bayesian statistics as a really cool area of statistical inference. So what is Bayesian inference?

Jeremy:

Bayesian inference is, I think, one of the go-to approaches as a data scientist. And it really reflects the ethos, almost the philosophy, of data science in a very simple and easily understandable theorem and approach. So basically, the inference really starts with a model or hypothesis about a particular data set, and then allows you to update that model as more data comes in. So you've got this lovely scenario, it's almost well beyond its time, of almost a streaming data set, where you've got streaming data coming into your model. As the data comes in, you take each batch or each data point, and you change your model in response to that data and update your model to be hopefully more realistic, more relevant to the data that you're actually seeing. So it's this whole concept of a model and updating in response to the data you're seeing. So really, super relevant to data science, I think.

Jason:

And crucially, it depends on the probabilistic modelling that we're bringing in here. So what's the probability of something happening, let's say, and if I have another data point that indicates it's likely to happen, I'm going to come up with a higher probability that, yeah, it's more likely to happen, because I've now learned an additional piece of information that makes me change my mind, or, if it's a model, makes the model head towards the actual truth.

Jeremy:

Yeah, all these models that we've been talking about are statistical, they're formed around probabilities. And as a result, the probability of seeing a particular outcome or output from your model is just that, you know, it's between zero and one. It gives you a level of confidence, maybe, that you're seeing something that is very likely or very unlikely to happen. Again, it typifies the data scientist experience, which is that these things are rarely given as true and false, on/off states; they are more often than not outputs that are probabilistic, that, you know, have a measure of uncertainty about them.

Jason:

Yeah, as every model does, it's never going to be a perfect representation of what can happen in the world. But you want to have a high confidence in the model, and you want to have an answer that gives you a high probability of something happening, and then you can react to that. And I guess if your model is flipping a coin, it kind of doesn't matter having a model at all: what difference does it make if your coin has come up heads or tails? Pick an outcome.

Jeremy:

But your model there might be that it's unbiased initially. So it's equal and therefore really simple. But, you know, if the data starts to come in, we've all had situations where you start flipping a coin and you go, hang on a minute, I've had 10 heads in a row. What's up with this coin? Maybe it's not a fair coin. Maybe it's a biased coin.
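
For anyone who wants to see that update written down, here's a minimal sketch using a Beta prior over the coin's bias, the standard conjugate set-up for coin flips. The flat Beta(1, 1) prior and the run of ten heads are our illustrative choices, not anything from the episode.

```python
# Prior belief about P(heads): Beta(1, 1) is flat, i.e. "fair until the data says otherwise".
alpha, beta = 1.0, 1.0

# Ten flips come in, all heads: the "hang on a minute" scenario.
flips = [1] * 10  # 1 = heads, 0 = tails

for flip in flips:
    # Conjugate Beta-Bernoulli update: each head bumps alpha, each tail bumps beta.
    alpha += flip
    beta += 1 - flip

posterior_mean = alpha / (alpha + beta)
print(f"Posterior mean for P(heads): {posterior_mean:.2f}")  # about 0.92, well away from 0.5
```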

Jason:

Yeah. And this is where it gets really confusing, actually. Because at any one instance of flipping the coin, if it's not a biased coin, then your chance of picking heads or tails doesn't change, right? You know, just because you've had nine out of 10 heads, and you're going to flip it that 10th time, you don't have a higher likelihood of it being heads if it's an unbiased coin. But it's really confusing, because what I learned about Bayes when I first picked up a book on probability was the Monty Hall problem. If people have heard of it, it's a really kind of fun anecdote about Bayes, and how probabilities can affect your decision making. Have you heard of it?

Jeremy:

It's all goats and cars to me?

Jason:

Yeah. I think the premise is you're in a game show and there's three doors presented to you, and behind one of the doors is the winning car as the prize, right, and behind the other two doors is a goat! You're going to be travelling home on the goat. Okay. They ask you to pick a door and you pick a door; they don't open the door. But they say, hold up, we're going to open one of the other doors, okay, and show you what's behind it. Let's say you pick door number one, and they open door number two: it turns out that would have been a goat. Now, do you want to stick with your original choice of door number one, or change your mind and pick door number three?

Jeremy:

Okay. And everyone always says, you shouldn't change, you should just, why would it matter? I think that would be the popular decision in this, wouldn't it?

Jason:

Why would it matter? You just now have two doors instead of three, so the attitude is that it's 50:50 now, right, instead of three chances. But no, it's really interesting, because you still have the information that you had when there were three doors. But because you picked one, and they've now shown you one of the others was a goat, the probability of the other door being a car is actually now two thirds instead of one third.

Jeremy:

So you've had to update your model, basically, in the presence of a data point, which has come in, which was one of the doors being opened, exactly, and shown to be a goat, right?

Jason:

Yeah. And it's really counterintuitive. But when you work it out, if you draw situations and scenarios on paper, you can see why that's the case. But it's really telling, and just as a way to explain how important retaining your new information is, when you go forward with your next decision.
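
If the two-thirds claim still feels slippery, a quick simulation makes it tangible. This is an illustrative sketch we've added, not code from the episode.

```python
import random

def play(switch: bool) -> bool:
    """Play one round of Monty Hall; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

n = 100_000
print("Stick win rate: ", sum(play(switch=False) for _ in range(n)) / n)  # about 1/3
print("Switch win rate:", sum(play(switch=True) for _ in range(n)) / n)   # about 2/3
```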

Jeremy:

Yeah, that's really interesting. I think it feeds into a lot of what we've talked about previously, which is that your data should be used for decisions, it should be used for actual outcomes, to drive, you know, an event, an impact, which you want to have as a result of the data science. So in that sense, it's really, really helpful.

Jason:

Yeah, and the mathematics works, as we see. So where does this apply in the world of data science? One of the examples we talked about was spam email detection.

Jeremy:

Yeah, there's a lot of companies doing this sort of streamed work, really. I mean, email you can think of as a stream into an organisation. And it turns out that there's a lot of nefarious actors out there who are trying to get people to take, you know, maybe bad decisions or poor decisions based on the email that they're getting. I mean, this is no surprise to anybody who gets a tonne of email, like you and I do. You know, it used to be financial schemes or other things, but it's become more nuanced. There's different categories of malicious email. There's things called spear phishing attacks, where someone pretends to be your boss and says, oh, I'm locked out of the office, and I can't get to my computer, would you mind sending me that key report, you know, that sensitive piece of data that I've asked for, the employment record of a colleague or something, would you mind sending that to me, I'd be really grateful. And if they fake the email address in a sufficiently clever way, sometimes even spoof it to the email server, then it can look really, really authentic. And they can get quite a lot of information out of some poor, unsuspecting employee in the company. And then, you know, obviously bad things happen as a result. So the outcome there is, can we use our Bayesian tools to update our belief about a given email and say, well, I've seen these features in this email, and as a result, I now believe that this may be a spear phishing attack, or it may be some other type of spam that I'm trying to prevent. And maybe I ask my users, you know, occasionally, can you just tell me whether that was a good email or a bad email? Can you tell me whether it's spam or not? And then I'm starting to get data in the form of new emails, I'm starting to get corroborating classification from my users, and between those datasets I'm starting to update my belief model about what an actual spam email looks like, in the context of trying to prevent this sort of thing.
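
In its simplest form, the updating Jeremy describes here is a naive Bayes filter: each word in the email nudges a prior belief about spam towards a posterior. A toy sketch, where the prior and the per-word probabilities are made-up numbers purely for illustration:

```python
# Toy naive Bayes update for a single email.
prior_spam = 0.2  # assumed prior: 20% of incoming mail is spam
word_probs = {
    # word: (P(word | spam), P(word | not spam)) -- invented for illustration
    "urgent":  (0.30, 0.02),
    "report":  (0.20, 0.10),
    "invoice": (0.25, 0.05),
}

email_words = ["urgent", "invoice"]

p_spam, p_ham = prior_spam, 1 - prior_spam
for word in email_words:
    p_word_spam, p_word_ham = word_probs[word]
    p_spam *= p_word_spam
    p_ham *= p_word_ham

posterior_spam = p_spam / (p_spam + p_ham)  # normalise by the marginal likelihood
print(f"P(spam | words seen) = {posterior_spam:.2f}")  # about 0.95 with these numbers
```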

Jason:

So every time I hit report junk, I'm actually labelling a data point.

Jeremy:

Yeah,

Jason:

Yeah,

Jeremy:

You're doing everyone a really good service there, Jason.

Jason:

Yeah. And we've seen this as well in medical testing. So what's really kind of innocuous about the spam email example is that it doesn't really matter if a spam gets through to my inbox: I probably recognise it as a human as something that's a phishing attempt, and the likelihood of me sending somebody my credit card is low if I'm a cautious user of my emails. But in the medical realm, we kind of have the same setup. You can run a test for a medical condition, but you can also then see whether you have what are called false positives or false negatives in your results. And there can be a more important side effect there.

Jeremy:

Absolutely. And I think it introduces a number of really useful concepts from, again, the data science, statistical perspective and way of thinking about problems. I mean, you know, we're all getting very familiar with types of COVID tests at the moment. And the implications, obviously, of having a positive or a negative test in any of these regimes are quite serious; they have quite an impact on an individual. So, you know, it really does matter. In the UK at the moment, there's a programme to roll out a type of COVID test called the lateral flow test to all of the secondary schools, which are then going to be putting these tests in place on, you know, a weekly or bi-weekly basis. And this is a test which has a good true positive rate, but it also has quite a high false positive rate, especially if you're asymptomatic, apparently.

Jason:

So I wanted to set that up a bit, actually, because what you're about to get into is always really potentially confusing. When we think of a medical test, there's supposed to be only two answers, which is either a test positive or negative, but really there's at least four answers, four main answers, because it depends also on me, whether I have the condition or not. So you are testing me as somebody who doesn't have COVID, and I can either have a positive or negative result. But then if you test somebody who does have COVID, they can also have either a positive or negative result, so you get your different combinations of possible results. And that's where the rates come in, with regards to the specificity or sensitivity of your test; I always find these confusing in my classification matrix. But if we have somebody who has COVID and has a positive test result, then that's a true positive, and that's high sensitivity, hopefully. Yep. But you're talking about this test that maybe doesn't have a high sensitivity.
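
For anyone who, like us, has to rebuild these in their head every time, here is a tiny sketch of the four cells and the two rates; the counts are invented purely to pin down the definitions.

```python
# Invented counts for a hypothetical test of 1,100 people.
tp = 75   # has COVID, tests positive   (true positive)
fn = 25   # has COVID, tests negative   (false negative)
tn = 990  # no COVID,  tests negative   (true negative)
fp = 10   # no COVID,  tests positive   (false positive)

sensitivity = tp / (tp + fn)  # true positive rate: P(test positive | has COVID)
specificity = tn / (tn + fp)  # true negative rate: P(test negative | no COVID)

print(f"Sensitivity: {sensitivity:.2f}")  # 0.75 with these invented counts
print(f"Specificity: {specificity:.2f}")  # 0.99
```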

Jeremy:

Yeah, I think I read on the government website it's something like 75% sensitive in the case where you have just anybody with COVID. So you don't know if they've got a very high propensity to infect, a very high viral load, or not. The good news is it's much better if people have a high viral load: if they're infecting loads of other people, then you find out very quickly, which you sort of hope...

Jason:

It's easier to detect.

Jeremy:

Yeah, right. Right, exactly. And so you're absolutely right, you've come across this concept of the rather aptly named confusion matrix very quickly, which is these four states that you just outlined there, you know, true positives, false positives and so on. And you have to sort of take a step back and go, hang on a minute, which one's that, every time.

Jason:

Right now on the podcast, even after practising, it's still doing my head in.

Jeremy:

Yeah, me too. So for these particular tests, these lateral flow tests, you've got this true positive rate. But obviously, if you hit a false positive, that means that someone's going to be potentially isolating for, you know, 10 days, two weeks, something like that. That's quite a high life impact for what is an error in the test, I suppose you could say. The recommendation in some countries, and in some situations, is that you take the lateral flow test, which is nice and easy, you can do it at home, it comes up with an answer in 20 minutes, half an hour or something. And then if you come up with a positive, you should then go on to one of the more precise tests, something with a higher sensitivity. So that would be the PCR test; I think that has a better outcome, but it is more expensive, and it takes longer to get the result. So what you find in these testing situations is there's a lot of context around the test that you have to take into account, and you almost have to become an epidemiological expert in the test to really get to the bottom of this. And in the context of a school running this sequence of tests, having all of these test results come in, because you're supposed to report whatever the result is, true or false, once you've taken them, you want to know, right, I've got a model for my current belief that we have a COVID outbreak in my school. That would be really important to know, given this stream of data that's constantly being updated. And that's where I think this Bayesian approach and mindset can be super helpful.

Jason:

Yeah, and more than that, with Bayesian thinking we could have more information to bring in about who it is you're testing. I was listening to a report recently saying that there had been some links to obesity, but they don't want that to overtake the really important one, which is age. If you're in certain age brackets, that's where you're more likely to have a bad reaction to COVID.

Jeremy:

Yeah, so they've noticed these correspondences. That's very much a probabilistic association: if you have COVID and you have one of these features, if you like, if you are older or you are...

Jason:

Age isn't a condition, right, I was gonna say.

Jeremy:

Not a lot you can do about that, as far as I know. Then, you know, you have a higher probability that it turns out not so good for you. I think it would be quite useful, actually, to talk about the actors in this formula, in this way of constructing the Bayesian world, in the context, maybe, of your COVID testing. So I mean, you've got the notion then, and this is why it's so important, I think, to data science, you have the notion of: I have a model, I have a belief about whether I have this infection, personally or in my school, and I have lots of data that is coming in on a daily basis that I'm using to inform it. The outcome of that process is what we're interested in: what is my update to my current understanding of whether I have COVID. So that's called my posterior distribution. That's my posterior. And that's made up of a few other actors. There's the likelihood, which would be the probability of having that set of data, that set of tests, given the belief that I have in whether I have COVID, or whether my school has COVID. So that's the likelihood element in this process. And then you've got your prior, which is where you were before you started this whole thing. It's like, well, do I believe I have COVID or not? You know, maybe there are some symptoms going on, so you have an internal suspicion that you have COVID, or you don't have COVID. So that's your prior, and that will get fed into this machine to give you your output. And then finally you have: what's the probability that I was going to get those test results anyway, just randomly, or, you know, looking at the whole population, I suppose, what's my estimate that I would have that particular sequence of test results at all, given the nature of the test. And that's where you have to understand the test so carefully, and that's your marginal likelihood. So there's lots of these elements that go into it, and then you stick them all together, and you end up with what I said at the beginning, which is this posterior probability that you have COVID, based on the data. And the power is, it allows you to update that incrementally, almost on a streaming basis: as you get new data, you can update your belief about that outcome.
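
Putting those actors together for a single test gives Bayes' theorem: the posterior is the likelihood times the prior, divided by the marginal likelihood. Here is a minimal sketch of that calculation, purely for illustration; the prior and the test characteristics are assumed numbers, not figures quoted in the episode.

```python
def posterior_covid(prior, sensitivity, specificity, test_positive=True):
    """P(have COVID | test result), via Bayes' theorem."""
    if test_positive:
        likelihood = sensitivity                                           # P(+ | COVID)
        marginal = sensitivity * prior + (1 - specificity) * (1 - prior)   # P(+)
    else:
        likelihood = 1 - sensitivity                                       # P(- | COVID)
        marginal = (1 - sensitivity) * prior + specificity * (1 - prior)   # P(-)
    return likelihood * prior / marginal

# Assumed numbers: a 2% prior suspicion, 75% sensitivity, 99.7% specificity.
print(posterior_covid(prior=0.02, sensitivity=0.75, specificity=0.997))  # roughly 0.84
```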

Jason:

And right at the individual level: before I ever heard of COVID, I would never have had reason to think that I have it. That's my prior. And then I start to learn what COVID is. And suddenly, maybe I've got a tickle in my throat or a bit of a cough, or I'm losing my sense of smell. And I want data, so now I go and get a test. And that test will tell me whether I do or don't have COVID, but it's not actually 100% guaranteed that the test is accurate. So I update my posterior: my belief now is, oh, I very well might have COVID, because I've got a positive test result from this quick and easy test that's good to roll out. But I'm going to get a second one, because again, more data means I can update my understanding of whether I do or don't have COVID, and be even more convinced that the result is true. Just at an individual level.

Jeremy:

Yeah, absolutely. That's quite a nice example, because even those suspicions that you had, even those expressions of symptoms, count as data in this model. You know, you start off from the point of view of, no, I definitely don't have it. And then suddenly you get a particular cough or you lose your sense of smell. And that's data, maybe not data that you write down or tell anyone about, but it's still data that you're aware of. And so that starts to update your belief model. And then you add in the test as well, and that's more data, and that's more persuasive, maybe.
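
That chaining of updates, where each posterior becomes the prior for the next piece of evidence, is easy to sketch. The numbers below are the same assumed test characteristics as in the earlier sketch, not real figures.

```python
def update(prior, sensitivity, specificity):
    """One Bayesian update for a positive result from a hypothetical test."""
    marginal = sensitivity * prior + (1 - specificity) * (1 - prior)  # P(positive)
    return sensitivity * prior / marginal

belief = 0.02  # assumed starting prior: a vague suspicion, nothing more
for test_number in (1, 2):
    belief = update(belief, sensitivity=0.75, specificity=0.997)
    print(f"Belief after positive test {test_number}: {belief:.3f}")  # ~0.84, then ~0.999
```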

Jason:

Nice. And you had an interview with somebody who knows a lot about this, especially from an algorithmic point of view: Dr Joseph Walmswell, Principal Data Scientist at Abcam. Let's hear what he told us.

Jeremy:

I'm joined in the DataCafé today by Dr Joseph Walmswell, who's Principal Data Scientist at life sciences company Abcam. Welcome, Joseph.

Joseph:

Thank you, Jeremy.

Jeremy:

And we just had a really interesting talk from you today, around the area of Bayesian inference and all that goes with that. And I really wanted to start with how, in your view, Bayesian inference and Bayes formula really relate to data science from your experience.

Joseph:

Well, I suppose I'd start by saying it doesn't relate enough. So there's quite a divide still between people with, say, a mathematical statistics background and practising data scientists. That's understandable, given that there are a lot of data science methods that aren't really mathematical at all, like random forests, which are very effective. And if you've got a toolbox of very effective methods, why would you want to, as you might see it, unnecessarily constrain yourself by constructing parameterised models? That's fair enough. But you are then faced with a serious difficulty when, firstly, you need to construct a parameterised model because the parameters are important, rather than just the ability to make a prediction. An example of this might be if you were, say, an ecommerce company, and you're trying to understand what drew people to your website and then what actions caused people to buy things. Being able to predict something is one thing, but you want to understand the causal structure of what's going on underneath. So there a model, rather than a black box, can be useful. Then the other point where I think data science can learn something from Bayesian statistics is in understanding that knowledge is effectively probabilistic. So you might set up your neural net, for example, to classify an image, and then out comes your result: it's a cat or it's a dog. But that's not really what your black box is capable of doing; it will think with some probability that it's a dog or a cat. And understanding that, and then understanding what sort of probability distribution is really going on, is important. So to be more specific, say you're trying to do forecasting with a neural net. If you're forecasting something with fairly big numbers over a fixed order of magnitude, then the standard neural net approach of trying to optimise for mean squared error will probably work quite well. But if you're trying to forecast, say, small numbers, so it might be sales of a product line that doesn't move very quickly, that might sell, say, one unit this week and zero units next week and, say, three the week after, then using mean squared error naively in your neural net is probably going to give you worse results than if you thought, well, this is effectively a Poisson process: my neural net behind the scenes is going to take in all the information I know and then come up with some best guess at what's going on. But then there's the way I relate that best guess to the different outcomes. So if I wanted, for example, to calculate how likely it is that I will sell, say, one or two units, I should make my calculation on the assumption that the best guess is my Poisson mean, and then I can use the Poisson distribution to do that calculation. So I suppose it boils down to this: data science often does need Bayesian methods without realising it.
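
As a minimal sketch of that last point, here is what the Poisson step might look like, assuming a hypothetical forecasting model whose best guess for a slow-moving product is a mean of 1.2 units per week (the number is invented for illustration).

```python
from scipy.stats import poisson

# Assumption for illustration: the forecaster's best guess is a Poisson mean of 1.2 units per week.
lam = 1.2

# Probability of selling exactly one or exactly two units, given that mean.
p_one_or_two = poisson.pmf(1, lam) + poisson.pmf(2, lam)
print(f"P(sell 1 or 2 units) = {p_one_or_two:.2f}")
```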

Jeremy:

Yes. I mean, the thing that strikes me about Bayes, just as a philosophy as much as a tool, at least initially, is that you've got this idea at the core that the model I'm creating is informed by the data that I have access to. And as data scientists, you know, we like to think we can create beautifully general models, but really we only have access to the data that we're given. And that's really all we have to go on until we get more data, until we discover more knowledge about the system. So, you know, in the neural net example, the neural net's only really as good as the training data you've historically given it, to be able to tune those parameters and get it into a trained state. But if you then expose it to more training data, it might get better, it might become overfitted, it would change state, but it's all dependent on that data. And I like Bayes from the perspective of it being dependent on the data that's explicitly in there, right at the heart of it. Is that something you've taken advantage of?

Joseph:

Yes, yes, I agree with that. And then I'd add also that Bayesian reasoning is, well, human reasoning. It is how our brains actually work. We have a prior belief about a situation, we get some data, we update it, and we have a new belief based on combining the two. And here's a nice example of how this practical Bayesian reasoning intersected with what appeared, at least, to be a very effective black-box neural net algorithm, right back in the 80s, when the Department of Defense in the United States funded a project at a particular American university to build a neural net that would take images of East German forests and then predict whether or not there was a Warsaw Pact tank column in them. And the idea is that this algorithm could be loaded into an automatic camera mounted on a NATO tank that could be scanning the surroundings all the time, and would identify various possible hazards to the tank commander. And they were very happy to begin with when this achieved 100% accuracy. And they did it all very well: they had a specific test set set aside, and they were getting 100% accuracy on the test set. And the Bayesian brain would probably say something like, well, my prior belief about the effectiveness of this classifier is such that 100% accuracy is just not highly plausible at all, I just don't believe it; my prior probability is that there is a certain possibility that the algorithm is doing something wrong somewhere, and I don't know what it is, and that's why it's being accurate. And it turned out that what happened was the people who provided the training data had photographed the German forests without tanks on a sunny day, and the German forests with tanks on a cloudy day. So all the neural net was really doing was telling you...

Jeremy:

Brilliant! Yes, of course. So just spotting, spotting the light conditions in the photo and going, there must be a tank or there isn't a tank.

Joseph:

Yeah.

Jeremy:

I like that. And I think you alluded to it there: you've got this idea in Bayes of there being a prior model, a prior sort of belief about the world, about the data set, about the problem you're considering, which is informed by the data set. And then, you know, Bayes nicely provides you this update mechanism for saying, right, well, I had that prior model, that was my belief, this is what I believe to be true: I believe there was a tank in that forest. But now I'm being given more data, and now I can update that and say, well, there's only a tank when the sun's out, or there's only a tank when I can see metal glinting, maybe, in the photograph.

Joseph:

Yes, that is the great charm of Bayesian inference: your state of knowledge is captured by your posterior. Once you have that, you can then disregard how you came to that state of knowledge. So you don't need to store all your previous data points; when you rerun your model, you just store your posterior. Of course, in practice, that's easier said than done if your posterior is in the form of a bunch of samples from Monte Carlo methods, rather than a function. If that's the case, then starting the inferential process using that as a prior is not easy; you'd have to put some sort of kernel density estimator in there, and it's possible that you might be better off running it on the entire previous data set. There's a lot of interesting work there about filtering samples and trying to approximate a prior based on a sampled posterior.
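
As a rough sketch of the kernel density idea Joseph mentions, assuming you already had a bag of posterior samples for one parameter (the samples below are simulated stand-ins, not output from a real model):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Pretend these are MCMC samples of a single parameter from a previous run
# (simulated here; in reality they would come out of your sampler).
rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=5000)

# Fit a kernel density estimate so the old posterior can stand in as a smooth prior next time.
approx_prior = gaussian_kde(posterior_samples)

# Evaluate the approximate prior density at a few candidate parameter values.
print(approx_prior([1.5, 2.0, 2.5]))
```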

Jeremy:

I think your reference there to essentially how human beings really update their beliefs, based on their observations, their sensing of their environment, is a really nice analogy. And probably why, I guess, Bayes has been such a popular go-to for scientists, and especially now data scientists, over the last two or three decades.

Joseph:

It is, yes. And I think what we'll see is more of a melding of the traditional statistician with the data scientist. So there are people who run Bayesian neural nets, for example, where you have not merely an output function that's probabilistic, but the parameters themselves that make up the neural net conceived of probabilistically, rather than just optimised for the output. Amazon have a very good forecasting package that runs on their SageMaker platform where you can set the output probability distribution to a great variety of things. So if you're dealing with count data, you could use the Poisson distribution; if you're dealing with overdispersed count data, you could use the negative binomial.

Jeremy:

So where do you see this going? You mentioned a couple of techniques earlier around kernel estimation. What's the next step for someone really wanting to get into Bayesian inference and use it in an exciting way in their work?

Joseph:

Well, one thing we haven't mentioned at all is the problem of model choice. So Bayes' theorem applied to parameter estimation comes with the notion that the chosen model, it might be linear regression, is your given for everything: the likelihood is the probability of the data given the parameters and the model, and the prior is the probability of the parameters given the model. And even for something like linear regression, you might have the choice between fitting with a straight line or fitting with a quadratic, and the quadratic would probably give you a better fit under most circumstances, because you've got one more free parameter to play with. But that doesn't necessarily mean it's the best model. Now, this is where data science can help. Because the formal Bayesian approach, as you're well aware, Jeremy, is that you calculate the model evidence for the two different situations: you calculate the probability of the data given the model by integrating the posterior, and then you use the fact that the probability of the data given the model is proportional to the probability of the model given the data. Now, integrating the posterior is even harder than sampling from it, and there are some interesting ways to do that. So you could take an end run around the problem by modifying your Monte Carlo sampling process to jump between different models, for example between different parameterisations. And if the two parameterisations are not so different that the likelihood is very different, then a jump will have some probability of being accepted. I did this for my PhD at one point; it was about looking at star clusters and deciding how many different age populations were there. So it was a question of the right model as well as the right parameters: how many clusters, how many populations, as well as how old they were. So it's quite interesting, but tricky, very tricky to tune properly. Whereas the data scientist would say at this point that you're just overcomplicating it: you just have your testing data set, you measure your model accuracy based on that, and then you pick your best parameterisation based on that. And most of the time, I'd agree with this.
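
For the pragmatic route Joseph describes at the end, here is a minimal sketch of picking between a straight line and a quadratic on held-out error, with simulated data and invented numbers throughout:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a genuinely linear process, plus noise.
x = np.linspace(0, 10, 60)
y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)

# Hold a third of the points back as a test set, as a data scientist might.
idx = rng.permutation(x.size)
train, test = idx[:40], idx[40:]

for degree in (1, 2):
    coeffs = np.polyfit(x[train], y[train], degree)
    mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree}: held-out MSE = {mse:.2f}")
```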

Jeremy:

So if in doubt, keep it simple. Yeah. Seems like a nice, nice mantra to attack most data science problems. Excellent. Joseph, thank you very much for joining us in the DataCafe today. That was really exciting.

Joseph:

Thank you, Jeremy.

Jason:

Joseph said something really cool in that interview about how Bayesian reasoning is human reasoning. And it really stood out to me, because it's actually what we were talking about earlier on: I bring in my own data. And as a human, I respond to my environment by gathering data; my senses are what's gathering the data. But then we're trying to translate that way of reasoning into a theorem, into logic, into algorithms, and then apply it and test it in the real world for these examples, like the tank that he talked about.

Jeremy:

Yeah, as I said, it's sort of a theorem that almost comes out of the future, from Thomas Bayes' perspective, because it speaks of having this constant stream of data that you're able to process, and then you update your model, your algorithm, your decision, based on the data that you're seeing. And that's exactly the sort of architecture that you get in modern machine learning models, which are maybe being fed by updating datasets, or user clicks from a website, or whatever it is that's feeding your feature set. So it chimes with the human process of learning and adapting, from a child right through to an adult. And it also, I think, works from the perspective of modern-day modelling and data science tooling, almost.

Jason:

Yeah, that's actually right: when we were growing up, a child plays for the experience of interacting with the environment. And when we build our models, we talk about running them in a sandbox, or playing in the sandbox, in a scientific way: what's the use of that for the model? And what data do we need to add? Or how do we tweak or fine-tune it?

Jeremy:

So one of the things that occurred to me when talking with Joseph was how he talks about the problem he had in selecting the model, from the perspective of a Bayesian way of approaching a problem. I thought that was quite a nice piece of honesty, almost, from Joseph, because what you have when you're constructing a Bayesian model, as a data scientist, is this decision to take. And it's not just a decision around a set of parameters. It's a decision about what model should I apply in the first place? Should it be a Poisson model or a binomial model? Or should it be normally distributed, or gamma distributed, or something like that? You see, you've got all of these many, many possible models to choose from, whereas what he said was a data scientist would say, oh, well, I'll just throw a random forest at it, or I'll just throw a neural network at the problem, and I'll get it to learn the pattern that is emerging from the data that way. So I liked that. But it occurred to me that even something simple like regression has a sort of Bayesian element to it.

Jason:

Yeah, and even before we get into complicated models, you can see it when we apply linear regression: you have a certain stability to the model based on the current dataset that you have, you can update it with more data, you can add more data points, and you can then refit it, and you get a new updated version of that model. And so in the case of linear regression, maybe you're capturing a trend in the data. And maybe that trend has shifted because of some unknown, or maybe there's some reason to go and investigate what that unknown is, if something has caused a shift in your data. And I think Joseph also talked about the effect of outliers, and whether you need to account for them. If an outlier is going to dramatically shift your model, maybe it wasn't stable in the first place, and you need to look at the distribution in your beliefs, and look at how stable the model is based on how much data you have, or whether the outlier is actually really interesting and you've got to figure out what's causing it.

Jeremy:

That's a really good shout. And sometimes the outlier is an artefact of the collection process, or sometimes it's an artefact of the sensor, or it may just be that the data got mangled along the way, who knows. But it can be a nice way of picking up that kind of thing. Again, coming back to what Joseph was pointing towards, which is that sometimes you can get a better fit from a more complicated model, but that may not be what you want. You might actually want to be in the constraints of a slightly simpler model in order to cut through that kind of noisy data situation, because otherwise, very classically, you get an overfitted model that isn't going to be a good predictor for anything in the future.

Jason:

And I think there was some modelling that happened around the trend of COVID, where you add in, exactly as Joseph said, another variable so that you can make it a quadratic. But if you extend that into the future and treat it as if it's a forecast, you see the effect of that quadratic fly off in one of the directions that you don't have any data for; it's not constrained. And it's no longer valid, you can't be using this as a forecasting tool just because it fit really well in the interval where you did have data.
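
A small sketch of that extrapolation trap, with simulated case counts and invented numbers, just to make the point concrete:

```python
import numpy as np

rng = np.random.default_rng(2)

# Roughly linear case counts observed over days 0 to 30, with noise.
days = np.arange(31)
cases = 50 + 10 * days + rng.normal(scale=20, size=days.size)

# A quadratic fits the observed interval comfortably...
coeffs = np.polyfit(days, cases, 2)
print("fit at day 30:", round(float(np.polyval(coeffs, 30)), 1))

# ...but outside that interval the quadratic term is unconstrained by any data,
# so a "forecast" far into the future bends off in whatever direction the noise dictated.
print("extrapolation at day 90:", round(float(np.polyval(coeffs, 90)), 1))
```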

Jeremy:

Yeah, interesting. I think there are lots of pitfalls to using a statistical model, where you have to have an understanding, I guess, of the underlying dynamics of what you're looking at, to be able to make some of those initial modelling decisions. But when you do have that understanding, when you do have that training, then it's enormously powerful, and it can be a tremendous benefit to the data scientist to have that level of insight and that experience. If you just use a sort of machine learning toolkit, where you have maybe a black-box neural network that you're throwing at the problem, maybe that doesn't come through, and you run a much greater risk of accepting data points as legitimate, and as affecting your output function, and your output classification, if that's what you're doing, as a result of not having that greater depth. So there's that balance, I guess, between the simplicity of doing something where you say, I'm deliberately not going to try to understand this system, I'm just going to throw a box at it, versus the extra insight and understanding and depth that you can bring when you say, I've got a very, very strong hunch that this is a Poisson-distributed process, and I'm going to base my modelling on that. And that gives me probably a much more convergent, accurate process much more quickly.

Jason:

Something else that Joseph mentioned, and I'll ask you: how should we bring more of this way of thinking into data science, and, I guess, see where the benefit is versus the situation you just mentioned of taking something off the shelf? That's valid in many cases, where I just want to see, you know, the usefulness on the current static dataset, and there doesn't need to be a bigger understanding of what's going on, because I've got quite a self-contained problem. Let's say it doesn't have a medical outcome, like we talked about with the COVID testing.

Jeremy:

I think the power, from the data science perspective and the way of thinking about a problem, when you're using this Bayesian update inference rule, if you like, comes from being able to recognise the fact that your data is not static, typically. It's very rare that you're given a problem where we say, here's a body of knowledge, and it's never going to change again, we just want to know, should we go left? Should we go right? Should we spend a million dollars? Should we spend $100 million? And that's it. It's more often the case that you're given situations where the data changes, where what was true yesterday may not be true tomorrow, because the data has shifted, and maybe the model has shifted as well. And that, I guess, is where things can get quite challenging from the perspective of a Bayesian model, if you were assuming it was one type of model and then the very essence of that has been modified. So it really embodies the concept of change with respect to data, and especially drastic change, which we've seen anyway with the pandemic of late, in demand figures going haywire and all kinds of societal behaviour changing dramatically. Where you see that, you have to be super careful with that data coming into any model, be it Bayesian or otherwise, as to how that's going to affect the future operation of that model. Maybe you want to sandbox it a little bit, maybe you want to put a mark around that data set and say, I wouldn't put too much belief in this set of data if I were you, because the probability that we have a pandemic is, I hope, really small usually. So it speaks to a lot of that kind of dynamic streaming of data and how you react to change in that. So yeah, I like it from that perspective.

Jason:

That's really cool. And thanks, Jeremy. I think hopefully, some people listening today will have updated their own ideas about what Bayesian inference is, and come away with a new idea for how it can be employed in data science.

Jeremy:

Thanks Jason.

Jason:

Thanks for joining us today at the DataCafe. You can like and review this on iTunes or your preferred podcast provider. Or if you'd like to get in touch, you can email us Jason at datacafe.uk or Jeremy at datacafe.uk or on Twitter at datacafepodcast. We'd love to hear your suggestions for future episodes.

Guest interview with Dr. Joseph Walmswell