Apple Tasting: Reinforcement learning for quality control

February 22, 2021 DataCafé Season 1 Episode 11

Have you ever come home from the supermarket to discover one of the apples you bought is rotten? It's likely your trust for that grocer was diminished, or you might stop buying that particular brand of apples altogether.

In this episode, we discuss how the quality controls in a production line need to use smart sampling methods in order to avoid sending bad products to the customer, which could ruin the reputation of both the brand and seller.

To do this we describe a thought experiment called Apple Tasting. This allows us to demonstrate the concepts of regret and reward in a sampling process, giving rise to the use of Contextual Bandit Algorithms. Contextual Bandits come from the field of Reinforcement Learning, which is a form of Machine Learning where an agent performs an action and tries to maximise the cumulative reward from its environment over time. Standard bandit algorithms simply choose between a number of actions and measure the reward in order to determine the average reward of each action. But a Contextual Bandit also uses information from its environment to inform both the likely reward and regret of subsequent actions. This is particularly useful in personalised product recommendation engines where the bandit algorithm is given some contextual information about the user.

Back to Apple Tasting and product quality control. The contextual bandit in this scenario consumes a signal from a benign test that is indicative, but not conclusive, of there being a fault, and then makes the decision to perform a more in-depth test or not. So the answer for when you should discard or test your product depends on the relative costs of making the right decision (reward) or wrong decision (regret) and how your experience of the environment affected these in the past.
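Editor's sketch: the dependence on relative costs can be made concrete with a small expected-cost comparison. This is only an illustration under stated assumptions. The function names and the cost figures (10 for a bad product reaching a customer, 1 for a good product destroyed by testing) are hypothetical, and the rule deliberately ignores the value of what is learned by testing, which the episode goes on to discuss.

```python
def break_even_probability(cost_bad_shipped: float, cost_good_destroyed: float) -> float:
    """Fault probability at which testing and shipping untested cost the same on average."""
    return cost_good_destroyed / (cost_good_destroyed + cost_bad_shipped)


def should_test(p_bad: float, cost_bad_shipped: float, cost_good_destroyed: float) -> bool:
    """True when the expected cost of testing is lower than that of shipping untested."""
    # Shipping untested only hurts when the product turns out to be bad.
    expected_cost_if_shipped = p_bad * cost_bad_shipped
    # Testing only hurts when the product was good and is now unsellable.
    expected_cost_if_tested = (1.0 - p_bad) * cost_good_destroyed
    return expected_cost_if_tested < expected_cost_if_shipped


# Hypothetical costs: bad product shipped costs 10, good product destroyed costs 1.
print(break_even_probability(10.0, 1.0))  # about 0.091
print(should_test(0.05, 10.0, 1.0))       # False: below the break-even probability
print(should_test(0.20, 10.0, 1.0))       # True: above the break-even probability
```

With these assumed costs, a signal only needs to push the chance of a fault above roughly 9% before testing becomes the cheaper option.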

We speak with Prof. David Leslie about how this logic can be applied to any manufacturing pipeline where there is a downside risk of not quality checking the product but a cost in a false positive detection of a bad product.

Other areas of application include:

  • Anomalous behaviour in a jet engine e.g. low fuel efficiency, which could be nothing or could be serious, so it might be worth taking the plane in for repair.
  • Changepoints in network data time series - does it mean there’s a fault on the line or does it mean the next series of The Queen’s Gambit has just been released? Should we send an engineer out?

With interview guest David Leslie, Professor of Statistical Learning in the Department of Mathematics and Statistics at Lancaster University.

Thanks for joining us in the DataCafé. You can follow us on twitter @DataCafePodcast and feel free to contact us about anything you've heard here or think would be an interesting topic in the future.

Jason  0:00 
Hello, and welcome to the DataCafé. I'm Jason. 

Jeremy  0:04  
And I'm Jeremy. And today we're talking about Apple tasting.

Jason  0:10  
Apple tasting, we're gonna have to give a bit more context for this one, I think, what are we talking about with Apple tasting? Jeremy? 

Jeremy  0:17  
Well, context is everything as we'll find out. Yeah, I mean, this is a bit of fun. This is a scenario where you have a conveyor belt of apples going in front of you, maybe you're a grocer who's decided to really major on the apple market. And you realise that you need to make a fantastic reputation for yourself as a purveyor of extremely high quality products and apples. And so you're wanting to weed out rotten apples on your conveyor belt, and get only really perfect apples to the shops, right? So how would you do this? You're employing a tester. And for the sake of argument, this is a person who's looking at the apples. And as they go past is going, I think that's fine, I think that's fine. And then suddenly spots one, which maybe has some blemish or something, and then they go, that might mean that the apple is not as good as it might be...

Jason  1:21  
A bad apple.

And this person is looking at them manually. But we've said tasting, so we're implying that they will have to actually pick up the apple and taste it. That's the sampling process that we're indicating.

Jeremy  1:34  
Right. So here's the fun part, if you like, which is that in the thought experiment, which is Apple tasting, the tester has to taste the apple, right? If they're suspicious that it might be a bad apple, they pick it up, they taste it, and they can tell for certain once they've tasted it if it's a bad apple or not. But that comes at a price, right?

Jason  2:01  

Jeremy  2:02  
I mean, I wouldn't want to buy an apple that had been tasted by somebody else. That's really what we're saying.

Jason  2:08  
Yeah, they're not gonna pick it up, taste it and say, That's a lovely Apple, the customer looks forward to that one, 

Jeremy  2:14  
...put it straight back in the box off to the shops with you. 

Jason  2:17  
So they picked it up and tasted it and found that it was a bad apple. So they're glad that they have discovered that, and it gets discarded, and it doesn't get to the customer. So they're doing us a favour in that regard, protecting the customer's interests, which is the focus here.

Jeremy  2:33  
Yeah, they're doing us a favour when they get that decision right. And they're causing some level of harm, I suppose, to the grocery business if they get the decision wrong: they pick it up, they discover it's an absolutely delicious apple, they can't very well put it back on the conveyor belt, they can't send it to the shops anymore. That's now a wasted product, that's a product which can no longer be sold. So you've got this sort of jeopardy going on in this hypothesis. You've got some interesting information, which could be used to indicate a possible problem. And then you've got this jeopardy, which is: I can taste the apple and find out if it's really true, it's really bad. Or I can let the apple go by and risk a really bad apple going to the customer and wrecking our reputation.

Jason  3:23  
I often get reminded of Schroedinger's cat with these situations, if you want to check that it's in the box, you've ruined all hope that it was alive if you discovered that it was dead. And in this case, the Apple is dead in some manner of decay. It's always the way I kind of remember from college framing the quantum principle, but we're right at the macro level here. And we're interacting with the customer's product. So we've come across some constraints here. And really the question is, how is this going to be relevant to data science and the setup that we're getting at? 

Jeremy  3:58  
I love the idea of Schroedinger's apple. That's really nice. Yeah, absolutely. So what is this at heart? Because I think the data science of Apple tasting per se has probably not got that huge, wider appeal. But actually, this is a really nice statistical testing problem. That is something which a lot of data scientists, I think, will come across at some point, maybe if they're working, especially in operational based industries, or in advertising, online advertising or anything like that. It's really something where you've got this observation of a product, of an item, of an interaction with a user, which is indicative. It's an indication of there being some value, or some cost in the case of it being a bad apple, in the data, and then a further test which can uncover the real truth, but has a high cost or higher cost associated with it.

Jason  5:04  

Jeremy  5:04  
And that's the setup here: this indicative information, and then some kind of higher cost outcome as a result of the test.

Jason  5:15  

Jeremy  5:15  
So I think that's, that's what makes it interesting. From a data science perspective. 

Jason  5:19  
Yeah, it's interesting to put the cost on the sample at the sample level, as opposed to the cost of an experimental setup in the first place. Because as a data scientist, it doesn't really cost me anything to look at my data. But it's how that data flows that gives me a context that puts a cost on it.

Jeremy  5:39  
And then, crucially, what decision do you take? The fully rounded data scientist goes, well, what's the decision I should take based on the outcome of this experiment, or many experiments? Of course, now I have an idea of what decision I could take. What's my optimal strategy for taking decisions in this context, to enable me to maximise some value, of course.

Jason  6:03  
And you had an interview with Professor David Leslie, a professor of statistical learning in the Department of Mathematics and Statistics at Lancaster University. I think he sets it up really nicely for us to hear.

Jeremy  6:19  
Oh, I'm joined in the DataCafé today by Professor David Leslie. David, welcome. 

David  6:23  
Thank you. 

Jeremy  6:24  
So we just had a lovely talk from you about contextual bandits, and Thompson sampling and all these magnificent techniques. And I think the first thing I'd really like to sort of focus on is a point you made right at the start, where you emphasised how important it was from a statistics perspective to focus on the decisions that are being taken by a technique and how that really drives everything in terms of the importance of the approach. So is there a particular example where that's really proved to be true for you?

David  7:00  
What we see a lot is that the stats community will focus a huge amount of effort in getting the best possible estimate of the truth, when actually it will make no difference at all to the following decision. And so you might care, is my parameter 1 or 1.1? But actually, if the decision that you make as a result of that doesn't change at all, there's no point in putting your effort into the inference. You need to focus much more on moving on to the next problem where you can actually find out something that changes what will happen. Yeah, I think that's something I've noticed in working in industry is that, you know, there's this desire to make things absolutely fantastic, but actually better is often quite good and desired. And if you can say 20% or 30% improvement, that's a substantial win. It might not be the 42 and a half percent that would be optimal, but it's still more than good enough. So I've certainly seen that. We also see this in tuning machine learning models, okay. So we can spend a lot of time trying to find the correct parameter for tuning your model. But really, the performance surface near the best parameters is very flat, almost all the time. You very quickly get to a region near the optimal. And then you can spend huge amounts of time trying to get to the best parameter, and you get essentially no difference in performance. That's why it's so hard to tune it perfectly, because there's very little signal. And so you're wasting a huge amount of effort for no real benefit in the performance overall. You get a disconnect between the people doing the tuning and the people using the system at the other end.

Jeremy  8:46  
Yeah. So one of the frameworks that you talked to us about in your talk was contextual bandits, which is essentially a framework for encapsulating the algorithm and the decisions. Do you want to just give us a little bit of an insight into how contextual bandits can be super helpful in this?

David  9:05  
Okay, so I'll start with non contextual bandits, where you don't get any signal. And so you sequentially take decisions one after the other. And you've always got the same set of things you could choose from. But there's no difference from one time step to the next in terms of the reward for different actions. So if you were advertising, you would be taking no information about who you're advertising to: you just create an advert and then you push out an advert, and you assume you get the same reward irrespective of who you're showing that advert to. And that's crazy.

If you go to a contextual advertising setting, well then you get some information about who you're showing the advert to, and then you show the advert, and it allows you to have estimated rewards of ads depending on the characteristics of the person that you're advertising to. And so you can learn personalised advertising strategies by responding to what you see before you make your action selection. You can also think of another example where you are looking at teaching a robot to learn how to act, okay. So the robot can either go left or right in any particular situation. And a non contextual bandit will just tell the robot, okay, you can go left or right, learn which one's better. But if you give it a contextual bandit approach, you will let the robot look at what's out there before deciding which direction to go in, and learn what the best action to take is, depending on what it sees. And so the point about going from a non contextual to a contextual bandit is that it allows you to react to what you see when you're making the decisions.
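Editor's sketch: the robot example can be turned into a few lines of code to show what the context buys you. This is a minimal illustration, not anything described in the talk. It assumes a toy world in which an obstacle appears on the left or right at random, the reward is 1 for steering away from it, and the learner is a simple epsilon-greedy bandit (a technique chosen here for brevity). All names are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")


class EpsilonGreedy:
    """Keeps a running average reward per (context, action) pair."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def choose(self, context=None):
        # Explore with probability epsilon, otherwise pick the best-looking action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.means[(context, a)])

    def update(self, action, reward, context=None):
        key = (context, action)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


def run(use_context, steps=2000, seed=0):
    """Average reward of a robot that does or does not look before moving."""
    env_rng = random.Random(seed + 1)
    agent = EpsilonGreedy(seed=seed)
    total = 0.0
    for _ in range(steps):
        obstacle = env_rng.choice(ACTIONS)           # where the obstacle is
        context = obstacle if use_context else None  # non contextual: sees nothing
        action = agent.choose(context)
        reward = 1.0 if action != obstacle else 0.0  # reward for avoiding it
        agent.update(action, reward, context)
        total += reward
    return total / steps


print("non contextual:", run(use_context=False))
print("contextual:    ", run(use_context=True))
```

In this toy setup the non contextual learner cannot do better than chance on average, because neither direction is better overall, while the contextual learner can pick the right direction for each situation once it has seen a few examples of each.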

Jeremy  10:57  
Super. And then one of the applications that you then used your contextual bandits in was in an area called Apple tasting, which, for me, spoke to any company that's doing manufacturing, or process production, and needs to test products, needs to look at their quality control across their product lines. Whatever products are being produced, it has this really interesting model to it. So would you like to tell us a little bit about the apple tasting setup, because I think this is a lovely example.

David  11:36  
Yeah, so the apple tasting is a, it's a version of contextual bandits, where on each time instance, you get a signal, telling you something about the world or the product that you may or may not decide to inspect. You then decide whether or not to pull that one off the production line and look at it. And if you look at it, you can't sell it anymore. So the apple tasting problem is you have to pick up that apple and bite the apple.

Maybe you kind of prevent that sale happening to a customer. And so that slows down your revenue. But of course, if you don't look at it, you will never know if it's good or bad, until the product has gone to the customer, with potentially bad outcomes if it should have been. And so the challenge here is to make sure you inspect enough of the products to learn a good model of when it is worth interfering in the process of releasing the products to the customer. So if I don't know which signals signal that I've got a bad product, then I can't do this effectively. And so I need to inspect and learn which ones might be bad, even before I go. Now, of course, the really challenging thing here is that the mapping from signal to outcome changes through time.

You know, all of our theory is set up in a very stationary world, where those mappings are fixed through time.

But of course, if things are fixed through time, you don't really need to monitor them because the world's okay. And so the real challenge is where we're trying to explore and keep up when the world changes a bit. So say something, a machine, is breaking, can we detect that there's something going wrong in the system? And then you need to make sure you keep exploring things. If you get a signal which would have been a good signal of things being okay in the past, you still need to explore that enough to check that that is still a signal that things are okay. Because otherwise you might start releasing, you know, any number of bad products to the market before you ever learn about that fact. So in the context of contextual bandits, your signal is your feedback from your test of these products as you select them, as you test them, and you see which ones are bad, faulty, defective, whatever, and which ones are not, which ones are fine. And then you're learning, you're storing that information, you're encapsulating it in distributions or whatever. And that allows you to then make better decisions next time around. That's it, essentially, yes, so but the signal is not. So we have the signals, which are the context, which are the things which we see anyway, whether or not we make the test. So in the apples, it's the appearance of the apple, you know, is it slightly scuffed, a bit off colour, and then we decide whether or not to intervene and take a bite to make a measurement of whether the apple is actually a good or bad apple. And then what we're trying to learn is that mapping from appearance to the good or bad label, and we need to make sure we take enough bites to learn that mapping. So there has to be an indication, there has to be a hole in the apple or some brown discoloration or something, or in the context of a product.
There has to be something that you might be able to observe, from a photo or video or a rattle in the box or something, which might indicate it's no good. If it looks just the same as every other product... You've got to have a tiny indication at the beginning. There is, though, earlier work on, if all the products look the same to the tester, how many of them should we bother picking out to work out, is the system going bad? That's a separate area of research called statistical process control. Yeah, but this bit of work is where you get some signal. Yes. So the example I gave in the seminar was of fault monitoring on a network, okay. You get some indication as to whether or not this is a fault. You know, we observe some data about the fault that has been flagged up, and we need to decide, is this worth sending to a human to investigate or not? And so the information, the context, is the current state of the network. And then we need to decide whether or not to ask the human for a label: yes, that is a proper problem, or no, that's not really important at all, that just happens.

Jeremy  16:00  
In each of these cases, there is a cost of doing so. So either we've destroyed the product, eaten it, or at least made it, you know, so it's beyond other humans to be able to eat, by biting into the apple, or, alternatively, we've incurred a cost just by sending it for some kind of human inspection. And that itself is costly, because it's more expensive to do that.

David  16:21  
That's exactly it. Yeah. So the intervention action costs something, the non intervention costs you nothing unless it's a bad example. But also non intervention costs you data for the future. So you don't get to learn about the things that you sent out and haven't tested, so you can't use them for improving your model of what's good and what's bad. And so that's what we always see in this bandit world, where we have the immediate reward, but also the, okay, do we learn for the future or not?
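Editor's sketch: the point that not intervening "costs you data" can be seen in a small simulation where the true label is only revealed when an apple is tasted. This is an illustration under loud assumptions, not the method from the talk. The fault rates per signal, the costs, the 10% share of blemished apples and the smoothed-frequency estimate with a small random exploration rate are all hypothetical choices.

```python
import random

# Hypothetical world: chance that an apple is bad, given what it looks like.
P_BAD = {"clean": 0.02, "blemished": 0.60}
COST_BAD_SHIPPED = 10.0   # assumed cost of a bad apple reaching a customer
COST_GOOD_TASTED = 1.0    # assumed cost of destroying a good apple


def simulate_apple_tasting(explore_rate, steps=5000, seed=0):
    """Return (average cost per apple, number of apples tasted per signal)."""
    rng = random.Random(seed)
    tasted = {signal: 0 for signal in P_BAD}
    bad_found = {signal: 0 for signal in P_BAD}
    # Taste when the estimated fault probability exceeds the break-even point.
    threshold = COST_GOOD_TASTED / (COST_GOOD_TASTED + COST_BAD_SHIPPED)
    total_cost = 0.0
    for _ in range(steps):
        signal = "blemished" if rng.random() < 0.1 else "clean"
        is_bad = rng.random() < P_BAD[signal]
        # Smoothed estimate of P(bad | signal) from tasted apples only.
        estimate = (bad_found[signal] + 1) / (tasted[signal] + 2)
        taste = estimate > threshold or rng.random() < explore_rate
        if taste:
            # Tasting reveals the label, so the learner can update.
            tasted[signal] += 1
            bad_found[signal] += int(is_bad)
            if not is_bad:
                total_cost += COST_GOOD_TASTED
        elif is_bad:
            # The cost is incurred, but the learner never sees this label.
            total_cost += COST_BAD_SHIPPED
    return total_cost / steps, tasted


avg_cost, tasted = simulate_apple_tasting(explore_rate=0.05)
print("average cost per apple:", round(avg_cost, 3))
print("apples tasted per signal:", tasted)
```

The small exploration rate is what keeps some data flowing for signals the learner currently believes are safe, which is the trade-off described above between the immediate reward and learning for the future.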

Jeremy  16:53  
Mm hmm. This idea of exploration versus exploitation, you know, when is it worth me putting in the effort to look beyond what I know currently, and discover something I didn't previously know that I could later exploit?

David  17:08  
Exactly. And if we take that back to advertising, if you stop showing an advert forever, you will never learn if that advert is now a useful advert to show, because you're not getting any more data about it. So until you've decided to start showing that advert again, you don't learn. The same in clinical trials of drugs, which was the first place that many of these methods were described: if you don't try a drug, you never find out if it's any good. Equally, if you get stuck on a drug, and there are new ones coming along, but you don't explore them, you'll never learn if there are any better ones.

Jeremy  17:39  
Yeah, well, I suspect we're all going to, we're going to learn something from a whole set of vaccines in the very near future. So that in a sense will be a nice piece of experimental bandit work that we can

David  17:52  
Actually, Jeremy, so the big Oxford clinical trial, the RECOVERY trial, was designed by one of my colleagues in Lancaster, the stats of it. And he was using adaptive trials methodology. So that as methods were coming in, they were focusing their efforts on those treatments that were working best, or that there was most to learn about quickly, and then throwing out treatments when they were no good. And so that was the method that they used to really, really quickly find the first effective treatment for very ill COVID patients, and quickly threw out the Donald Trump treatments, shall we call them? The ones which were never going to be useful, but the trial very quickly worked that out. And so they didn't have to keep allocating those bad treatments to trial participants through time, because it was adaptive and did things through time.

Jeremy  18:41  
Excellent. I'm glad to hear we discarded the bleach option quickly.

Yes, so I think that's really, really exciting. And I think a real triumph of that has been the speed at which they've been able to come up with effective treatments. So what a success story for, for the field. 

David  18:59  
Yeah, so you get that during the trial, instead of dividing it evenly among the different treatments, and then try all of them till the end of the trial, being able to be adaptive, and figure things out and adapt to what you're seeing as you go was critical for the speed of that trial. 
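Editor's sketch: to illustrate the general idea of adaptive allocation, as opposed to dividing participants evenly, here is a toy Thompson sampling loop, the technique mentioned earlier in the conversation. It is emphatically not the design of the trial David describes. The success rates, the Beta(1, 1) priors and the function name are hypothetical, and real adaptive trials involve far more than this.

```python
import random


def thompson_allocation(true_success, steps=3000, seed=0):
    """Allocate each participant to the arm whose sampled success rate is highest."""
    rng = random.Random(seed)
    k = len(true_success)
    successes = [1] * k   # Beta(1, 1) prior: one pseudo-success ...
    failures = [1] * k    # ... and one pseudo-failure per arm
    pulls = [0] * k
    for _ in range(steps):
        # Draw a plausible success rate for each arm from its current posterior.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        pulls[arm] += 1
        # Observe a simulated outcome and update that arm's posterior.
        if rng.random() < true_success[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Three hypothetical treatments with made-up success rates.
pulls = thompson_allocation([0.3, 0.5, 0.7])
print("allocations per treatment:", pulls)
```

Arms that look poor are sampled less and less often as evidence accumulates, so most allocations end up on the arm that is doing best, without ever ruling an arm out completely.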

Jeremy  19:14  
David, thank you very much for joining us in the DataCafé today. That was really fascinating.

David  19:19  
Thank you. It's been really enjoyable chatting with you.

Jason  19:25  
One of the points that I really, really liked that David highlighted was about how much effort or expense or cost of some sort it takes to really refine a model to a degree that maybe is not necessary because the decision won't change. It echoes some of our earlier conversations on the podcast about how important it is to consult with your user, your subject matter expert, the customer, if you can, about what is the decision or the action that you're going to take and how we reached a level where we've given you the ability to confidently do that. And any additional effort kind of above that MVP, we've mentioned before that minimum viable product, if it's now viable, let's get that decision made and find the gains in the customer's benefits, before we go back to enhancing the model further.

Jeremy  20:20  
So I think every data scientist at some stage in their life tends to come across the situation where you realise you could refine and improve quantitatively, your model your algorithm, your approach, to a greater extent, but actually, you know, down the line, you realise that the last three weeks maybe of work hasn't led to any kind of improvement in the decision, you'd still have had exactly the same outcome from your algorithm. And so, you know, in hindsight, maybe you could have stopped previously, when the algorithm was just good enough, and just giving you that decision process, which is properly optimal for the use case you have, it doesn't have to be perfect. I think, you know, perfection can be the enemy of data science, sometimes, especially in the industrial context. 

Jason  21:08  
Yeah. And that industrial context speaks as well to supply chains or longer processes where you may be part of that process. But there could be something feeding it that itself has an error bar, let's call it, and that error bar doesn't need you to then tweak beyond that level, because you can't carry that forwards. If you've got such high accuracy, it's built on top of an earlier process, and then has a downstream aspect to it that could have its own compounding errors. And we've seen this in our physics classes where you have to look at how your errors compound. This is really telling when we look at what tuning and fine tuning we can do to a really cool model, you know, especially in machine learning, for example, but is it worth it?

Jeremy  21:58  
Absolutely. I also liked the contextual side of David's talk, which was to say, particularly the landing of this fun example in a topic that, actually, we've covered to some extent before on DataCafé, which is that of multi armed bandits. And the particular additional flavour now that we have for multi armed bandits, which he's given us the heads up on, is contextual multi armed bandits. So I think probably we need to give a little bit of guidance now on how that fits with the apple tasting.

Jason  22:38  
I read a paper by Google that they put out about their autoML, but it was for contextual bandits. And the line was: the setup for a contextual bandit problem is that an agent observes repeatedly a context, performs an action and receives a reward that depends, typically stochastically, on both the action and the context from the environment. And I thought it might be useful to unpack that definition a little bit, because you're bringing in that context from the environment. So you're no longer just doing a sample, which is your action, and looking at the reward you get. But you have the context additionally now.

Jeremy  23:21  
Yeah, so what does context mean in the scenario we have, and then in other scenarios as well? In the apple tasting context, the contextual element is the discoloration on the surface of the apple, or it's the mould, or it's the misshapen apple that is going in front of the tester. It's crucially this signal, this notion of signal, which is an indication of something being wrong. And that's the context from the environment. It's some level of sensing that's going on that is not itself a complete giveaway. It's not giving you all of the information. It's cheap to gather, it's easy to see or observe or measure or collect. But it's not maybe able to give you the complete deep exploration into the underlying problem.

Jason  24:12  
Yeah, and I read another line that said the goal of a bandit formulation is to minimise regret, which it defines as the difference between the cumulative reward from what would be the optimal policy and the trained agent's cumulative sum of rewards. So again, to unpack it, if I regret trying all of the good apples, that's a high cost to me, and the terminology of regret is really nice to describe it there. Maybe I'm gonna get a stomachache, I've tried so many apples at this point, when the bad ones are the rarer ones. And I don't want to have such a high regret cost by trying all the time.

Jeremy  24:56  
Exactly. We've got this notion now of the algorithm, the bandit algorithm, which is, it's a lovely setup. It's just an algorithm where you have these levers in front of you, these options, these decisions that you can take. And each one of them has a reward distribution, some level of value, if you like, associated with it. It doesn't have to be the same. It can be the same, or it can be a pound for one, two pounds for the other. But it's more interesting if that varies. You've got this notion of, therefore, there being an optimal decision, like, you know, I should always be pulling the two pound lever, or I should alternate or something like that. I should always be picking the rotten apples and discarding them, and I should always be letting the good apples go to market, would clearly be the optimal policy, and the regret is the difference between what you actually do and what perfection would look like, were it to be achievable. Of course, that's the 64,000 dollar question.
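Editor's sketch: the definition of regret as the gap between "what perfection would look like" and what you actually did can be written out directly. The reward values below are hypothetical (plus 1 for a good apple sold, minus 10 for a bad apple sold, 0 for any apple that is tasted), and "perfection" here means a tester who somehow knows the true label, which is an idealised benchmark rather than an achievable policy.

```python
from itertools import accumulate

# Hypothetical rewards for each outcome.
REWARD_GOOD_SHIPPED = 1.0    # a good apple reaches a customer
REWARD_BAD_SHIPPED = -10.0   # a bad apple reaches a customer
REWARD_TASTED = 0.0          # a tasted apple is never sold, good or bad


def reward(is_bad, tasted):
    """Reward for one apple given its true state and the action taken."""
    if tasted:
        return REWARD_TASTED
    return REWARD_BAD_SHIPPED if is_bad else REWARD_GOOD_SHIPPED


def optimal_reward(is_bad):
    """Reward of the idealised policy: taste exactly the bad apples."""
    return reward(is_bad, tasted=is_bad)


def cumulative_regret(apples_bad, tasted_flags):
    """Running total of (idealised reward - actual reward) over the sequence."""
    per_step = [optimal_reward(b) - reward(b, t)
                for b, t in zip(apples_bad, tasted_flags)]
    return list(accumulate(per_step))


# Four apples: good, good, bad, good. The tester tastes only the second one.
apples_bad = [False, False, True, False]
tasted = [False, True, False, False]
print(cumulative_regret(apples_bad, tasted))  # [0.0, 1.0, 11.0, 11.0]
```

Tasting the good second apple costs 1 of regret, and letting the bad third apple through costs a further 10, which is the asymmetry the apple tasting problem is built around.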

Jason  25:58  
Yeah. And I think the way I was looking at regret is even not the right way yet. Because really, it's a regret for the business model. In this case, you regret that you could have sold those good apples, but you were wrong to try them, as opposed to letting them through and only removing the bad ones. 

Jeremy  26:20  
David actually indicated there was even another level of regret, it's not seen in those terms. But there's an extra loss or an extra cost, certainly associated with letting an apple go, which is that you don't get to find out. You don't get to learn the correspondence between the signal and the actuality of is it a bad apple? Or is it? Was it the right decision to take.

Jason  26:43  
And moving away from apples. And we can imagine any sort of a manufacturing pipeline, for example, drugs, and the process of drugs being either effective, say in the case of a painkiller, or statistically, maybe one of them is just whatever one in however many 1000s or millions is ineffective. How can we be sure?

Jeremy  27:04  
Yeah, exactly. The nice thing about this is it applies to many different scenarios, and one of the obvious ones is a manufacturing pipeline. So in the picture you're painting, if you had that painkiller and one in 10 was ineffective, you wouldn't necessarily know to look at it. But maybe there's some signal, maybe there's some context, which you can test for.

Jason  27:26  
Test and not taste. 

Jeremy  27:28  
Right, right, exactly. Because obviously, the thing you can't do is take it and see, because that obviously destroys the product. And might be risky in the case of a drug, of course. But maybe, I don't know, maybe for the sake of argument, the weight of the product is indicative, is your signal. Maybe there's a tiny deviation in weight. But it's not a complete giveaway, maybe it's just a 50% probability or something that the painkiller is not as good if there is this small weight deviation. But that's much higher as a signal than just the one in 10, or 1 in 1000, or something that might be your random selection probability of finding one of these tablets. But you have to have a test. So maybe you drop your tablet into a reaction vessel and find out. But clearly, once you've done that, you can't sell it.
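Editor's sketch: a quick piece of arithmetic shows why a signal that lifts the fault probability to around 50% matters so much when the test destroys the product. The 1 in 1000 base rate is taken as an assumption from the loosely stated figures above, and the function name is hypothetical.

```python
def good_items_destroyed_per_fault_found(p_bad):
    """Expected good items destroyed for each faulty one found.

    Assumes every destructively tested item is independently faulty
    with probability p_bad.
    """
    return (1.0 - p_bad) / p_bad


# Testing tablets picked at random (assumed 1 in 1000 faulty).
print(good_items_destroyed_per_fault_found(1 / 1000))  # about 999
# Testing only tablets flagged by the weight signal (assumed 50% faulty).
print(good_items_destroyed_per_fault_found(0.5))       # 1.0
```

Under these assumptions, testing at random wastes roughly a thousand good tablets per fault found, while testing only the flagged tablets wastes about one.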

Jason  28:21  
And when we talk about this testing process, and we get to the numbers of one in maybe millions, you can see how testing something like say the weight of them, and maybe it has to be a really accurate test if that weight difference is so small, has its own expense. And we really need to balance, how much testing against all of the expense against all of the reward, and what our measure of regret is, and how important the quality is based on who our end customer might be. 

Jeremy  28:51  
Yeah, I mean, there is an interesting analogy with the current pandemic situation, isn't there? You know, the testing has a cost. It's not infallible, some of the tests are better at producing true positives than others, and false negative rates as well. So, you know, it is only a signal, which is indicative of, do you actually have the virus or not? So I think there's some really interesting analogies with the current testing framework that we're all learning so much about. But you're absolutely right. I mean, as soon as you start playing with the costs of these activities, the decision environment becomes really interesting. Imagine it's not a painkiller in your example, imagine it's some expensive cancer drug or something that costs, you know, 2000 pounds a dose. You're prepared to probably put in place quite a sophisticated sensing framework in order to avoid losing a 2000 pound dose of a really important drug in that setting. So it can make a massive difference to what you're prepared to do and where you're prepared to place your investments in that scenario.

Jason  30:02  
Another thing that occurs to me as we talk about it is maybe there's a learning that also happens off the back of it. Because if in the apple case, again, I learned that a certain level of discoloration is normal, maybe for a red apple. And there's a crossover before that discoloration indicates that it's a bad apple, I might be learning these levels, these thresholds. So once I have more and more information, my entire process can improve over time. It's not static in its own right. 

Jeremy  30:35  
Yes. So in the greater scheme of things, this is clearly, as all multi-armed bandit problems are, a reinforcement learning problem. And if you know nothing, if you have no initial association in your learnt memory store of signal to outcome, signal to decision, then that's something you can learn by testing. You can test everything, in which case you lose everything, or you can test at a very high rate and let less through. The more you test, the more you learn; conversely, the less you test, the less you learn. The hard thing that David alluded to, of course, is what happens if halfway through your learning process, or indeed just halfway through your manufacturing process, the signal-to-outcome relationship itself suddenly changes. Maybe you've got a fault in your measurement device, or maybe the manufacturing process kicks over into another manufacturing process without anyone telling the people doing the testing. And suddenly there's something else that could go wrong, or something else that is now indicative, and the weight is fine. Maybe they fixed the weight issue completely in the manufacturing process. So if that signal changes, and the statistical relationship between the signal and the outcome (whether it's a good product or a bad product, a faulty product or a working product) changes, then picking that up, and doing so quickly, is really challenging. And that's clearly where David sees a lot of the interesting new research going in the future.
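
The idea that you only learn the signal-to-outcome association from the items you actually test can be sketched as follows. This is an illustrative toy, not a method described in the episode: the class `SignalOutcomeMemory`, the binning of the signal into five ranges, and the simulated environment (where the fault probability is simply equal to the signal) are all assumptions made for the example.

```python
import random


class SignalOutcomeMemory:
    """Per-bin fault counts, updated only when an item is actually tested."""

    def __init__(self, n_bins: int = 5):
        self.n_bins = n_bins
        self.faulty = [0] * n_bins
        self.tested = [0] * n_bins

    def _bin(self, signal: float) -> int:
        # Signal is assumed to lie in [0, 1]; clamp so 1.0 lands in the last bin.
        return min(int(signal * self.n_bins), self.n_bins - 1)

    def update(self, signal: float, is_faulty: bool) -> None:
        b = self._bin(signal)
        self.tested[b] += 1
        self.faulty[b] += int(is_faulty)

    def fault_rate(self, signal: float) -> float:
        # Laplace-smoothed estimate: 0.5 for a bin that has never been tested.
        b = self._bin(signal)
        return (self.faulty[b] + 1) / (self.tested[b] + 2)


def run(test_rate: float, n_items: int = 1000, seed: int = 0) -> SignalOutcomeMemory:
    """Simulate a production run in which a fraction `test_rate` of items is tested."""
    rng = random.Random(seed)
    memory = SignalOutcomeMemory()
    for _ in range(n_items):
        signal = rng.random()
        is_faulty = rng.random() < signal  # hypothetical environment
        if rng.random() < test_rate:
            # Only tested items reveal their outcome, so only they teach us anything.
            memory.update(signal, is_faulty)
    return memory
```

Running `run(1.0)` fills every bin with observations but tests (and so, in the apple tasting setting, loses) every item, while `run(0.0)` ships everything and leaves every estimate at its uninformed starting value.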

Jason  32:13  
So I'm gonna ask my usual question about what the cutting edge might be. 

Jeremy  32:17  
We haven't talked about possible algorithms or how to apply this and get value from it. In the sense that it's a reinforcement learning problem, and it's a contextual bandit problem, you know, there are good tools out there which pick up on these topics. You have experience, where your experience is your observation paired with your outcome or your test, and the reward that you got or the cost that you saved as a result of that. So in the sense of it being a reinforcement learning problem, you can build your process around that. I think what David's looking at is more detailed and has more statistical meat to it. In the talk he gave us, he talks about Thompson sampling, which looks at having a prior belief about your environment and about how the signal is related to the outcome. And that's, you know, super interesting. So there are some really nice things to get stuck into. But I think the real cutting edge is where you have what's called non-stationarity, which is a statistical way of saying "when stuff changes", right? So stationarity is your friend. Stationarity is when the amount of reward you get when you pull a particular lever, or you choose good apple or bad apple, is pretty much the same throughout your experiment, throughout your manufacturing process. Non-stationarity is where you have these changes. Non-stationarity in reward is a bugbear for reinforcement learning, and for bandits in particular. You want your rewards to be broadly predictable. They can be random, they can be sampled from a distribution, but it's when that distribution changes that you're in potential trouble, or at least you have to have algorithms which are prepared to react to that. And the question is how fast you can react to those changes, either in the reward itself or in the signal-to-outcome association that David alluded to.
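
As a rough illustration of Thompson sampling and of one common way to cope with non-stationarity, here is a minimal Python sketch. It is not the approach from David's talk: it is a plain (non-contextual) Beta-Bernoulli bandit with a uniform prior, and the exponential "forgetting" via a `discount` parameter is one simple technique swapped in for the example. The class name, the two-arm setup and the reward rates are all hypothetical.

```python
import random


class DiscountedThompsonBandit:
    """Beta-Bernoulli Thompson sampling with exponential forgetting.

    discount=1.0 gives ordinary Thompson sampling; values below 1.0 shrink
    old evidence at every step so the beliefs can track a changing environment.
    """

    def __init__(self, n_arms: int, discount: float = 0.99):
        self.discount = discount
        self.successes = [0.0] * n_arms
        self.failures = [0.0] * n_arms

    def choose(self, rng: random.Random) -> int:
        # Draw one plausible reward rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then pick the arm with the highest draw.
        draws = [
            rng.betavariate(1.0 + s, 1.0 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        # Forget a little of all past evidence, then add the new observation.
        self.successes = [self.discount * s for s in self.successes]
        self.failures = [self.discount * f for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward


def simulate(discount: float, steps: int = 2000, seed: int = 0) -> float:
    """Average reward when the two arms' reward rates swap halfway through."""
    rng = random.Random(seed)
    bandit = DiscountedThompsonBandit(n_arms=2, discount=discount)
    total = 0
    for t in range(steps):
        # Hypothetical rates; the swap is the non-stationary event.
        rates = (0.8, 0.2) if t < steps // 2 else (0.2, 0.8)
        arm = bandit.choose(rng)
        reward = int(rng.random() < rates[arm])
        bandit.update(arm, reward)
        total += reward
    return total / steps
```

Comparing `simulate(1.0)` with, say, `simulate(0.95)` gives a feel for how forgetting affects the speed of reaction after the swap. The discount is a tuning choice that trades stability in a steady environment against responsiveness when things change.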

Jason  34:18  
So we can kind of see how the manufacturing process has some level of stationarity to it. But if we brought this into the world of advertising, you're going to have so many more factors, because customer demands could change quickly. You could bring in seasonality, you bring in fashion trends, bring in unknowns, you know, the kind of phenomena that just take off, and you don't know how people will react...

Jeremy  34:45  
...a massive pandemic!

Anything that just drove a coach and horses through all of your prior experience. 

Jason  34:52  
Really, really interesting stuff. I think we really got to the core of this apple tasting problem.

Thanks very much, Jeremy.

Jeremy  35:00  
Thanks Jason.

Jason  35:02  
Thanks for joining us today at the DataCafé. You can like and review this on iTunes or your preferred podcast provider. Or if you'd like to get in touch, you can email us Jason at or Jeremy at or on Twitter at datacafepodcast. We'd love to hear your suggestions for future episodes.

Interview with Prof. David Leslie