What exciting data science problems emerge when you try to forecast an election? Many, it turns out!
We're very excited to turn our DataCafé lens on the current Presidential race in the US as an exemplar of statistical modelling right now. Typically state election polls are asking around 1000 people in a state of maybe 12 million people how they will vote (or even if they have voted already) and return a predictive result with an estimated polling error of about 4%.
In this episode, we look at polling as a data science activity and discuss how issues of sampling bias can have dramatic impacts on the outcome of a given poll. Elections are a fantastic use-case for Bayesian modelling, where pollsters have to tackle questions like "What's the probability that a voter in Florida will vote for President Trump, given that they are white, over 60 and college educated?"
There are many such questions as each electorate feature (gender, age, race, education, and so on) potentially adds another multiplicative factor to the size of demographic sample needed to get a meaningful result out of an election poll.
Finally, we even hazard a quick piece of psephological analysis ourselves and show how some naive Bayes techniques can at least get a foot in the door of these complex forecasting problems. (Caveat: correlation is still very important and can be a source of error if not treated appropriately!)
Recording date: 30 October 2020
Intro music by Music 4 Video Library (Patreon supporter)
Thanks for joining us in the DataCafé. You can follow us on twitter @DataCafePodcast and feel free to contact us about anything you've heard here or think would be an interesting topic in the future.
Hello, and welcome to the Data Cafe. I'm Jason.
And I'm Jeremy. And today we're bringing you a Data Cafe US election special.
I mean, I want to ask, why are we doing this? I think I know why we're doing this. We've no interview today, but you're going to take us through some of the insight that you've delved into with the upcoming, I mean, almost immediate US election, Jeremy.
Yeah. I mean, we did ask Vice President Biden whether he was available, but he's apparently on some kind of speaking tour at the moment. So no, indeed.
But no doubt his people got back.
Oh, yeah. They were very polite. So, yeah, we noticed that there was an election going on, and it was very exciting watching all of the discussion and analysis of this. So we thought, well, it'd be nice to have an episode. And then we realised that, to be honest, it was probably a good idea if we did the episode before the election actually happened, rather than take our usual very, very easy time about it.
Even on that, you say we knew there was an election coming up. I mean, I hold my hands up, I have been trying to minimise my exposure to the news and to the absolute influx of information, pandemic and everything aside, so I'm just dipping in and out to make sure that nothing has gone absolutely crazy. But yeah, what's going to be very interesting about this, I think, is coming at it from a data science perspective. So I'm going to ask you, Jeremy, why is this relevant to data science, first of all?
Sure. So this is a bit of fun, but I think it has some really serious topics behind it, which I hope we can explore briefly today. If you look at what an election represents, it's an expression of the will of many people voting to say, I would like this person to be president, or I would like this other person to be president. Or it might be a senator or congressman or anyone else that you're electing, of course. But in the run-up to those elections, you have something quite interesting happening, which is many different organisations constructing polls of people in a particular area. So in the US, they're polling people in various key states around the country, and they're trying to infer from that how those people are going to vote: not just the people they interview, of course, but how the whole state is going to vote in the upcoming election. I think that underpins so much of statistics, just right there. You've got a big population of people, representing the population of the country or the population of the state, and you haven't got the resources to ask every single one of them, how are you going to vote? And even if you did, there might be some interesting things where they might lie to you, or they might not want to answer. So what they're going to do is ask 1,000 people, or 1,200 people, whatever, to give them their preference, and they will infer from that how the rest of the state or the rest of the country is going to vote in the election.
Yeah. And this sense of asking people seems so simple to me from the outset as a kind of experimental setup: the answer is, I mean, it's not yes or no, but essentially it's one or the other. You just have two boxes that you're going to tick here, right? But that simplicity is then diluted by all of these other factors that you've already mentioned around sampling, and by the fact that people are autonomous and can change their minds. The lesson is that this is not what I, as a physicist, would like to think of as a perfectly spherical experiment in a vacuum; everything in the lay of the land moves so much.
So one of the big questions is like when polls go wrong, you know, what, what kind of an effect can that have?
Yeah, I mean, you'd think it would be simple, wouldn't you? You think, oh, well, we just ask a few people, they'll tell us what they think, and then we'll just present that as the view of the overall voting population. But oh my word, it's so much more complicated. You have many different things to consider. So maybe as a case in point we'll have a look at the previous presidential election. This is when Hillary Clinton was standing against Donald Trump in 2016. The polls were coming out, really right up to the close of the election, saying that they thought Hillary Clinton was going to win the election. And when they delved into that, it turned out that when they'd been asking people how they were going to vote, they had been getting responses, and they also collect information about the person that they're asking. They ascertain if they're male or female, because gender is important, and age is important. Are they a senior? Are they just out of college? Do they have a job? What sort of job? And so on. So there's quite a lot of contextual information that gets collected when they do these polls. Then they have a look at the overall breadth of the sample: does it have enough people from the age group over 65, say? Does it have enough people in the 40s to 50s bracket, and so on? And does it have enough people in these various categories, where by enough I mean: is it basically reflecting the overall makeup of the population of that state?
Okay, right. But that sub sample is not representative.
Exactly. And so you get this whole notion of: is my subsample biased in some way? Did I just get unlucky? Did I not get quite the right mix of people? So in 2016, famously, they didn't quite get the right mix of people with a college education versus no college education, and they just didn't realise that was a big deal. It turned out that was, call it a prior, a factor which would massively impact whether you voted for Hillary Clinton or Donald Trump. Because they hadn't corrected for that, and they discovered in fact that they had a bias in favour of people who had a college education, their samples were coming out giving them a very strong indication that Hillary Clinton was going to win. Having since looked at that, they've now got lots of correcting factors in place to allow them to produce a more balanced sample, which then allows them to infer, hopefully, a more accurate poll result at the end of the day.
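The correction described here, reweighting a sample so that each group counts in proportion to its share of the population, can be sketched in a few lines. This is only an illustration: the group names, sample counts, population shares and support levels below are made-up assumptions, not figures from any real poll, and real pollsters weight on many variables at once.

```python
# Hypothetical numbers for illustration only; none of these come from a real poll.
population_share = {"college": 0.35, "non_college": 0.65}  # assumed true electorate mix
sample_counts = {"college": 600, "non_college": 400}       # sample over-represents graduates
support_for_a = {"college": 0.60, "non_college": 0.40}     # assumed support for candidate A


def raw_estimate(counts, support):
    """Support for A if every respondent counts equally (no correction)."""
    total = sum(counts.values())
    return sum(counts[group] * support[group] for group in counts) / total


def weighted_estimate(shares, support):
    """Support for A after weighting each group to its population share."""
    return sum(shares[group] * support[group] for group in shares)


raw = raw_estimate(sample_counts, support_for_a)
weighted = weighted_estimate(population_share, support_for_a)
print(f"unweighted: {raw:.1%}, weighted by education: {weighted:.1%}")
```

With these assumed numbers the unweighted sample puts candidate A ahead, while the education-weighted estimate puts A behind, which is the kind of reversal the 2016 discussion is pointing at.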
Yeah, this sounds straightaway like a kind of Bayes setup, right? You've got these priors that you're looking at, and then we've updated, just from this kind of top level, from a previous election: what have we learned, and we take that forward to the next one. I mean, in very simple terms, it's learning as we go. And what you're getting at as well sounds like part of the push that we have for turnout. Turnout is so important, because that's your full representation of the population, which is really interesting as well.
Yeah, I think you've got lots of unknowns when you're doing a poll. You might have worked really hard to get this lovely balanced sample in your poll, but then it comes to polling day, and actually seniors turn out in twice the numbers that you were expecting, or college graduates turn out in, you know, 20% higher numbers than you were expecting. Although you might have matched the voting population that lives in that state, you have to have a model for the population who are actually going to vote on election day. So the really good quality pollsters, and there's a notion now of there being pollsters who do it properly and pollsters who may be using slightly older techniques, will, simultaneously with collecting all of this contextual information from the people that they're polling, also ask them questions like: how likely are you to vote? And in some cases, of course, right now, as we're going to discuss, they'll have voted already. Something like 70 to 75 million Americans have already voted in this election, with the actual election day being, what is it, three days away or something, and that becomes the close of the election period now. But they've got to keep track of whether people have voted. So maybe if they say they've voted, they give them a probabilistic weighting of one that they are going to vote, because they have. Maybe if they say, yes, I'm very likely to vote, they only give them a weighting of maybe 90%, 0.9, that they're going to vote, because there's a high probability but it's maybe not entirely certain yet. And so on, down the ranking to someone who says, no, I'm not going to vote, and you would attribute them something to take account of that as well.
So that helps you then map your sample, demographically nicely balanced, to one which is going to match the actual voting sample, which is the one that matters. And I don't think there's necessarily a realisation that, you know, I mean, we call it democracy, and it's this wonderful thing, but actually there are so many things that can slightly get in the way of it. It could be bad weather, it could be long queues at the polling station...
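As a rough sketch of the likely-voter weighting just described, each respondent's stated preference can be counted in proportion to the probability that they actually vote. The weights of 1.0 (already voted) and 0.9 (very likely) follow the discussion; the 0.5 and 0.1 weights, the candidate labels and the five respondents are assumptions made up for the example.

```python
# Each respondent: (preferred candidate, assumed probability of actually voting).
# 1.0 and 0.9 follow the discussion; 0.5 and 0.1 are hypothetical fill-ins.
respondents = [
    ("A", 1.0),  # says they have already voted
    ("A", 0.9),  # says they are very likely to vote
    ("B", 0.9),  # says they are very likely to vote
    ("B", 0.5),  # undecided about voting (assumed weight)
    ("A", 0.1),  # says they will not vote (assumed small residual weight)
]


def unweighted_share(people, candidate):
    """Share of respondents preferring the candidate, ignoring turnout."""
    return sum(1 for choice, _ in people if choice == candidate) / len(people)


def turnout_weighted_share(people, candidate):
    """Share of expected votes for the candidate, weighting by turnout probability."""
    expected_votes = sum(weight for _, weight in people)
    for_candidate = sum(weight for choice, weight in people if choice == candidate)
    return for_candidate / expected_votes


share_plain = unweighted_share(respondents, "A")
share_weighted = turnout_weighted_share(respondents, "A")
print(f"A, unweighted: {share_plain:.1%}; A, turnout-weighted: {share_weighted:.1%}")
```

The weighted figure is the one that tries to describe the people who will actually cast a ballot, rather than everyone who picked up the phone.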
For the sake of argument, it could even be a massive pandemic virus. Yes.
And like how much of an effect is that going to have? Because I mean, I'm seeing things online, about massive queues, and the queues are socially distant. So just as a human, I see that and I think, well, how much do I want to vote? Because this is 10 blocks long. For example. Have I missed my window for a postal vote? Or am I just nervous? Am I not going to go out? Am I at risk?
Yeah, it puts everything into perspective. I mean, we're very lucky in the UK, and I don't think we necessarily appreciate it, that we have so many voting locations and, comparatively, I think, a really quite well resourced and non-partisan voting apparatus. Broadly, that seems to be true. But you look at the US, and you hear of a lot of people who waited in line for an hour and a half to vote, and that's a really good day in the voting firmament, almost. But then I've heard of people waiting in line for 11 hours to vote, and you think, how did you manage to squeeze this into a single day? And then of course, you're absolutely right, they've had to implement mitigating measures to take account of this. They had a primary season, which is where they nominate the candidates for each party, the Democratic and Republican parties, so the states who run these elections were able to refine this a little bit. So I think things are a little bit better set up than they were. But some states have introduced postal voting, some states have introduced mandatory postal voting across the state, some states have introduced absentee voting or made absentee voting very easy, which is very similar. Then there are other states like Texas that have put voting drop boxes around various counties, and that's created lots of issues. So there have been lots of different mitigations that people and states have put in place to try to get over this really difficult issue: to be honest, it's probably not great, at the height of a pandemic, to have people standing in line and then crowding into a sports centre or something to cast their vote. So they've had to think quite creatively around that.
And so all of these people are coming together and voting. You know, in Ireland, when we have a vote, it's proportional representation, right? There's a whole element of mathematical thinking that has to go into how that plays out, and it's not as simple as first past the post in some places. But we also have here the electoral college rules to consider.
Yes, it's probably worth taking a couple of minutes just to explain, for people who aren't entirely living and breathing the US elections at the moment, what exactly the context and the rules are for the election. So let me start with this: there are actually many elections happening, not just one. There's the presidential election, there are also elections to the Senate, there are elections to Congress, and then there are state elections as well. So there are lots and lots of different elections, but for the purposes of what we're talking about today, we're just going to focus on the US presidential election. You might think that, from one of the countries which likes to pride itself on its democratic credentials, you'd have a plebiscite: you'd have a poll, you'd count the number of people who'd voted for one candidate over another, and that would give you their favourite candidate, who would win and become president. But it's not quite as simple as that. What the founding fathers did was to say, we're going to introduce a form of electoral indirection in this process, whereby we'll introduce a way for people to vote for candidates who will, in turn, vote for the President. So just to give you an example, if you're voting in the state of Pennsylvania, you will cast your vote for either Vice President Biden or President Donald Trump, or one of a number of other smaller candidates, and then you'll tot up the total votes. The winner in that sort of first-past-the-post election within Pennsylvania will then get to nominate a slate of candidates, and the number of those candidates is important. For Pennsylvania, it'll be 20 candidates, who themselves will all be instructed to vote for, say, Donald Trump, if it was Donald Trump who won Pennsylvania, in what's called an electoral college.
So then all of these Electoral College candidates from all over the US come together, and they represent how their electorates individually voted, with the number of those candidates roughly in proportion to the populations of people within those states. So that's how it fits together. And for the people who want to know, there are 538 of these candidates in total, and so you need 270 of these college votes as your threshold, if you like, in order to be elected president.
It's amazing to think of it that way. When you look at the popular vote in the last election, Hillary Clinton won it by, I think it was, 3 million votes. Yeah. And yet the Electoral College, with the states represented that way, voted for Donald Trump. Yes. So it's a really interesting way to consider it and see how it plays out. And it's interesting as well to speculate, I mean, not speculate, we would look at the data, to see how the population of each state actually sits against its representation in the Electoral College, which is another way to slice it. Because when you look at the whole map, all of the less populous states in the middle tend to be Republican, so the map looks like it's very dominantly red when you have a colour against the different votes. It's a funny way to picture it, but you need to delve in and understand this.
Yeah, I completely agree. I think the geography of the US, where some states are much more densely populated and others much more sparsely populated, doesn't really give you a good feel for where these electoral college votes are coming from. I think you could redraw the map, and it would squish in from the sides. So you'd get very, very big looking states like California, Texas and New York, that represent substantial populations and would very much dominate an electoral college map, which you don't really get to see from the usual geographical representation.
And that seems like, even then, visually, based on that data, we would pull out a visual representation of those swing states, which are so important and are the ones that people are going to be looking at closely on election day.
Yeah, so I think it's important to just pick this apart, because it really does hang on this notion of a swing state and what a swing state really is. The way that a lot of the analysing organisations look at and read the race is by essentially ranking all of the states into a long line. You put all of the Democrat-leaning states over on the left-hand side of the line, say, and all of the Republican-leaning states on the right-hand side of the line, and then you rank them, so you have everything that's most Democrat over to the left. Then gradually, as you come towards the centre of the line, it becomes more marginal as to how the state is going to vote, but still a little bit Democrat, say. And then it switches over to the more marginal Republican states, which are likely to vote for Donald Trump, and that goes all the way over to the really dead-certain states that are definitely going to vote for Trump, on the far right-hand side. So the swing states are, obviously, the ones in the middle. These are the uncertain states where they could tip over to Trump or they could tip over to Biden, depending on fairly small fluctuations in voting turnout and intention on the day.
It's that really small fluctuation that makes it so interesting from a data point of view to me, because I would immediately think, why don't we just poll more heavily in the swing states? But you're still going to have to deal with that level of error that comes in off the sampling and the setup and the fluctuations that can happen right up until somebody actually goes and votes. So it's amazing to think how difficult this is to predict.
Yeah, it's incredibly difficult when it gets tight, because if you think about it, the population of California is like 40 million, and if you're polling maybe 2,000 people, that's a tiny, tiny percentage of the population. So even with all of these nice balancing techniques that you might use to adjust your poll sample, the idea that you would get to within 1%, or 0.1%, of the actual final vote on election day from a poll of 1,000 to 2,000 people is just very unlikely. Typically, when you're polling that number, you'll see that polling errors are given in the three to four and a half percent range. So that already gives you an indication that if the state is closer, in terms of its race between Donald Trump and Joe Biden, than 3%, then you've got a difficult prediction job on your hands, and it really means you probably want to be sampling a few more people in your poll to get that fine granularity and pick out those states. And luckily, we do, in the sense that those swing states are inherently interesting, and so lots of polling organisations conduct these polls. So you can stitch them together and get some belief, some probability, from the collection of all these hundreds of polls that are going on, even if it's very close.
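To put rough numbers on the sample sizes mentioned here, the sketch below uses the textbook normal-approximation margin of error for a simple random sample at 95% confidence. That choice of formula is an assumption on our part: it captures sampling error only, so the three to four and a half percent figures quoted by real pollsters, which also absorb weighting and other design effects, would be expected to sit above it. The function names are hypothetical.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1.0 - p) / n)


def sample_size_for(margin, p=0.5, z=1.96):
    """Smallest simple random sample whose margin of error is at most `margin`."""
    return math.ceil(z ** 2 * p * (1.0 - p) / margin ** 2)


moe_1000 = margin_of_error(1000)
moe_2000 = margin_of_error(2000)
n_for_one_percent = sample_size_for(0.01)

print(f"n=1000: +/-{moe_1000:.1%}")
print(f"n=2000: +/-{moe_2000:.1%}")
print(f"respondents needed for +/-1%: {n_for_one_percent}")
```

Note that the size of the state does not appear in the formula: for populations in the millions, the error is driven by the number of people polled, which is why doubling the sample from 1,000 to 2,000 only trims the margin modestly.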
So I was going to ask you then about how we bring together all these data sets, and what type of methods we might delve into. I mean, even right there, bringing together different polls sounds like a version of your ensemble approach: you're trying to balance out some of the stronger and weaker polling techniques, maybe.
Yeah, it's a really, really deep topic. And I have to say, I'm sorry, I'm going to defer now to the people who are doing a really amazing job of this. There's an organisation people may have heard of called 538, and they have an overall ensemble method of aggregating all of these polls as they come in. They attribute a level of certainty to them, judging by how well they're carried out on some methodological basis, and they look at the size of the sample. They then take all of these polls for a given state and are able to come up with a probability of loss or victory for a given candidate. So for instance, I looked last night, and Wisconsin is 93% likely to vote for Joe Biden and therefore, with the current statistics and polls, 7% likely to vote for Donald Trump. So that's the sort of result they can generate from all of this polling data. They've got to really be all over the possible errors and methodological approaches to allow them to do this.
I love that on the website it says their election forecast runs 40,000 simulations every time it updates. It's just a lovely example of the power of data and computational resource right now. I mean, I imagine that's updating pretty frequently. So it's a really cool example of pulling together so much data and building that pipeline.
It feels to me, I don't know, like they're building these nice probabilistic models of each state, and then they're running what we call a Monte Carlo simulation. They're essentially rolling biased dice. So for Wisconsin, it will be a dice which 93% of the time falls for Joe Biden and 7% of the time falls for Donald Trump, and then they're seeing what that entails. It's quite interesting, because of the way the electoral college works that we talked about: if the dice comes out in favour of Joe Biden, then he gets all 10 electoral college votes, whereas if it comes out in favour of Donald Trump, even though that happens very infrequently, then Donald Trump gets all 10 electoral college votes for Wisconsin. So it's very bitty in that sense. There's no partition of these electoral college votes, except, I have to say, in two cases, I think it's Maine and Nebraska, where they do actually partition them a bit more carefully, and so you could end up with a split going out to the different candidates. But broadly, in all the other states, that doesn't happen: it's winner takes all for the winner of that state.
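The biased-dice picture for a single state can be sketched as below. It uses the Wisconsin figures from the conversation (93% for Biden, 10 electoral college votes, 40,000 runs); the function names and the fixed random seed are our own assumptions, and this is a toy illustration of the idea rather than a description of how 538's forecast is actually implemented.

```python
import random


def roll_state(p_biden, electoral_votes, rng):
    """One winner-takes-all roll: returns (Biden's votes, Trump's votes) for the state."""
    if rng.random() < p_biden:
        return electoral_votes, 0
    return 0, electoral_votes


def simulate(p_biden=0.93, electoral_votes=10, n_sims=40_000, seed=2020):
    """Roll the biased dice n_sims times and return the share of runs Biden wins."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    biden_wins = 0
    for _ in range(n_sims):
        biden_votes, _trump_votes = roll_state(p_biden, electoral_votes, rng)
        if biden_votes > 0:
            biden_wins += 1
    return biden_wins / n_sims


biden_share = simulate()
print(f"Biden takes all 10 Wisconsin votes in {biden_share:.1%} of simulated runs")
```

Each run hands all 10 votes to one candidate or the other, so the 93% only shows up once many runs are tallied, which is the "bitty" behaviour described above.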
So with all of these methods and approaches in mind, how should we think about taking benefit from this in our data science perspective?
What's really interesting about the probabilities that these organisations, and 538 in particular, come up with is that they are, very obviously, highly interdependent, what we call correlated from a statistical point of view. So...
I wondered that, because even in those 40,000 simulations it felt like some of them will have one outcome within them that has a knock-on effect on other outcomes. So it's not just the roll of the dice: there's a dependency on how the dice last rolled for that other state, and then you roll for the next one.
It's factors like that. And it's factors like, in their model, they'll have a certain proportion of the probability which is attributed to what the polls say, what the last five polls say in that state, and then a factor of their model which is just derived from the fact that Donald Trump is the current president of the United States, and that tends to give the incumbent a bit of an advantage over the challenger. And there are other nationwide sentiments which cross state boundaries and will tend to drive opinion in Florida and in Texas, and in South Dakota and in Oregon. So it's not just a set of totally independent elections that are happening in their own bubble. There is this nationwide picture, which does drive a lot of voting sentiment, and so they have to take all of that into account when they're constructing these.
Right. So that's where, if one state has rolled its dice and that comes out a certain way, the next state's dice roll is kind of updated based on that information in our simulations.
Oh, that's so right. So if, for the sake of argument, Donald Trump wins Florida, then that's really significant, because then the probability that Donald Trump wins Arizona is not just a standalone probability. It's what you referred to earlier, a Bayesian probability: the probability that Donald Trump wins Arizona, given that he's won Florida, and indeed given that he's won or not won lots of other states. The exact configuration of votes that they see coming in on election night will drastically change the probability of winning in individual states. So it's a very good example where independence between these states is not a given and needs to be taken into account. But interestingly, we can still do something quite naive and quite simplistic, and still be quite useful, even in quite a difficult environment like a national election.
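One simple way to see how a shared nationwide mood makes state results depend on each other is a two-scenario toy model. Everything numeric here is a made-up assumption for illustration: a "good night" and a "bad night" for one candidate, each equally likely, with the two states independent only once the national mood is fixed.

```python
# Hypothetical scenarios: (probability of scenario,
#                          P(win Florida | scenario), P(win Arizona | scenario)).
scenarios = [
    (0.5, 0.8, 0.8),  # national mood favours the candidate
    (0.5, 0.2, 0.2),  # national mood goes against the candidate
]

# Standalone (marginal) probabilities, averaging over the national mood.
p_florida = sum(w * fl for w, fl, _az in scenarios)
p_arizona = sum(w * az for w, _fl, az in scenarios)

# Within a scenario the two states are treated as independent, so the joint is fl * az.
p_both = sum(w * fl * az for w, fl, az in scenarios)

# Conditional probability of Arizona given Florida.
p_arizona_given_florida = p_both / p_florida

print(f"P(Arizona) = {p_arizona:.2f}")
print(f"P(Arizona | Florida) = {p_arizona_given_florida:.2f}")
```

Under these assumptions, learning that Florida has been won makes the favourable national scenario more plausible, so the Arizona probability rises above its standalone value, which is the effect an independent treatment ignores.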
And you said it's naive. Are we getting a lead in here to Naive Bayes by any chance?
Exactly. So, Naive Bayes. For those of you who've looked at this technique before, you look at your features in a machine learning context or an NLP context. And although you're trying to calculate quite an involved conditional probability, what's the probability I see this feature, given that I've had all of these other features present themselves in my model, what you tend to do is assume that they are actually independent. You do the calculation, which is much easier if you have independent random variables anyway, and you come up with something which gives you an approximation to the end result. So in the context of the election, there's a very nice, serendipitous sort of picture which has just emerged, which I think is quite fun. There are three states, as of today, where Vice President Joe Biden has roughly a two-thirds probability of winning, and they happen to be in a position in this electoral line of states that we talked about where they are very, very significant. The three states are Florida, North Carolina and Arizona. If Joe Biden happens to win any of those states, then the election pivots wildly into his favour: the probability that Joe Biden wins the election goes to somewhere north of 99%, and it becomes a virtual certainty. And that's because of the number of electoral delegates that each of those states has: for Florida, it's 29, for North Carolina, 15, and for Arizona, 11. There are other states as well, but it just jumped out at me that there was this probability of two thirds that he was going to win each of them, based on the 538 model and all its polling data.
And I thought, that seems quite surprising, that these overall polling models are saying that Joe Biden's got about a 10 or 11% chance of not winning, given that there are these three pivotal states which would drive it. So I did the calculation, and I went, right, I'm going to do this naively. I'm going to use a naive Bayes approach and assume that they're all independent; I'm going to essentially toss a coin. What do I get? Well, it turns out to be a very easy calculation to do if it were a coin. If it wasn't a probability of two thirds but just a probability of a half, and Joe Biden only had to get a head, if you like, in one of these states, then very naively his overall win probability would be 87.5%, and that's with a probability of a half. With a probability of two thirds, it goes up into the 90s. It's actually about a 96% probability that Joe Biden will win at least one of Florida, North Carolina and Arizona. He might win two, or he might win all three, or he might only win one, but when you add up all of those independent outcomes, you get this 96% figure. Now we have to be a bit careful, because I didn't say it was absolutely certain that Joe Biden would win the election if he were to win one of those states. I said it was about a 99% probability that he would win the election, which seems pretty high to me. So you factor that in, and the probability drops to some 94 or 95%, something like that. But that's still a good 5% higher than the current models are showing for Joe Biden winning overall: they're currently showing him at an 89 to 90% probability of winning. But these three states are so important, and they have quite a high probability at the moment. And of course, the probabilities may change, as we know.
I think that means there is a lot of inherent uncertainty and conservatism, even now, in the models around awarding these very high numbers to the Democratic candidate, Joe Biden.
Wow, you can really tell how edge of the seat this is and how edge of the seat it's going to be on Tuesday. So I imagine that will be keeping you up all night, Jeremy, and probably a lot of people around the world, and it's really interesting.
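The back-of-envelope calculation described above can be reproduced directly. The inputs (three states, a win probability of one half or two thirds in each, and a 99% chance of winning the election given at least one of them) are the figures from the conversation; treating the states as independent is the deliberately naive assumption, and the function name is ours.

```python
def p_at_least_one(p_win, n_states):
    """Probability of winning at least one of n independent states, each with probability p_win."""
    return 1.0 - (1.0 - p_win) ** n_states


# Coin-toss version: probability of a half in each of the three states.
coin_toss = p_at_least_one(0.5, 3)

# Two-thirds version, as quoted for Florida, North Carolina and Arizona.
two_thirds = p_at_least_one(2.0 / 3.0, 3)

# Factor in the roughly 99% chance of winning the election given at least one of the three.
# This ignores any route to victory that wins none of them, so it is a slight underestimate.
naive_overall = two_thirds * 0.99

print(f"at least one of three, p = 1/2: {coin_toss:.1%}")
print(f"at least one of three, p = 2/3: {two_thirds:.1%}")
print(f"naive overall win probability:  {naive_overall:.1%}")
```

The results land at 87.5% for the coin toss, a little over 96% for the two-thirds case, and a little over 95% once the 99% factor is applied, in line with the figures quoted. Because the states are in reality correlated, the true chance of losing all three together would be expected to be higher than this independent calculation suggests, which is one reason a full model can be more cautious.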
Yeah, I have a very horrible feeling. We're going to be kept up all night on Wednesday and Thursday as well.
Yeah, it will go that way. Thanks, Jeremy. Thanks Jason.
Thanks for joining us today at the Data Cafe, you can like and review this on iTunes or your preferred podcast provider, or if you'd like to get in touch, you can email us Jason at datacafe.uk or Jeremy at datacafe.uk or on Twitter at datacafepodcast. We'd love to hear your suggestions for future episodes.