Can polls be reliable?

Election polls in many countries have seemed unusually unreliable recently. Why? And can they be fixed?

The most basic observation is that if one has a random sample of a population in which x% has some attribute then it is reasonable to estimate that x% of the whole population has that attribute, and that this estimate will tend to be more accurate the larger the sample is. In some polls sample size can be an issue, but not in the main political polls.
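
As a rough illustration of how sampling error shrinks with sample size, here is a minimal Python sketch (the 40% population share is invented):

    import random

    def mean_poll_error(share: float, n: int, trials: int = 2000) -> float:
        """Average absolute error of the sample proportion over simulated polls."""
        errors = []
        for _ in range(trials):
            hits = sum(random.random() < share for _ in range(n))
            errors.append(abs(hits / n - share))
        return sum(errors) / trials

    # The error shrinks roughly like 1/sqrt(n): quadrupling the sample halves it.
    for n in (100, 400, 1600):
        print(n, round(mean_poll_error(0.40, n), 4))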

A fundamental problem with most polls is that the ‘random’ sample may not be truly representative of the population, with some sub-groups over- or under-represented. Political polls have some additional issues that are sometimes blamed:

  • People with certain opinions may be reluctant to express them, or may even mislead.
  • There may be a shift in opinions with time, due to campaigns or events.
  • Different groups may differ in whether they actually vote, for example depending on the weather.

I also think that in the UK the trend to postal voting may have confused things, as postal voters will have missed out on the later stages of campaigns, and on later events, which were significant in the UK 2017 general election.

Pollsters have a lot of experience at compensating for these distortions, and are increasingly using ‘sophisticated mathematical tools’. How is this possible, and is there any residual uncertainty?

Back to mathematics: suppose that we have a science-like situation in which we know which factors (e.g. gender, age, social class, …) are relevant. With a large enough sample we can partition the results by combination of factors, measure the proportions for each combination, and then combine these proportions, weighting by the prevalence of the combinations in the whole population. (More sophisticated approaches are used for smaller samples, but these can only reduce the statistical reliability.)
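
A minimal sketch of this weighting step, with a single invented factor (age band) and made-up proportions:

    # Each cell: observed support within the sampled cell, and the cell's
    # (assumed known) share of the whole population.
    cells = {
        "18-34": {"support": 0.60, "prevalence": 0.30},
        "35-54": {"support": 0.45, "prevalence": 0.35},
        "55+":   {"support": 0.30, "prevalence": 0.35},
    }

    # Weight each cell's proportion by its population prevalence and sum.
    estimate = sum(c["support"] * c["prevalence"] for c in cells.values())
    print(round(estimate, 2))  # ~0.44, versus a crude unweighted mean of 0.45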

Systematic errors can creep in in two ways:

  1. Instead of relying on the poll data alone, one may use ‘laws of politics’ (such as the effect of rain) or other heuristics (such as assuming that the swing among postal votes will be similar to that for votes in person), and these may be wrong.
  2. An important factor is missed. (For example, people with teenage children or grandchildren may vote differently from their peers when student fees are an issue.)

These issues have analogues in the science lab. In the first case one is using the wrong theory to interpret the data, and so the results are corrupted. In the second case one has some unnoticed ‘uncontrolled variable’ that can really confuse things.

A polling method using fixed factors and laws will only be reliable when these reasonably accurately reflect the attributes of interest, and not when ‘the nature of politics’ is changing, as it often does and as it seems to be right now in North America and Europe. (According to game theory one should expect such changes when coalitions change or are under threat, as they are.) To do better, the polling organisation would need to understand the factors that the parties were bringing into play at least as well as the parties themselves, and possibly better. This seems unlikely, at least in the UK.

What can be done?

It seems to me that polls used to be relatively easy to interpret, possibly because they were simpler. Our more sophisticated contemporary methods make more detailed assumptions. To interpret them we would need to know what these assumptions were. We could then ‘aim off’, based on our own judgment. But this would require pollsters to publish some details of their methods, which they are naturally loth to do. So what could be done? Maybe we could have some agreed simple methods and publish findings as ‘extrapolations’ to inform debate, rather than as predictions. We could then factor in our own assumptions (for example, about student turnout).

So, I don’t think that we can expect reliable poll findings that are predictions, but possibly we could have useful poll findings that would inform debate and allow us to take our own views. (A bit like any ‘big data’.)

Dave Marsay

 

Uncertain Urns Puzzle

A familiar probability example, using urns, is adapted to illustrate ‘true’ (non-numeric) uncertainty.

Simple situation

The following is a good teaching example:

Suppose that an urn is known to contain black and white balls that are otherwise identical. A subject claims to be able to predict the colour of a ball that they draw ‘at random’.

They ‘predict’ and draw a black ball. What are the odds that they are really able to predict?

From a Bayesian perspective, the final odds are the initial odds times the likelihood ratio. If there are b black and w white balls and we represent the evidence by E and likelihoods by P( E | ), then P( E | Predict ) = 1 and P( E | Luck ) = b/(b+w). Thus the rarer the phenomenon predicted, the more a correct prediction tends to support the claim of reliable prediction.
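
In code, with illustrative numbers (the prior odds of 0.01 are invented):

    def posterior_odds(prior_odds: float, b: int, w: int) -> float:
        """Odds on 'can predict' after one correct prediction of a black draw.

        P(E | Predict) = 1 and P(E | Luck) = b/(b + w), so the likelihood
        ratio is (b + w)/b: the rarer the colour, the stronger the update.
        """
        return prior_odds * (b + w) / b

    print(posterior_odds(0.01, b=50, w=50))  # 0.02: an even urn barely helps
    print(posterior_odds(0.01, b=1,  w=99))  # 1.0: a rare colour is strong evidence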

Common quibbles

There is, however, some subjectivity in the estimated probability that the subject can predict:

  • In this case, the initial odds seem somewhat arbitrary, and Bayes’ rule seems not to apply. For example, have you considered that the different colours may result in different temperatures, so that the subject could tell the balls apart by touch? Such a thought is not ‘evidence’ in the sense of Bayes’ rule, but might change your subjective estimate of the probability prior to their draw.
  • If we do not know the proportions of black and white balls for sure then the likelihood is uncertain.

Multiple urns

Here we introduce a different type of uncertainty:

Suppose now that the subject is faced with two urns and selects a ball from one. Given the number of black and white balls in each urn, what is the likelihood, P( E | Luck ), of a correct prediction due to luck?

If you think the question is ambiguous, please disambiguate it however you wish.

Suppose you know the total numbers of black and white balls in the two urns. Is the likelihood estimate P( E | Luck) = b/(b+w) reasonable? Could it be biased? How?
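
One possible disambiguation, simulated below: each urn holds ten balls, with ten black and ten white overall, and a ‘lucky’ subject picks an urn at random and guesses that urn’s majority colour. The pooled estimate b/(b+w) = 0.5 then understates the chance of a lucky success, biasing the test in favour of the claimed predictor. (A Python sketch under these assumptions only.)

    import random

    urns = [(9, 1), (1, 9)]  # (black, white) per urn: 10 black, 10 white overall

    def lucky_trial() -> bool:
        """Pick an urn at random, guess its majority colour, draw at random."""
        b, w = random.choice(urns)
        guess_black = b >= w
        drew_black = random.random() < b / (b + w)
        return guess_black == drew_black

    trials = 100_000
    print(sum(lucky_trial() for _ in range(trials)) / trials)  # ~0.9, not 0.5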

See Also

A legal example. Other, similar, puzzles.

Dave Marsay 

Bretton Woods: Modelling and Economics

The Institute for New Economic Thinking has a video on modelling and economics. It is considerably more interesting than it might have been before the financial crises beginning in 2007. I make a few points from a mathematical perspective.

  • There is a tendency to apply a ‘canned’ model, varying a few parameters, rather than to engage in genuine modelling. The difference makes a difference. In the run-up to the crises of 2007 on there was widespread agreement on key aspects of economic theory, and some fixed models came to be treated as ‘fact’. In this sense, modelling had stopped. So maybe proper modelling in economics would be a useful innovation? 😉
  • Milton Friedman distinguishes between models that predict well (short-term) and those that have ‘realistic’ micro-features. One should also be concerned about the typical behaviours of the model.
  • One particularly needs, as Keynes did, to distinguish between short-run and long-run models.
  • Models that are judged solely by their ability to predict short-run events will tend to neglect significant events (e.g. crises) that occur over a longer time-frame, and to fall into the habit of extrapolating from current trends, rather than seeking to model potential changes to the status quo.
  • Again, as Keynes pointed out, in complex situations one often cannot predict the long-run future, but only anticipate potential failure modes (scenarios).
  • A single model is at best a possible model. There will always be alternatives (scenarios). One at least needs a representative set of credible models if one is to rely on them.
  • As Keynes said, there is a reflexive relationship between one’s long-run model and what actually happens. Crises mitigated are less likely to happen. A belief in the inevitable stability of the status quo increases the likelihood of a failure.
  • Generally, as Keynes said, the economic system works because people expect it to work. We are part of the system to be modelled.
  • It is better for a model to be imprecise but reliable than to be precisely wrong. This particularly applies to assumptions about human behaviour.
  • It may be better for a model to have some challenging gaps than to fill those gaps with myths.

Part 2 ‘Progress in Economics’ gives the impression that understanding crises is what is most needed, whereas much of the modelling video used language that seems more appropriate to adding epicycles to our models of the new status quo – if we ever have one.

See Also

Reasoning in a complex, dynamic, world, Which mathematics of uncertainty?, Keynes’ General Theory

Dave Marsay

Illustrations of Uncertainty

Some examples of uncertainty, based on those invented by others. As such, they are simpler than real examples. See Sources of uncertainty for an overview of the situations and factors referred to.

Pirates: Predicting the outcome of a decision that you have yet to make

Jack Sparrow can’t predict events that he can influence. Here we generalise this observation, revealing limits to probability theories.

In ‘Pirates of the Caribbean’ the hero, Captain Jack Sparrow, mocks the conventions of the day, including probability theory. In ‘On Stranger Tides’ when asked to make a prediction he says something like ‘I never make a prediction on something that I will be able to influence’. This has a mundane interpretation (even he can’t predict what he will do). But it also suggests the following paradox.

Let {Ei} be a set of possible future events dependent on a set, {Dj}, of possible decisions. Then, according to probability theory, for each i, P(Ei) ≡ ∑j P(Ei|Dj)·P(Dj).

Hence to determine the event probabilities, {P(Ei)}, we need to determine the decision probabilities, {P(Dj)}. This seems straightforward if the decision is not dependent on us, but is problematic if we are to make the decision.
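
In code, the decomposition looks innocuous (all numbers invented):

    # P(Ei) = sum over j of P(Ei | Dj) * P(Dj)
    P_D = {"D1": 0.7, "D2": 0.3}                      # decision probabilities {P(Dj)}
    P_E_given_D = {"D1": {"E1": 0.9, "E2": 0.1},
                   "D2": {"E1": 0.4, "E2": 0.6}}      # conditionals {P(Ei|Dj)}

    P_E = {e: sum(P_E_given_D[d][e] * P_D[d] for d in P_D) for e in ("E1", "E2")}
    print(P_E)  # ~{'E1': 0.75, 'E2': 0.25}

The arithmetic is trivial; the difficulty is whether the inputs {P(Dj)} mean anything when we ourselves are the decider.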

According to Bayes’ rule the probability of an event only changes when new evidence is received. Thus, if we assign a decision a particular probability, it is problematic to change our mind without receiving more information.

As an example, suppose that an operation on our child might be beneficial, but has to be carried out within the next hour. The pros and cons are explained to us, and then we have an hour to decide, alone (in an age before modern communications). We are asked how likely we are to go ahead initially, then at half an hour, then for our final decision. It seems obvious that {P(Dj)} would most likely change, if only to become more definite. Indeed, it is in the nature of making a decision that it should change.

From a purely mathematical perspective, there is no problem. As Keynes emphasized, not all future events can be assigned numeric probabilities: sometimes one just doesn’t know. ‘Weights of evidence’ are more general. In this scenario we can see that initially {P(Dj)} would be based on a rough assessment of the evidence, and the rest of the time spent weighing things up more carefully, until finally the pans tip completely and one has a decision. The concept of probability, beyond weight of evidence, is not needed to make a decision.

We could attempt to rescue probabilities by supposing that we only take account of probability estimates that take full account of all the evidence available. Keynes does this, by taking probability to mean what a super-being would make of the evidence, but then our decision-maker is not a super-being, and so we can only say what the probability distribution should be, not what it is ‘likely’ to be. More seriously, in an actual decision such as this the decision-makers will be considering how the decision can be justified, both to themselves and to others. Justifications often involve stories, and hence are creative acts. It is hard to see how an outsider, however clever, could determine what should be done. Thus even a Keynesian logical probability does not seem applicable.

Area

Wittgenstein pointed out that if you could arrange for darts to land with a uniform probability distribution on a unit square, then the probability of the dart landing on a sub-set of the square would equal its area, and vice-versa. But some sub-sets (such as Vitali sets) are not measurable, so some (admittedly obscure) probabilities would be paradoxical if they existed.

Cabs

Tversky and Kahneman, working on behavioural economics, posed what is now a classic problem:

A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
(i) 85% of the cabs in the city are Green and 15% are Blue;
(ii) A witness identified the cab as a Blue cab.
The court tested his ability to identify cabs under the appropriate visibility conditions. When presented with a sample of cabs (half of which were Blue and half of which were Green) the witness made correct identifications in 80% of the cases and erred in 20% of the cases.

Question: What is the probability that the cab involved in the accident was Blue rather than Green?

People generally say 80%, whereas Kahneman and Tversky, taking account of the base rate using Bayes’ rule, gave 41%. This is highly plausible and generally accepted as an example of the ‘base rate fallacy’. But this answer seems to assume that the witness is equally accurate on both types of cab, and from an uncertainty perspective we should challenge all such assumptions.

If the witness has lived in an area where most cabs are Green then they may tend to call cabs Green when in doubt, and only call them Blue when they are clear. When tested they may have stuck with this habit, or may have corrected for it. We just do not know. It is possible that the witness never mistakes Green for Blue, in which case the required probability is 100%. This might happen if, for example, the Blue cabs had a distinctive logo that the witness (who might be colour-blind) used as a recognition feature. At the other extreme (for example, if Green cabs had a distinctive logo), if Blue cabs are never mistaken for Green the required probability is 31%.
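
These figures can be checked by a one-line application of Bayes’ rule, treating the witness’s two error rates as separate assumptions (each variant below is consistent with the tested 80% overall accuracy on a 50/50 sample):

    def p_blue(base_blue=0.15, hit_blue=0.8, false_blue=0.2):
        """P(cab is Blue | witness says Blue).

        hit_blue = P(says Blue | Blue); false_blue = P(says Blue | Green).
        """
        base_green = 1 - base_blue
        return (hit_blue * base_blue) / (hit_blue * base_blue + false_blue * base_green)

    print(p_blue())                              # ~0.41: the textbook answer
    print(p_blue(hit_blue=1.0, false_blue=0.4))  # ~0.31: never misses a Blue cab
    print(p_blue(hit_blue=0.6, false_blue=0.0))  # 1.0: never mistakes Green for Blue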

Finally, a witness would normally have the option of saying that they were not sure. In this case it might be reasonable to suppose that they would only say that the cab was Blue if – after taking account of the base rate – the probability was reasonably high, say 80%. Thus an answer of 80% seems more justifiable than the official answer of 41%, but it might be better to give a range of answers for different assumptions, which could then be checked. (This is not to say that people do not often neglect the base rate when they shouldn’t, but simply that the normative theory being used was not fully reliable.)

Tennis

Gärdenfors, Peter & Nils-Eric Sahlin. 1988. Decision, Probability, and Utility includes the following example:

Miss Julie … is invited to bet on the outcome of three different tennis matches:

  • In Match A, Julie is well-informed about the two players. She predicts that the match will be very even.
  • In Match B, Julie knows nothing about the players.
  • In Match C, Julie has overheard that one of the players is much better than the other but—since she didn’t hear which of the players was better—otherwise she is in the same position as in Match B.

Now, if Julie is pressed to evaluate the probabilities she would say that in all three matches, given the information she has, each of the players has a 50% chance of winning.

Miss Julie’s uncertainties, following Keynes, are approximately [0.5], [0,1] and {0,1}. That is, they are like those of a fair coin, a coin whose bias is unknown, or a coin that is the same on both sides, though we do not know if it is ‘heads’ or ‘tails’. If Miss Julie is risk-averse she may reasonably prefer to bet on match A rather than on either of the other two.
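
A small simulation (not part of the original example) makes the distinction concrete: a single bet looks like 50% in all three matches, but repeated bets on the same match-up behave very differently:

    import random

    def draw_p(kind: str) -> float:
        """The unknown chance that player 1 wins, under each kind of uncertainty."""
        if kind == "A": return 0.5                   # [0.5]: like a fair coin
        if kind == "B": return random.random()       # [0,1]: unknown bias
        return random.choice([0.0, 1.0])             # {0,1}: one-sided, side unknown

    n = 100_000
    for kind in "ABC":
        single = sum(random.random() < draw_p(kind) for _ in range(n)) / n
        both = 0
        for _ in range(n):
            p = draw_p(kind)                         # one match-up, two plays
            both += (random.random() < p) and (random.random() < p)
        print(kind, round(single, 2), round(both / n, 2))
    # Single wins: ~0.5 in every case. Double wins: ~0.25, ~0.33, ~0.50.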

The difference can perhaps be made clearer if a friend of Miss Julie’s, Master Keynes, offers an evens bet on a match, as he always does. For match A Miss Julie might consider this fair. But for matches B and C she might worry that Master Keynes may have some additional knowledge and hence an unfair advantage.

Suppose now that Keynes offers odds of 2:1. In match A this seems a good offer. In match C it seems unfair, since if Keynes knows which player is better he will still have the better side of the bet. In match B things are less clear. Does Keynes know Miss Julie’s estimate of the odds? Is he under social pressure to make a fair, perhaps generous, offer? In deciding which matches to bet on, Miss Julie has to consider very different types of factor, so in this sense ‘the uncertainties are very different’.

(This example was suggested by Michael Smithson.) 

Shoes

If a group have a distinctive characteristic, then the use of whole population likelihoods for an in-group crime is biased against a suspect.

For example, suppose that a group of 20 social dancers all wear shoes supplied by X all the time. One of them murders another, leaving a clear shoe-mark. The police suspect Y and find matching shoes. What is the weight of evidence?

If the police fail to take account of the strange habits of the social group, they may simply note that X supplies 1% of the UK’s shoes, and use that to inform the likelihood, yielding moderate evidence against Y. But the most that one should deduce from the evidence is that the murderer was likely to be one of the dance group.
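
A sketch of the two calculations, using the 1% figure from the example and expressing the weight of evidence as log10 of the likelihood ratio:

    import math

    def weight_of_evidence(p_match_if_source: float, p_match_if_not: float) -> float:
        """Weight of evidence, in bans (log10 of the likelihood ratio)."""
        return math.log10(p_match_if_source / p_match_if_not)

    # Police view: reference class = general public, where X supplies 1% of shoes.
    print(weight_of_evidence(1.0, 0.01))  # 2.0 bans: looks like strong evidence

    # In-group view: reference class = the dance group, who all wear X's shoes.
    print(weight_of_evidence(1.0, 1.0))   # 0.0 bans: the mark singles out no one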

The problem here is that many (most?) people do belong to some group or groups with whom they share distinctive characteristics.

More illustrations

Yet to be provided.

See also

mathematics, paradoxes.

Dave Marsay

Induction and epochs

Introduction

Induction is the basis of all empirical knowledge. Informally, if something has never or always been the case, one expects it to continue to be never or always the case: any change would mark a change in epoch. 

Mathematical Induction

Mathematical induction concerns mathematical statements, not empirical knowledge.

Let S(n) denote a statement dependent on an integer variable, n.
If:
    For all integers n, S(n) implies S(n+1), and
    S(k) for some integer k,
Then:
    S(i) for all integers i ≥ k .

This, and variants on it, is often used to prove theorems for all integers. It motivates informal induction.

Statistical Induction

According to the law of large numbers, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. Thus:

For two or more sufficiently large sets of results obtained by random sampling from the same distribution, the averages should be close, and will tend to become closer as more trials are performed.

In particular, if one set of results, R1, has been obtained and another, R2, will be obtained, then, using the language of probability theory, if C() is a condition on results:

P(C(R2)) = p(C(R1)), where P() is the probability and p() is the proportion.

Alternatively, p() could be given as a hypothesis and tested against the data. Note that for any given quantity of data, rare events cannot be excluded, and so one can never be sure that any p(x) is ‘objectively’ very small. That is, the ‘closeness’ in the law of large numbers always has some non-zero tolerance.
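
A quick simulation of this closeness, and of its non-zero tolerance (the 30% proportion is invented):

    import random

    def proportion(n: int, p: float = 0.3) -> float:
        """Proportion of successes in n Bernoulli(p) trials."""
        return sum(random.random() < p for _ in range(n)) / n

    # Two samples from the same distribution: the gap between their proportions
    # tends to shrink as n grows, but never reliably reaches zero.
    for n in (100, 10_000, 1_000_000):
        print(n, round(abs(proportion(n) - proportion(n)), 4))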

A key assumption of statistical induction is that there exists a stable ‘expectation’. This is only true within some epoch where the trials depend on that epoch, and not on any sub-epochs. In effect, the limits on an epoch are determined by the limits on the law of large numbers.

Empirical Induction

In practice we don’t always have the conditions required for straightforward statistics, but we can approximate. Using the same notation as above:

P(C(R2)) = p(C(R1)),

provided that R1, R2 are in the same epoch. That is, where:

  • The sampling was either unbiased, had the same bias in the two cases or at least was not conditional on anything that changed between the two cases.
  • For some basis of hypotheses, {H}, the conditional likelihoods P(data|H) are unchanged between the two cases.

Alternatively, we can let A = “same epoch” denote the above assumptions and write:

P(C(R2)|A) = p(C(R1)).

Induction on Hypotheses

Statistical induction only considers proportions. The other main case is where we have hypotheses (e.g. models or theories) that fit the past data. If these are static then we may expect some of the hypotheses that fit to be ‘true’ and hence to continue to fit. That is:

If, for all i in some index set I, hypotheses Hi fit the current data (R1), then, for some subset J of I, by default one expects that for all j in J, Hj will continue to fit (for future data, R2).

As above, there is an assumption that the epoch hasn’t changed.

Often we are only interested in some of the parameters of a hypothesis, such as a location. Even if all the theories that fit the current data virtually agree on the current value of the parameters of interest, there may be radically different possibilities for their future values, perhaps forming a multi-modal distribution. (For example, if we observe an aircraft entering our airspace, we may be sure about where it is and how fast it is flying, but have many possible destinations.)
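
A minimal sketch of this: two models (hypothetical, fitted here with numpy) agree closely on a track’s current position but can disagree widely when extrapolated:

    import numpy as np

    # Five observed positions that both models fit almost exactly.
    t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    x = np.array([0.0, 1.0, 2.1, 2.9, 4.0])

    straight = np.polyfit(t, x, 1)  # 'keeps flying straight'
    curved   = np.polyfit(t, x, 3)  # 'is starting to turn'

    for model in (straight, curved):
        print(np.polyval(model, 4.0), np.polyval(model, 10.0))
    # Both give x ~ 4 now; at t = 10 they can differ widely. Agreement on
    # current parameter values does not imply agreement on future ones.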

Pragmatic induction

One common form of pragmatism is where one has an ‘established’ model or belief which one goes on using (unquestioningly) unless and until it is falsified. By default the assumption A, above, is taken to be true. Thus one has

P(C(R2)) = p(C(R1)),

unless there is definite evidence that P() will have changed, e.g. a biased sample or an epochal change of the underlying random process. In effect, pragmatism assumes that the current epoch will extend indefinitely.

Rationalizing induction

The difference between statistical and pragmatic induction is that the former makes explicit the assumptions of the latter. If one has a pragmatic claim, P(C(R2)) = p(C(R1)), one can in effect recover the rigour of the statistical approach by noting when, where and how the data supporting the estimate was sampled, compared with when, where and how the probability estimate is to be applied. (Thus it might be pragmatic – in this pedantic sense – to suppose that if our radar fails temporarily, all airplanes will have continued flying straight and level, but not necessarily sensible.)

Example

When someone, Alf, says ‘all swans are white’ and a foreigner, Willem, says that they have seen black swans, we should consider whether Alf’s statement is empirical or not, and if so what its support is. Possibly:

    • Alf defines swans in such a way that they must be white: they are committed to calling a similar black creature something else. Perhaps this is a widely accepted definition that Willem is unaware of.
    • Alf has only seen British swans, and we should interpret their statement as ‘British swans are white’.
    • Alf believes that swans are white and so only samples large white birds to check that they are swans.
    • Alf genuinely and reasonably believes that the statement ‘all swans are white’ has been subjected to the widest scrutiny, but Willem has just returned from a new-found continent.

Even if Alf’s belief was soundly based on pragmatic induction, it would be prudent for him to revise his opinion, since his induction – of whatever kind – was clearly based on too small an epoch.

Analysis

We can split conventional induction into three parts:

  1. Modelling the data.
  2. Extrapolating, using the models.
  3. Considering predictions based on the extrapolations.

The final step is usually implicit in induction: it is usually supposed that one should always take an extrapolation to be a prediction. But there are exceptions. (Suppose that two airplanes are flying straight towards each other. A candidate prediction would be that they would pass infeasibly close, breaching the aviation rules that are supposed to govern the airspace. Hence we anticipate the end of the current ‘straight and level’ epoch and take recourse to a ‘higher’ epoch, in this case the pilots or air-traffic controllers. If they follow set rules of the road (e.g. planes flying out give way) then we may be able to continue extrapolating within the higher epoch, but here we only consider extrapolation within a given epoch.)

Thus we might reasonably imagine a process somewhat like:

  1. Model the data.
  2. Extrapolate, using the models.
  3. Establish predictions:
    • If the extrapolations all agree: take them to be a candidate prediction.
    • Otherwise: Make a possibilistic candidate prediction; the previous ‘state’ has ‘set up the conditions’ for the possibilities.
  4. Establish credibility:
    1. If the candidate predictions are consistent with the epoch, then they are credible.
    2. If not, note lack of credibility.
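
For illustration only, this process might be sketched in code, with a toy ‘constant step’ hypothesis class and an epoch given as a predicate (all names here are invented):

    class Hypothesis:
        """Toy hypothesis: 'the sequence continues in constant steps of size step'."""
        def __init__(self, step):
            self.step = step
        def fits(self, data):
            return all(b - a == self.step for a, b in zip(data, data[1:]))
        def extrapolate(self, data):
            return data[-1] + self.step

    def predict(data, hypotheses, in_epoch):
        models = [h for h in hypotheses if h.fits(data)]         # 1. model the data
        extrapolations = {h.extrapolate(data) for h in models}   # 2. extrapolate
        if len(extrapolations) == 1:                             # 3. all agree?
            candidate = extrapolations.pop()
            values = {candidate}
        else:                                                    #    possibilistic
            candidate = extrapolations
            values = extrapolations
        credible = bool(values) and all(in_epoch(v) for v in values)  # 4. credibility
        return candidate, credible

    print(predict([1, 2, 3, 4], [Hypothesis(s) for s in (1, 2)], lambda v: v < 100))
    # (5, True): only the step-1 hypothesis fits, and its extrapolation is credible.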

In many cases a natural ‘null hypothesis’ is that many elements of a hypothesis are independent, so that they can be extrapolated separately. There are then ‘holistic’ constraints that need to be applied across the whole. This can be done as part of the credibility check. (For example, airplanes normally fly independently but should not fly too close.)

We can fail to identify a credible hypothesis either because we have not considered a wide enough range of hypotheses or because the epoch has ended. The epoch may also end without our noticing, leading to a seemingly credible prediction that is actually based on a false premise. We can potentially deal with all these problems by considering a broader range of hypotheses and data. Induction is only as good as the data gathering and theorising that supports it. 

Complicatedness

The modelling process may be complicated in two ways:

  • We may need to derive useful categories so that we have enough data in each category.
  • We may need to split the data into epochs, with different statistics for each.

We need to have enough data in each partition to be statistically meaningful, while being reasonably sure that data in the same partition are all alike in terms of transition probabilities. If the parts are too large we can get averaged results, which need to be treated accordingly.

Induction and types of complexity

We can use induction to derive a typology for complexity:

  • simple unconditional: the model is given: just apply it
  • simple conditional: check the model and apply it
  • singly complicated: analyse the data in a single epoch against given categories to derive a model, apply it.
  • doubly complicated: analyse the data into novel categories or epochs to derive a model, apply it.
  • complex: where the data being observed has a reflexive relationship with any predictions.

The Cynefin framework gives a simple – complicated – complex – chaotic sense-making typology that is consistent with this, save that it distinguishes between:

  • complex: we can probe and make sense
  • chaotic: we must act first to force the situation to ‘make sense’.

We cannot make this distinction yet, as we are not sure what ‘makes sense’ would mean. It may be that one can only know that one has made sense when and if one has had a successful intervention, which will often mean that ‘making sense’ is more of a continuing activity than a state to be achieved. But inadequate theorising and data would clearly lead to chaos, and we might initially act to consider more theories and to gather more data. But it is not clear how we would know that we had done enough.

See also

Statistics, pragmatic, Cynefin, mathematics.

David Marsay