Uncertainty and Probability

An attempt to explain uncertainty, starting from numeric probability. It identifies various types of uncertainty, but they get rationalised towards the end.

Work in progress.

The need for an understanding of uncertainty and probability

We often have to act without absolute certainty. We therefore need a theory of uncertainty, so that we can express and take account of it. The term probability is often used for a one-dimensional indicator of uncertainty, as if the appropriate representation of uncertainty was just a number. The theory of such a probability is well developed, and according to the principle of probabilistic determinism ‘rational’ decisions should only take account of this numeric component of uncertainty, but its universal applicability is contentious.

Doubts that probability is ‘just a number’

Probability as belief, mathematics or science

From a mathematical point of view probability theory, like any theory, only shows that certain statements (theorems) follow from certain assumptions (axioms). No theory can assert its own assumptions. Hence the use of probability theory ought to be conditioned on the assumption that its axioms are valid, principally that uncertainty can be represented by just a number. Thus a method based on numbers that uses mathematical operators and ‘looks mathematical’ is computational, but is not strictly mathematical unless it is founded on some theory, and hence is conditional on its axioms.

A scientific approach might be thought to be able to justify the axioms, but there are two problems:

  • Science can only test theories within set contexts, so cannot establish universality.
  • The scientific method depends on the notion of probability.

Thus a conventional scientific study would depend on the theory that it is trying to prove (‘circular reasoning’) and would depend on being able to cover examples of all kinds of uncertainty, a challenge particularly when most scientists have been trained to suppose that there is only one kind (as measured by a number).

Probability paradoxes

The so-called Ellsberg paradox casts doubt on the principle of probabilistic determinism:

Suppose that I have two urns, both containing 10 balls that are either black or white and otherwise identical. The first urn has equal numbers of balls of each colour. We do not know the mix in the second urn. A ball is to be drawn at random from each urn.  Conventionally, they should both have a probability of 0.5 of being black, so that one should be indifferent between which of them is used for a decision based on the colour of the ball. But people tend to prefer the urn which is known to be fair. 

The Allais paradox is similar. These paradoxes show that many people are irrational, in the sense of not conforming to the principle of probabilistic determinism, but it is usually supposed that it is the people who are wrong, not the principle. This may be because there is no widely understood alternative, and (it is supposed) one must have accepted principles.

It is uncontroversial that P(Black)=0.5 in the first case, and one can be almost certain of this, so that the meta-probability is approximately 1.0. Similarly,
          P(Black|Equal numbers of balls)=0.5 in the second case.
But the  unconditional probability, P(Black), is much less clear, and may perhaps merit a lower meta-probability. If we take account of the meta-probability, a preference for a high meta-probability is not paradoxical. Similarly, we may think it preferable to use a coin that is known to be approximately fair than one of unknown provenance. This might explain away the supposed paradoxes, but to get to some general guidance we need to develop the theory further.
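The contrast between the two urns can be made concrete with a small simulation. This is an illustrative sketch, not part of the original argument: the function names, the trial count and the choice of mixes are mine.

```python
import random

def draw_known_urn():
    # Urn 1: exactly 5 black and 5 white balls, so P(Black) = 0.5 exactly.
    return random.random() < 0.5

def draw_unknown_urn(mix):
    # Urn 2: 'mix' black balls out of 10; the mix is unknown to the chooser.
    return random.random() < mix / 10

trials = 10_000
known = sum(draw_known_urn() for _ in range(trials)) / trials
print('known urn:', round(known, 2))  # close to 0.5, with high meta-probability

# For the unknown urn the long-run frequency depends entirely on the
# hidden mix: the '0.5' comes from symmetry, not from knowledge.
for mix in (0, 5, 10):
    freq = sum(draw_unknown_urn(mix) for _ in range(trials)) / trials
    print('mix', mix, ':', round(freq, 2))
```

Whatever symmetry argument assigns 0.5 to the unknown urn, its long-run frequency is fixed by the hidden mix; the preference for the known urn is a preference for the case where probability and frequency are known to agree.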

Technicalities

There are many flavours of probability theory, each of which seems natural to its advocates. Any discussion of alternatives will seem unduly complicated if it seeks to compare and contrast the alternative with these many flavours.

Some flavours of probability only apply to ‘events’, others to ‘statements’, including but not limited to statements about events. But if we imagine a world with some sort of falsification mechanism, then the falsification of a statement would be an event. Thus a notion of probability that applies to events applies to the falsification of statements, and hence ‘the probability of a statement having been falsified’ (or not) is meaningful, whereas the probability of a statement being ‘true’ may not be meaningful. Similarly, if we have a meaningful notion of decision-making, we can ‘operationalise’ both the notion of the probability of a decision and the notion of the probability of a decision being ‘effective’ against some given criterion.

More generally, we can extend the event-based notion of probability to any virtual world that we can – at least in principle – simulate. This is enough for the purposes of these notes.

From probability to probability of probability

If probabilities exist, what about probabilities of probabilistic statements?

Bootstrapping numeric probability

If ‘the probability of X’ is defined for all statements X then it is also defined when X(P|Y) is the statement “the probability rule P is appropriate in situation Y”, for some suitable notion of appropriateness. Informally, if one has a universal probability rule of the kind that is normally asserted to exist, then one can define:

 Degree of certainty = probability × the probability that the probability rule is appropriate.

More formally,

D(S|Y) = (P(S) , P(“P is appropriate when applied to S in situation Y”)).

These may be called the probability and meta-probability components of the degree of certainty. If P( ) is well-behaved mathematically, then so is D( ). If the probability P( ) is universally defined, then so is D( ). But it is not obvious what the basis would be for assigning a number to the meta-probability.

Conventional probability as a special case

In effect, the usual (‘pragmatic’) assertion is that, having made an estimate of the probability component, we must be certain of its correctness. This is implied by the principle of probabilistic determinism, for example. Some people, like me, are irrational to the extent that we may regard even our best estimates as being of doubtful reliability in circumstances in which we have a weak track record.

Initial Probability Extensions

The statement ‘P(Black|Equal numbers of balls)=0.5’ is a tautology and hence has meta-probability 1. But to say that something has probability 1 only implies that it is ‘almost always’ true, not that it is unconditionally true. Hence we need to distinguish between the real number 1.0 (almost always) and the integer 1 (always), and similarly between 0.0 and 0. 
Now suppose I have not yet put the balls in the urn. What is the probability that a black ball will be drawn at random from the urn? What if the balls that I am going to put into the urn depend on your answer? Is the conventional concept of probability appropriate in this case? Perhaps we should allow a probability to be ‘u’ for undefined.

Now, for tautologies we have a probability of 1, for tautological probabilities we can have a meta-probability of 1 and for problematic probabilities we can have a value of ‘u’. The extension of the usual axioms of probability theory is straightforward.
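As a hypothetical sketch of how such extended values might behave computationally, one could represent ‘always’ and ‘never’ by the integers 1 and 0, ordinary probabilities (including 1.0 and 0.0) by floats, and ‘undefined’ by 'u'. The combination rule below is my own illustration, for independent conjuncts only:

```python
def p_and(p, q):
    """Conjunction of independent extended probabilities: the integers
    0 and 1 mean 'never'/'always', floats are ordinary probabilities,
    and 'u' means undefined."""
    # Logical impossibility (integer 0) dominates even an undefined term.
    if p == 0 and isinstance(p, int):
        return 0
    if q == 0 and isinstance(q, int):
        return 0
    # Otherwise an undefined term makes the conjunction undefined:
    # note that probability 0.0 ('almost never') does not dominate.
    if p == 'u' or q == 'u':
        return 'u'
    return p * q

print(p_and(0, 'u'))    # → 0 : impossibility dominates
print(p_and(1, 'u'))    # → u : necessity does not resolve the other term
print(p_and(0.5, 0.5))  # → 0.25
```

The asymmetry between the integer 0 and the float 0.0 in this sketch is exactly the ‘always’ versus ‘almost always’ distinction above.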

Initial Meta-probability Extensions

There seem good grounds to doubt that the meta-probability is always 1 (or 1.0), and so we need to distinguish between them. If B is a belief and P(X|B) makes sense, then we might make the conditionality implicit, so that we have P(X) with the dependence on B part of the meta-probability. Thus the more assumptions a probability depends on, the ‘lower’ the meta-probability. These assumptions or caveats accumulate, and yield modified axioms for a probability theory more appropriate to meta-probability.

Logical Probability

For urns and fair coins we can sometimes say with meta-probability 1 what the ‘true’ probability is. Let U(n,m) be an urn with n white and m black balls from which we are to draw a ball ‘at random’. Then

P(Black|U(n,m))=m/(n+m) (with meta-probability 1).

We might say that for the second urn

P(Black) = ‘u’.

If we drew out two balls, then we would also have

P(Black,Black) = ‘u’,

but this conceals the fact that:

P(Black,Black) ≤ P(Black), for example

Suppose now that we draw out a ball and see that it is black and weighs 100g, and then replace it, noting that the balls together weigh just over 1000g. Then it is natural to say that:

P(Black) ≥ 0.1, or that P(Black) = p, where p ≥ 0.1.

That is, we suppose that there is a conventional objective (‘aleatory’) probability on which we put subjective bounds, based on what we ‘know’. This is the approach of Boole. It treats probability algebraically rather than arithmetically. Such analysis puts bounds on the possible range of probabilities, which is more informative than ‘u’. In some cases (as above) probabilities depend on themselves, reflexively. Such probabilities might either be left as ‘u’, or conditions that break the cycles might be introduced, yielding conditional probabilities.

Precise probabilities, such as P(Y|X)=p, fit all the rules of normal (aleatory) probability, such as P(X&Y)=P(Y|X).P(X). These induce similar rules for the bounds.
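A minimal sketch of this algebraic treatment represents each probability as a (lower, upper) pair; the induced rule for the product is my illustration, valid because both factors lie in [0, 1], and the figures echo the weighing argument above assuming independent draws with replacement:

```python
def interval_product(p, q):
    """Bounds on a product of two probabilities, each known only to lie
    in an interval (lo, hi). Endpoint products suffice because the
    product is monotone in each factor over [0, 1]."""
    return (p[0] * q[0], p[1] * q[1])

# From the weighing example, P(Black) is only known to lie in [0.1, 1.0].
p_black = (0.1, 1.0)
# Two independent draws with replacement then give bounds on
# P(Black, Black), consistent with P(Black, Black) ≤ P(Black).
print(interval_product(p_black, p_black))
```

The resulting interval [0.01, 1.0] is wide, but unlike ‘u’ it records what is known.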

Statistical Probability

If one has a sample of which a subset of size n satisfies a condition X, of which a further subset of size m satisfy a further condition Y, one might say that P(Y|X) = m/n. A statistical estimate of ‘P(Y|X) = p’ would most naturally be interpreted as having come from such a sample. Such an estimate is in some sense more ‘accurate’ if it comes from a larger sample.

If P(X&Y), P(Y|X) and P(X) are all derived from the same sample then one has
          P(X&Y)=P(Y|X).P(X).
More generally, if all the probabilities are derived from the same sample, then they fit all the normal rules. Otherwise, they may not.

Statistics are often regarded as an indication of some underlying proportion or aleatory probability, as when a proportion of a population is randomly sampled. Thus it is natural to associate some indication of the ‘accuracy’ of the statistic. One way to do this is to give a range, such as a confidence interval, but there are alternatives.
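As one standard way of indicating such ‘accuracy’ (my illustration, using the textbook normal approximation rather than anything specified in the text):

```python
import math

def estimate_with_ci(m, n, z=1.96):
    """Sample proportion m/n with a normal-approximation 95% confidence
    interval, clipped to [0, 1]."""
    p = m / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))

# The same proportion is a far 'tighter' claim from a larger sample:
print(estimate_with_ci(6, 10))      # wide interval
print(estimate_with_ci(600, 1000))  # narrow interval
```

The point estimate is the same in both cases; only the range, a crude kind of meta-information, distinguishes them.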

Is numeric probability enough for decision-making?

According to the principle of probabilistic determinism, we should ignore any wider uncertainty and take account only of the numeric probability. Various justifications have been given for this.

  • Most justifications simply make assumptions that logically imply that what is wanted is just a number and then justify the other axioms.
  • Less formal arguments essentially show that it would be convenient if uncertainty could be represented by just a number, but do not show that what is convenient is always effective.
  • Other arguments deal with special cases, as in gambling, without considering the rest.

If we regard the principle of probabilistic determinism as not quite certain then what we want is an indication of the lack of certainty in a proposition (e.g., that an event can happen), suitable to inform decision-making. Is this just a number?

One justification for the conventional view considers gambling situations and supposes that our aim is to maximise our gain in the long run. If we know the true (aleatory) probabilities then making many small bets based on them gives a yield that tends to the best expected value. We can adapt this theory to imprecise probabilities (as in the Ellsberg example). For any given betting strategy there will be a range of possible expected values, and our yield will tend to some value within this range. The principle of probabilistic determinism implies that the imprecise probabilities should – somehow – be reduced to precise probabilities which are then used for betting. But in the Ellsberg example (above) this could lead to a loss: it may be better not to gamble.
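The worst-case comparison can be sketched as follows; the even-money stakes are illustrative:

```python
def expected_value_range(p_win, stake, payout):
    """Range of expected values of a bet when the win probability is
    only known to lie in an interval (lo, hi)."""
    lo_p, hi_p = p_win
    ev = lambda p: p * payout - (1 - p) * stake
    return ev(lo_p), ev(hi_p)

# Betting on Black from the unknown urn, at even money:
print(expected_value_range((0.0, 1.0), stake=1, payout=1))  # worst case -1
# Betting on Black from the known fair urn:
print(expected_value_range((0.5, 0.5), stake=1, payout=1))  # exactly 0
```

A cautious decision-maker comparing worst cases prefers the known urn, or declines to bet at all; reducing the interval to a single ‘0.5’ hides exactly this.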

In general, if there is a ‘do-nothing’ option whose outcomes are satisfactory to a high probability and meta-probability then one may wish to take account of what could go wrong, even if the measure is not probabilistic. The range of possible probabilities and the meta-probabilities matter.

Attitudes to risk

A “neutral” attitude to risk would be to act on our probability estimate, ignoring the meta-probability. But is a decision-maker well-informed if they are not warned of a poor meta-probability?

A cautious attitude to risk would be to consider the worst-case, as above. This may be the lowest reasonable probability estimate for a gain or the highest reasonable probability estimate for a loss.  Thus useful meta-probabilities would be “lowest reasonably possible” and “highest reasonably possible”.

The principle of probabilistic determinism assumes that the decision-maker has a neutral attitude to risk, which implies either that they are taking a short-term view of the situation, or that they are in a stable (‘ergodic’) situation where probability estimates can be relied upon. It may sometimes be preferable to provide decision-makers with more informative reports on uncertainty, and allow them to apply whatever attitude to risk they deem appropriate. This leads to the following notions, slightly more complicated than simple numbers.

Normalising probabilities

Precise probabilities

While we may regard Boole’s notion of probability as superior to Bayes’ numeric notion, it is nonetheless helpful to decision-makers to have relatively simple probabilities. These may help them rule out options without getting into the complications of Boole, which can then be reserved for a few options and confounding factors.

Bayesian probabilities are normally regarded as subjective. Given a set of relatively objective constraints on probabilities, it is usual to generate a set of coherent numeric probabilities by using principles such as the principle of indifference, making whatever assumptions are needed to resolve ambiguity, but this can be controversial.

A less subjective way to turn Boolean probabilities into Bayesian one is as follows. First, identify the maximal partition of the space of interest. Determine the sums over the partition of the upper and lower probabilities. These will be at least 1.0 and at most 1.0 respectively.  Determine the interpolation factor, λ, that corresponds to a sum of 1.0. Use λ to interpolate between the bounds for each member of the partition. These probabilities then sum to 1.0. Use the normal probability rules to generate probabilities for composite sub-sets.

This yields a probability distribution that in some sense represents the original, Boolean, distribution. It is not guaranteed to satisfy all the constraints of the original distribution, but for classes of combinatorial and statistical distributions, such as those considered here, it will. If not, one will need to identify a solution that is not too close to any extreme, perhaps using a least-squares method. In principle this could require an arbitrary (subjective) tie-break, but in practice this may not be a problem.
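The interpolation step can be sketched as follows; the three-cell partition and its bounds are invented for illustration:

```python
def normalise_bounds(bounds):
    """Interpolate between the lower and upper probability bounds on a
    partition so that the results sum to 1, as described in the text."""
    lo = sum(b[0] for b in bounds)
    hi = sum(b[1] for b in bounds)
    if hi == lo:                       # bounds already precise
        lam = 0.0
    else:
        lam = (1.0 - lo) / (hi - lo)   # lo <= 1.0 <= hi, so 0 <= lam <= 1
    return [l + lam * (u - l) for l, u in bounds]

# Three-cell partition with interval-valued probabilities:
probs = normalise_bounds([(0.1, 0.4), (0.2, 0.5), (0.1, 0.7)])
print(probs, sum(probs))
```

Since the lower bounds sum to at most 1.0 and the upper bounds to at least 1.0, λ lies between 0 and 1 and each interpolated value stays within its own bounds.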

Individual probabilities

If I satisfy criterion X and it is known that P(Z|X)=p then I may think that P(Z)=p for me. But I may also satisfy Y with P(Z|Y)=q. As Fisher pointed out, this is a problem for statisticians as well as for people trying to interpret statistics.

In effect, many people suppose that if it has been stated that  P(Z|X)=p then P(Z|X&Y)≈p for all Y. But the conventional approach to probability requires us to give the Bayesian average probability for a set that may cover markedly different probabilities.

One way around this would be to develop a language that distinguishes between a probability that – as far as is known – is reasonably homogeneous, and one that is simply an average. This indication would be a kind of meta-probability.

Perhaps better would be to insist that the bounds on the probability for a composite set are wide enough to cover all known subsets. Thus we might require that:

P(Z|X) (as an interval) contains  P(Z|Y), whenever X contains Y.

This can be achieved by using a partition (as before) and summing intervals. This would not ensure that the ‘true’ probability for an individual would be within the stated constraints, since there could always be some important but rare unidentified factor. But it would ensure that it was not known that the individual probability was outside the claimed range. The meta-probability for an individual might then consider how ‘typical’ they were and how representative of them the sampling process was.
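One simple construction with this covering property, taking the envelope of the known subgroup intervals, can be sketched as follows (a conservative stand-in for the interval summation described above; the figures are invented):

```python
def covering_interval(sub_intervals):
    """Bounds for P(Z|X) wide enough to contain P(Z|Y) for every known
    subset Y of X, by taking the envelope of the subset intervals."""
    return (min(lo for lo, _ in sub_intervals),
            max(hi for _, hi in sub_intervals))

# Two subgroups of X with markedly different risk:
print(covering_interval([(0.01, 0.02), (0.30, 0.40)]))  # (0.01, 0.4)
```

The wide result is the point: an individual is guaranteed not to be known to fall outside it, whereas a single averaged number could be badly wrong for either subgroup.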

In effect, by applying a probability P(Z|X)=p to an individual, x, one is assuming that x is typical of X, as if they had been randomly selected. If one knows that x also satisfies Y, and that this is rare for those in X, then this is an unreasonable assumption. By normalising the bounds as above, one ensures that the bounds are reasonable for x, whatever conditions they may have. If one wants more precise bounds then one must identify their conditions and apply the appropriate probabilities. (In practice one may ignore some very rare conditions and note this in the meta-probability.)

Some aspects of uncertainty

Ranges and conditions

Explicitly considering all possible probability distributions may not be efficient. It is common to consider only extreme and typical (“expected”) cases, but this is not always effective: it may be that a course of action is clearly superior but that one cannot tell this by considering only the extremes, because the calculus can only give bounds, losing precision with computation. Thus one often needs to split the main context into sub-contexts, recursively, until it is clear which options are best.

If there is only one definite context, if the decision-maker has already declared which context should be assumed, and if that context determines the probabilities, then the conventional approach, of reporting just a numeric probability, is reasonable. But more generally the full ‘degree of certainty’ is often infeasibly complicated, so that the decision-maker would need to be involved in determining which elements of the degree of certainty needed to be resolved. Nonetheless it may be possible to provide a useful summary of the ‘degree of certainty’. This could focus on summarising the likely contexts and their implications for the probabilities, particularly the expectations for the activities being considered.

Ranges of probabilities

Suppose that the Pi() are credible for i in some index set I. Then for a statement S, i→Pi(S) gives the credible probabilities, which may be compared to give the extremes. Again, this is as well defined as the component probabilities and inherits their calculus. It supports both neutral and cautious attitudes to risk. Each probability can be given an associated meta-probability.

We can develop this approach further by associating with each index extra meta-data:

  • A description such as “random”, “con-man”. This could also be represented as a conditional probability, such as P(X|”All above-board”), P(X|”Trick”).
  • The evidence or ‘weight of evidence’ for / against. Thus one might say “Established business” or “Shifty-looking character”.

Thus we might normally take a situation at face value unless there is some specific evidence, setting the bounds on what we think reasonable in the circumstances.

Dependence on context and evidence

Although it is conventional to use the notation P(S) or P(S|R) for a probability, the actual probability depends on the context, C. We may regard the factors in the meta-probability as being part of the context. Thus we are concerned with P(S:C), to use the full notation. If the specific context is an unknown c, a variable within a larger context C, then for evidence E we have
{(P(S|c:C), P(E|c:C))}c (the probability in each context, together with the likelihood of the available evidence in that context).
The conventional approach would be to consider a prior probability P(c:C) to derive a posterior probability P(c|E:C) and hence an averaged P(S:C). But this assumes that the decision-maker is risk-neutral and pre-empts the choice of priors.

If we are to allow the decision-maker to choose the priors and the attitude to risk, then we need the component probabilities and likelihoods. Now, the contexts define the probabilities. If the decision-maker already knows these, then perhaps all that they need are the likelihoods. For a well-tested coin one would say that the likelihood that it was fair was uniquely high. For a new coin introduced by a possible trickster one might say that all options were equally likely (fair, double-headed, … ). Likelihoods are a form of probability. They are less controversial in the sense that the context, c, needs to be specified sufficiently for the likelihood to be well-defined. They do not need to be universal.
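The coin examples can be sketched numerically; the contexts and toss counts are illustrative:

```python
def likelihood(heads, tails, p_heads):
    """Likelihood of an observed run of tosses under a hypothesised
    per-toss probability of heads."""
    return (p_heads ** heads) * ((1 - p_heads) ** tails)

contexts = {'fair': 0.5, 'double-headed': 1.0, 'double-tailed': 0.0}

# A well-tested coin: 50 heads in 100 tosses. Only the fair context
# has a non-zero likelihood, so it is 'uniquely high'.
for name, p in contexts.items():
    print(name, likelihood(50, 50, p))

# A new coin from a possible trickster that has shown a single head:
# fair and double-headed remain live options.
print({name: likelihood(1, 0, p) for name, p in contexts.items()})
```

Note that each likelihood is conditional on its own fully specified context; no universal prior over the contexts is needed to compute them.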

Novel Extensions

The above types of non-numeric probabilities are at least implicit in Keynes & Good but have not – to my knowledge – been rationalised.

There appears to be at least an 8-fold relationship, which we may attempt to represent as follows:

P(A|B:C) ≤ P(D|E:F) :(G,H).

The notation P( A | B : C ) is as used by Good, with P( A | B ) being the conventional conditional probability, with the last term, C, being the context. The relation P(A|B:C) ≤ P(D|E:F) is then interpreted as in Boole’s approach, but subject to (G,H). G is a condition (such as ‘X is not a cheat’) while H is any ‘workings’ needed to inform the use of the inequality. For example H could be ‘This inequality was assessed by Jim’ or ‘This inequality follows by the … law and inequalities … and …. ‘.

It remains to develop axioms for these – extending the usual ones – and to show that they capture a usefully broader range of uncertainties that can inform a usefully broader range of decisions.

Empirical application

Many people seek a probability that applies to something previously unknown, such as that the probability that the next toss of a coin will be ‘Heads’ is 0.5. But mathematics is never so unconditional. All one can say is that ‘if things carry on as they are then …’. Hence mathematical probability is about what has been observed. One can say that the observed data is consistent with having been produced by a fair roulette wheel, and perhaps that with so much data no significantly different aleatory model is credible. But one cannot say that the wheel ‘really’ is fair, or that if it is fair no-one will meddle with it.

Conclusion

There is no reason to suppose that a probability that is just a number is always, even in principle, appropriate, or, even if it is, that it is practical. Given the choice of a fair coin or an unknown coin, it seems a reasonable strategy to choose the coin that is known to be fair. Over many cases this will give a fair and reasonably predictable outcome. Choosing an unknown coin risks being hoodwinked. One might learn one’s lesson for coins, but a so-called risk-neutral strategy, consistently applied, risks being a consistent loser, for example in a competitive environment. The use of some richer representation of uncertainty, as discussed above, may be preferable. If we have a default safe action it may be sufficient just to say that we don’t know what the probabilities are for other choices.

Dave Marsay
