## Which Mathematics of Uncertainty for Today’s Challenges?

This is a slight adaptation of a technical paper presented to an IMA conference 16 Nov. 2009, in the hope that it may be of broader interest. It argues that ‘Knightian uncertainty’, in Keynes’ mathematical form, provides a much more powerful, appropriate and safer approach to uncertainty than the more familiar ‘Bayesian (numeric) probability’.

# Issues

## Conventional Probability

There are gaps in the capability to handle both inherent uncertainty and rapid change.

Keynes et al suggest that there is more to uncertainty than random probability. We seem to be able to cope with high volumes of deterministic or probabilistic data, or low volumes of less certain data, but to have problems at the margins. This leads to the questions:

• How complex is the contemporary world?
• What is the perceptual problem?
• What is contemporary uncertainty like?
• How is uncertainty engaged with?

## Probability arises from a definite context

Objective numeric probabilities can arise through random mechanisms, as in gambling. Subjective probabilities are often adequate for familiar situations where decisions are short-term, with only cumulative long-term impact at worst. This is typical of the application of established science and engineering, where one has a kind of ‘information dominance’ and there are only variations within an established frame or context.

## Contexts

Thus (numeric) probability is appropriate where:

• Competition is coherent and takes place within a stable, utilitarian, framework.
• Innovation does not challenge the over-arching status quo or ‘world view’.
• We only ever need to estimate the current parameters within a given model.
• Uncertainty can be managed. Uncertainty about estimates can be represented by numbers (probability distributions), as if they were principally due to noise or other causes of variation.
• Numeric probability is multiplied by value to give a utility, which is optimised.
• Risk is only a number, negative utility.

Uncertainty is measurable (in one dimension) where one has so much stability that almost everything is measurable.

## Probability Theory

Probability theories typically build on Bayes’ rule [Cox] :

P(H|E) = P(H).(P(E|H)/P(E)),

where P(E|H) denotes the ‘likelihood’, the probability of evidence, E, given a hypothesis, H. Thus the final probability is the prior probability times the ‘likelihood ratio’.

The key assumptions are that:

• The selection of evidence for a given hypothesis, H, is indistinguishable from a random process with a proper numeric likelihood function, P( · |H).
• The selection of the hypothesis that actually holds is indistinguishable from random selection from a set {H_i} with ‘priors’ P(H_i) – that can reasonably be estimated – such that
• P(H_i ∩ H_j) = 0 for i ≠ j (non-intersection)
• P(∪_i H_i) = 1 (completeness).

It follows that P(E) = Σ_i P(E|H_i).P(H_i) is well-defined.
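As a rough illustration of how these assumptions are used, the following Python sketch applies Bayes’ rule over a set of mutually exclusive and exhaustive hypotheses. The function name and the numbers are hypothetical, chosen only to show the arithmetic.

```python
def bayes_update(priors, likelihoods):
    """Return posteriors P(H_i|E) from priors P(H_i) and likelihoods P(E|H_i).

    Assumes the hypotheses are mutually exclusive and exhaustive, so that
    P(E) = sum_i P(E|H_i).P(H_i) is well-defined.
    """
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1 (completeness)")
    p_e = sum(likelihoods[h] * priors[h] for h in priors)
    if p_e == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: priors[h] * likelihoods[h] / p_e for h in priors}


# Hypothetical numbers, for illustration only.
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihoods = {"H1": 0.1, "H2": 0.4, "H3": 0.7}
posteriors = bayes_update(priors, likelihoods)
```

If either assumption fails (the hypotheses overlap, or the true one is missing from the set), the normalising term P(E) is no longer meaningful and neither are the posteriors.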

H may be composite, so that there are many proper sub-hypotheses, h ⇒ H, with different likelihoods, P(E|h). It is then common to use the Bayesian likelihood,

P(E|H) = ∫_{h ⇒ H} P(E|h).dP(h|H),

or

P(E|H) = P(E|h), for some representative hypothesis h.

In either case, hypotheses should be chosen to ensure that the expected likelihood is maximal for the true hypothesis.

Bayes noted a fundamental problem with such conventional probability: “[Even] where the course of nature has been the most constant … we can have no reason for thinking that there are no causes in nature which will ever interfere with the operations of the causes from which this constancy is derived.”

# Uncertainty in Contemporary Life

## Uncertainty arises from an indefinite context

Uncertainty may arise through human decision-making, adaptation or evolution, and may be significant for situations that are unfamiliar or for decisions that may have long-term impact. This is typical of the development of science in new areas, and of competitions where unexpected innovation can transform aspects of contemporary life. More broadly still, it is typical of situations where we have a poor information position or which challenge our sense-making, and where we could be surprised, and so need to alter our framing of the situation, for example where others can be adaptive or innovative and hence surprising.

## Contexts

• Competitions, cooperations, collaborations, confrontations and conflicts all nest and overlap messily, each with their own nature.
• Perception is part of multiple co-adaptations.
• Uncertainty can be shaped but not fully tamed. Only the most careful reasoning will do.
• Uncertainty and utility are imprecise and conditional. One can only satisfice, not optimise.
• Critical risks arise from the unanticipated.

## Likelihoods, Evidence

In Plato’s Republic the elite make the rules, which form a fixed context for the plebs. But in contemporary life the rulers only rule with the consent of the ruled. In so far as the rules of the game ‘cause’ (or at least influence) the behaviour of the players, the participants have reason to interfere with causes, and in many cases we expect it: it is how things get done. J.M. Keynes and I.J. Good (under A.M. Turing) developed techniques that may be used for such ‘haphazard’ situations, as well as random ones.

The distinguishing concepts are the law of evidence, generalized weight of evidence (woe) and iterative fusion.

If a datum, E, has a distribution f(·) over a possibility space, then for all distributions g(·) over the same space,

∫ log(f(E)).f(E) ≥ ∫ log(g(E)).f(E).

That is, the entropy of f is no more than its cross-entropy with g. For a hypothesis H in a context, C, such that the likelihood function g = P_{H:C} is well-defined, the weight of evidence (woe) due to E for H is defined to be:

W(E|H:C) ≡ log(P_{H:C}(E)).

Thus the ‘law of evidence’: that the expected woe for the truth is never exceeded by that for any other hypothesis. (But the evidence may indicate that many or none of the hypotheses fit.) For composite hypotheses, the generalized woe is:

W(E|H:C) ≡ sup_{h ⇒ H} {W(E|h:C)}.

This is defined even for a haphazard selection of h.
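The law of evidence and the generalized woe can be checked numerically on a small discrete example. The sketch below is only illustrative: the distributions are hypothetical, and a composite hypothesis is represented by a finite list of sub-hypotheses, so that the supremum becomes a maximum.

```python
import math


def woe(likelihood, datum):
    """Weight of evidence W(E|H:C) = log P_{H:C}(E)."""
    return math.log(likelihood[datum])


def expected_woe(truth, likelihood):
    """Expected woe for `likelihood` when the data actually follow `truth`."""
    return sum(p * woe(likelihood, e) for e, p in truth.items() if p > 0)


def generalized_woe(sub_likelihoods, datum):
    """Generalized woe of a composite hypothesis: the largest woe among
    its (finitely many, in this sketch) sub-hypotheses."""
    return max(woe(likelihood, datum) for likelihood in sub_likelihoods)


# Hypothetical distributions over three possible data values.
truth = {"a": 0.5, "b": 0.3, "c": 0.2}
rival = {"a": 0.4, "b": 0.4, "c": 0.2}

law_holds = expected_woe(truth, truth) >= expected_woe(truth, rival)
composite_woe = generalized_woe([truth, rival], "b")
```

Note that for the single datum "b" the rival has the higher woe, even though the truth has the higher expected woe: the law of evidence is a statement about expectations, not about every observation.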

Let d_s(·) be a discounting factor for the source, s [Good]. If one has independent evidence, E_s, from different sources, s, then typically the fusion equation is:

W(E|H:C,d_s) ≤ Σ_s {d_s(W(E_s|H:C))},

with equality for precise hypotheses. Together, generalized woe and fusion determine how woe is propagated through a network, where the woe for a hypothesis is dependent on an assumption which itself has evidence. The inequality forces iterative fusion, whereby one refines candidate hypotheses until one has adequate precision. If circumstantial evidence indicates that the particular situation is random, one could take full account of it, to obtain the same result as Bayes, or discount [Good].
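The inequality in the fusion equation can be seen in a small sketch. Here a composite hypothesis is a finite list of sub-hypotheses, each giving a likelihood function per source. The linear discount factor is an assumption made for this illustration and is not taken from Good; all names and numbers are hypothetical.

```python
import math


def joint_generalized_woe(sub_hypotheses, data):
    """Largest total woe of all the evidence over the sub-hypotheses,
    treating the sources as independent given each sub-hypothesis."""
    return max(
        sum(math.log(h[source][datum]) for source, datum in data.items())
        for h in sub_hypotheses
    )


def fused_woe(sub_hypotheses, data, discount=None):
    """Sum over sources of the (discounted) generalized woe for each source.

    `discount` maps a source to a factor in [0, 1]; scaling the woe linearly
    is an assumption of this sketch.
    """
    discount = discount or {}
    return sum(
        discount.get(source, 1.0)
        * max(math.log(h[source][datum]) for h in sub_hypotheses)
        for source, datum in data.items()
    )


# Two hypothetical sub-hypotheses, each with a likelihood per source.
h1 = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}
h2 = {"s1": {"x": 0.3, "y": 0.7}, "s2": {"x": 0.7, "y": 0.3}}
data = {"s1": "x", "s2": "x"}

composite_joint = joint_generalized_woe([h1, h2], data)
composite_fused = fused_woe([h1, h2], data)
precise_joint = joint_generalized_woe([h1], data)
precise_fused = fused_woe([h1], data)
```

For the composite hypothesis the fused value overstates the joint woe, because each source is allowed its own best sub-hypothesis. Refining the candidates towards precise hypotheses closes the gap, which is the motivation for iterating.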

In some cases it is convenient, as Keynes does, to use an interval likelihood or woe, taking the infimum and supremum of possible values. The only assumption is that the evidence can be described as a probabilistic outcome of a definite hypothesis, even if the overall situation is haphazard. In practice, the use of likelihoods is often combined with conjectural causal modelling, to try to get at a deep understanding of situations.
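An interval woe can be sketched in the same style. The example assumes a finite set of candidate likelihood functions, so that the infimum and supremum reduce to a minimum and maximum; the values are hypothetical.

```python
import math


def woe_interval(sub_likelihoods, datum):
    """Return (lowest, highest) woe for `datum` over the candidate
    likelihood functions of a composite hypothesis."""
    woes = [math.log(likelihood[datum]) for likelihood in sub_likelihoods]
    return min(woes), max(woes)


# Hypothetical candidates for a coin of uncertain bias.
candidates = [
    {"heads": 0.4, "tails": 0.6},
    {"heads": 0.5, "tails": 0.5},
    {"heads": 0.7, "tails": 0.3},
]
low, high = woe_interval(candidates, "heads")
```

The width of the interval is a simple indicator of how imprecise the hypothesis is with respect to this datum.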

# Examples

## Crises

[Figure: Typical crisis dynamics]

Above is an informal attempt to illustrate typical crisis dynamics, such as the financial crisis of 2007/8. It is intended to capture the notion that conventional probability calculations may suffice for long periods, but over-dependence on such classical constructs can lead to shocks or crises. To avoid or mitigate these, more attention should be given to uncertainty [Turner].

## An ambush

Uncertainty is not necessarily esoteric or long-term. It can be found wherever the assumptions of conventional probability theory do not hold, in particular in multilevel games. I would welcome more examples that are simple to describe, relatively common and where the significance of uncertainty is easy to show.

Deer need to make a morning run from A to B. Routes r, s, t are possible. A lion may seek to ambush them. Suppose that the indicators of potential ambushes are equal. Now in the last month route r has been used 25 times, s 5 times and t never, without incident. What is the ‘probability’ of an ambush for the 3 routes?

Let A=“The Lion deploys randomly each day with a fixed probability distribution, p”. Here we could use a Bayesian probability distribution over p, with some sensitivity analysis.
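As a sketch of what the Bayesian treatment under hypothesis A might look like, the following Python treats each route separately, with a Beta prior on its daily ambush probability. This per-route independence, the choice of prior and the function name are all simplifying assumptions for illustration. Varying the prior gives a crude sensitivity analysis.

```python
def posterior_ambush_probability(safe_runs, prior_a=1.0, prior_b=1.0):
    """Posterior mean ambush probability for a route after `safe_runs`
    runs without incident, under a Beta(prior_a, prior_b) prior."""
    return prior_a / (prior_a + prior_b + safe_runs)


# Runs in the last month, all without incident (from the example).
runs = {"r": 25, "s": 5, "t": 0}

estimates = {route: posterior_ambush_probability(n) for route, n in runs.items()}

# Crude sensitivity analysis: vary the strength of a symmetric prior.
sensitivity = {
    strength: {
        route: posterior_ambush_probability(n, strength, strength)
        for route, n in runs.items()
    }
    for strength in (0.5, 1.0, 2.0)
}
```

Under hypothesis A alone, every such prior ranks r as the safest route and t as the most dangerous, simply because r has the longest record. The ranking depends entirely on the Lion not adapting.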

But this is not the only possibility. Alternatively, let B =“The Lion has reports about some of our runs, and will adapt his deployments.” We could use a Bayesian model for the Lion, but with less confidence. Alternatively, we could use likelihoods.

Route s is intermediate in characteristics between the other two. There is no reason to expect an ambush at s that doesn’t apply to one of the other two. On the other hand, if the ambush is responsive to the number of times a route is used then r is more likely than s or t, and if the ambush is on a fixed route, it is only likely to be on t. Hence s is the least likely to have an ambush.

Consistently selecting routes using a fixed probability distribution is not as effective as a muddling strategy [Binmore] which varies the distribution, supporting learning and avoiding an exploitable equilibrium.

# Concluding Remarks

Conventional (numeric) probability, utility and rationality all extrapolate based on a presumption of stability. If two or more parties are co-adapting or co-evolving, any equilibria tend to be punctuated, and so a more general approach to uncertainty, information, communication, value and rationality is indicated, as identified by Keynes, with implications for ‘risk’.

Dave Marsay, Ph.D., C.Math FIMA, Fellow ISRS

# References:

Bayes, T. An Essay towards solving a Problem in the Doctrine of Chances (1763), Philosophical Transactions of the Royal Society of London 53, 370–418. Regarded by most English-speakers as ‘the source’.

Binmore, K, Rational Decisions (2009), Princeton U Press. Rationality for ‘muddles’, citing Keynes and Turing. Also http://else.econ.ucl.ac.uk/papers/uploaded/266.pdf .

Cox, R.T. The Algebra of Probable Inference (1961) Johns Hopkins University Press, Baltimore, MD. The main justification for the ‘Bayesian’ approach, based on a belief function for sets whose results are comparable. Keynes et al deny these assumptions. Also Jaynes, E.T. Probability Theory: The Logic of Science (1995) http://bayes.wustl.edu/etj/prob/book.pdf .

Good, I.J. Probability and Weighting of Evidence (1950), Griffin, London. Describes the basic techniques developed and used at Bletchley Park. Also Explicativity: A Mathematical Theory of Explanation with Statistical Applications (1977) Proc. R. Soc. Lond. A 354, 303-330, etc. Covers discounting, particularly of priors. More details have continued to be released up until 2006.

Hodges, A. Alan Turing (1983) Hutchinson, London. Describes the development and use of ‘weights of evidence’, “which constituted his major conceptual advance at Bletchley”.

Keynes, J.M. Treatise on Probability (1921), Macmillan, London. Fellowship essay, under Whitehead. Seminal work, outlines the pros and cons of the numeric approach to uncertainty, and develops alternatives, including interval probabilities and the notions of likelihood and weights of evidence, but not a ‘definite method’ for coping with uncertainty.

Smuts, J.C. The Scientific World-Picture of Today, British Assoc. for the Advancement of Science, Report of the Centenary Meeting. London: Office of the BAAS. 1931. (The Presidential Address.) A view from an influential guerrilla leader, General, War Cabinet Minister and supporter of ‘modern’ science, who supported Keynes and applied his ideas widely.

Turner, The Turner Review: A regulatory response to the global banking crisis (2009). Notes the consequences of simply extrapolating, ignoring non-probabilistic (‘Knightian’) uncertainty.

Whitehead, A.N. Process and Reality (1929: 1979 corrected edition) Eds. D.R. Griffin and D.W. Sherburne, Free Press. Whitehead developed the logical alternative to the classical view of uniform unconditional causality.

## ESP and significance

‘Understanding Uncertainty’ has a blog (‘uu blog’) on ESP and significance. The challenge for those not believing in ESP is an experiment which seems to show ‘statistically significant’ but mild ESP. This could be like a drug company that tests lots of drugs until it gets a ‘statistically significant’ result, but from the account it seems more significant than this.

The problem for an ESP atheist who is also a Bayesian is in trying to interpret the result of a significance test as a (subjective) probability that some ESP was present, as the above blog discusses. But from a sequential testing point of view (e.g. of Wald) we would simply take the significance as a threshold which stimulates us to test the conclusion. In typical science one would repeat the experiment and regard a failure to reproduce the result as significant. But with ESP the ‘aura’ of the experimenter or place may be significant, so a failure by others to replicate a result may simply mean that only sometimes is ESP shown in the experimental set-up. So what is a ‘reasonable’ acceptance criterion?

Jack Good discussed the issues arising from ESP in some detail, including those above. He developed the notion of ‘weight of evidence’, which is the log of the appropriate likelihood ratio. There are some technical differences to the approach of the ‘uu blog’. They offer some advantages.

If e is the evidence/data obtained from an experiment and h is a hypothesis (e.g. the null hypothesis) then P(e|h) denotes the likelihood, where P() is the (Bayesian) probability. To be well-defined the likelihood should be entailed by the hypothesis.

One problem is that the likelihood depends on the granularity with which we measure the data, and so – on its own – is meaningless. In significance testing one defines E(e) to be the set of all data that is at least as ‘extreme’ as e, and uses the likelihood P(E(e)|h) to determine ‘1-significance’. But (as in ‘uu blog’) what one really wants is P(¬h|e).

In this experiment one is not comparing one theory or model with another, but a statistical ‘null hypothesis’ with its complement, which is very imprecise, so that it is not clear what the appropriate likelihood is. ‘uu blog’ describes the Bayesian approach, of having prior distributions as to how great an ESP effect might be, if there is one. To me this is rather like estimating how many angels one could get on a pin-head. An alternative is to use Jack Good’s ‘generalized likelihood’. In principle one considers all possible theories and takes the likelihood of the one that best explains the evidence. This is then used to form a likelihood ratio, as in ‘uu blog’, or the log likelihood is used as a ‘weight of evidence’, as at Bletchley Park. In this ESP case one might consider subjects to have some probability of guessing correctly, varying the probability to get the best likelihood. (This seems to be about 52% as against the 50% of the null hypothesis.) Because the alternative to the null hypothesis includes biases that are arbitrarily close to the null hypothesis, one will ‘almost always’ find some positive or negative ESP effect. The interesting thing would be to consider the distribution of such apparent effects for the null hypothesis, and hence judge the significance of a result of 52%.
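The procedure just described can be sketched for a simple guessing experiment. The number of trials below (1000) is a hypothetical figure, since only the 52% rate is given above, and the function names are mine. The sketch finds the best-fitting guessing probability, the resulting gain in weight of evidence over the null, and how often a gain at least that large would arise under the null.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def log_likelihood(hits, trials, p):
    """Log-likelihood of a particular sequence with `hits` correct guesses."""
    return _xlogy(hits, p) + _xlogy(trials - hits, 1.0 - p)


def woe_gain(hits, trials, p_null=0.5):
    """Generalized woe for 'some bias' minus the woe for the null."""
    p_best = hits / trials  # the bias that best explains the data
    return log_likelihood(hits, trials, p_best) - log_likelihood(hits, trials, p_null)


def log_binomial_pmf(k, n, p):
    """Log-probability of exactly k hits in n trials (log-space for stability)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + log_likelihood(k, n, p)


def null_tail_probability(hits, trials, p_null=0.5):
    """Probability, under the null, of a woe gain at least as large as observed."""
    observed = woe_gain(hits, trials, p_null)
    return sum(
        math.exp(log_binomial_pmf(k, trials, p_null))
        for k in range(trials + 1)
        if woe_gain(k, trials, p_null) >= observed - 1e-9
    )


# Hypothetical experiment: 520 correct guesses in 1000 trials (52%).
gain = woe_gain(520, 1000)
tail = null_tail_probability(520, 1000)
```

The gain is always non-negative, which is the point made above: some apparent effect will almost always be found, so it is the distribution of the gain under the null that matters.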

This seems a reasonable thing to do, even though there may be many hypotheses that we haven’t considered and so our test is quite weak. It is up to those claiming ESP to put forward hypotheses for testing.

A difficulty of the above procedure is that investigators and journals only tend to report positive results (‘uu blog’ hints at this). According to Bayesians one should estimate how many similar experiments have been done first and then accept ESP as ‘probable’ if a result appears sufficiently significant. I’m afraid I would rather work the other way: assess how many experiments there would have to be to make an apparently significant result really significant, and then judge whether it was credible that so many experiments had been done. Even if not, I would remain rather cynical unless and until the experiment could be refined to give a more definite and repeatable effect. Am I unscientific?
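Working ‘the other way’ can be sketched as follows: given an apparent significance level, estimate how many independent experiments would be needed before a result at least that extreme becomes more likely than not by chance alone. The independence of the experiments and the 50% threshold are assumptions of this sketch.

```python
import math


def experiments_for_chance_hit(p_value, chance=0.5):
    """Smallest number of independent null experiments for which the
    probability of at least one result this extreme reaches `chance`."""
    if not 0.0 < p_value < 1.0 or not 0.0 < chance < 1.0:
        raise ValueError("p_value and chance must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - chance) / math.log(1.0 - p_value))


needed = experiments_for_chance_hit(0.05)
```

One can then ask whether it is credible that this many similar experiments have been run and left unreported.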

Dave Marsay

## Knightian uncertainty and epochs

Frank Knight was a follower of Keynes who emphasised and popularised the importance of ‘Knightian uncertainty’, meaning uncertainty other than Bayesian probability. That is, it is concerned with events that are not deterministic but also not ‘random’ in the same sense as idealised gambling mechanisms.

Whitehead’s epochs, when stable and ergodic in the short-term, tend to satisfy the Bayesian assumptions, whereas future epochs do not. Thus within the current epoch one has (Bayesian) probabilities, while in the longer term one has Knightian uncertainty.

### Example

Consider a Casino with imperfect roulette wheels with unknown biases. It might reasonably change the wheel (or other gambling mechanism) whenever a punter seems to be doing infeasibly well. Even if we assume that the current wheel has associated unknown probabilities that can be estimated from the observed outcomes, the longer-term uncertainty seems qualitatively different. If there are two wheels with known biases that might be used next, and if we don’t know which is to be used then, as Keynes shows, one needs to represent the ambiguity rather than being able to fuse the two probability distributions into one. (If the two wheels have equal and opposite biases then by the principle of indifference a fused version would have the same probability distribution as an idealised wheel, yet the two situations are quite different.)
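One way to see that the two situations differ is to look beyond a single spin. In the sketch below the size of the bias and the equal weighting of the two wheels are hypothetical. For one spin the fused distribution matches an ideal wheel, but for two spins on whichever wheel is actually in use it does not.

```python
BIAS = 0.1  # hypothetical bias towards (or away from) red

p_red_wheels = (0.5 + BIAS, 0.5 - BIAS)  # equal and opposite biases
weights = (0.5, 0.5)                     # principle of indifference

# Single spin: fusing the two wheels looks exactly like an ideal wheel.
fused_single = sum(w * p for p, w in zip(p_red_wheels, weights))

# Two reds in a row on the same (unknown) wheel.
two_reds_unknown_wheel = sum(w * p ** 2 for p, w in zip(p_red_wheels, weights))
two_reds_ideal_wheel = 0.5 ** 2

# Representing the ambiguity instead: an interval of possible probabilities.
p_red_interval = (min(p_red_wheels), max(p_red_wheels))
```

The excess over the ideal wheel is the square of the bias, so a single fused distribution discards exactly the information that lets a punter (or the Casino) learn which wheel is in play.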

### Bayesian probability in context

In practice, the key to Bayesian probability is Bayes’ rule, by which the estimated probability of hypotheses is updated depending on the likelihood of new evidence against those hypotheses. (P(H|E)/P(H’|E) = {P(E|H)/P(E|H’)}•{P(H)/P(H’)}.) But estimated probabilities depend on the ‘context’ or epoch, which may change without our receiving any data. Thus, as Keynes and Jack Good point out, the probabilities should really be qualified by context, as in:

P(H|E:C)/P(H’|E:C) = {P(E|H:C)/P(E|H’:C)}•{P(H:C)/P(H’:C)}.

That is, the result of applying Bayes’ rule is conditional on our being in the same epoch. Whenever we consider data, we should not only consider what it means within our assumed context, but whether it has implications for our context.

Taking a Bayesian approach, if G is a global certain context and C a sub-context that is believed but not certain, then taking

P(E|H:G) = P(E|H:C)  etc

is only valid when P(C:G) ≈ 1,

both before and after obtaining E. But by Bayes’ rule, for an alternative C’:

P(C|E:G)/P(C’|E:G) = {P(E|C:G)/P(E|C’:G)}•{P(C:G)/P(C’:G)}.

Thus, unless the evidence, E, is at least as likely for C as for any other possible sub-context, one needs to check that P(C|E:G) ≈ 1. If not then one may need to change the apparent context, C, and compute P(H|E:C’)/P(H’|E:C’) from scratch: Bayes’ rule in its common – simple – form does not apply. (There are also other technical problems, but this piece is about epochs.) If one is not certain about the epoch then for each possible epoch one has a different possible Bayesian probability, a form of Knightian uncertainty.
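The check on the context can be sketched for the case of just two candidate sub-contexts, C and C’. The numbers and the 0.95 threshold for ‘≈ 1’ are hypothetical.

```python
def context_posterior(prior_c, likelihood_c, likelihood_alt):
    """P(C|E:G) when C and C' are the only candidate sub-contexts.

    prior_c        -- P(C:G), so that P(C':G) = 1 - prior_c
    likelihood_c   -- P(E|C:G)
    likelihood_alt -- P(E|C':G)
    """
    numerator = likelihood_c * prior_c
    return numerator / (numerator + likelihood_alt * (1.0 - prior_c))


# Hypothetical case: C was believed with probability 0.99, but the new
# evidence is far more likely under the alternative context.
posterior_c = context_posterior(prior_c=0.99, likelihood_c=0.001, likelihood_alt=0.5)

NEAR_CERTAIN = 0.95  # hypothetical threshold for 'P(C|E:G) is about 1'
context_still_safe = posterior_c >= NEAR_CERTAIN
```

Here a context that was almost certain beforehand is no longer so after a single datum, so probabilities computed within C can no longer be read as probabilities within G.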

### Representing uncertainty across epochs

Sometimes a change of context means that everything changes. At other times, some things can be ‘read across’ between epochs. For example, suppose that the hypotheses, H, are the same but the likelihoods P(E|H:C) change. Then one can use Bayes’ rule to maintain likelihoods conditioned on possible contexts.

### Information

Shannon’s mathematical information theory builds on the notion of probability. A probability distribution determines an entropy, whilst information is measured by the change in entropy. The familiar case of Shannon’s theory (which is what people normally mean when they refer to ‘information theory’) makes assumptions that imply Bayesian probability and a single epoch. But in the real world one often has multiple or ambiguous epochs. Thus conventional notions of information are conditional on the assumed context. In practice, the component of ‘information’ measured by the conventional approach may be much less important than that which is neglected.

### Shocks

It is normally supposed that pieces of evidence are independent and so information commutes:

P(E1+E2|H:C) = P(E1|H:C).P(E2|H:C) = P(E2|H:C).P(E1|H:C) = P(E2+E1|H:C)

But if we need to re-evaluate the context, C, this is not the case unless we re-evaluate old data against the new context. Thus we may define a ‘shock’ as anything that requires us to change the likelihood functions for either past or future data. Such shocks occur when the data had been considered unlikely according to an assumption. Alternatively, if we are maintaining likelihoods against many possible assumptions, a shock occurs when none of them remains probable.
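A minimal sketch of these ideas follows. The two contexts, their likelihood functions and the threshold for ‘unlikely’ are all hypothetical. It flags a shock when a datum is unlikely under every context being maintained, and shows that, provided all past data are re-evaluated against the same context, the order of the data does not matter.

```python
import math

# Hypothetical likelihood functions P(E|C) for two candidate contexts.
contexts = {
    "calm": {"up": 0.6, "down": 0.399, "crash": 0.001},
    "volatile": {"up": 0.45, "down": 0.45, "crash": 0.1},
}


def total_woe(data, likelihood):
    """Sum of log-likelihoods of independent data under one context."""
    return sum(math.log(likelihood[e]) for e in data)


def is_shock(datum, maintained, threshold=0.01):
    """True when the datum is unlikely under every maintained context."""
    return all(contexts[name][datum] < threshold for name in maintained)


history = ["up", "down", "up", "crash"]

# Maintaining only the 'calm' assumption, the crash is a shock ...
shock_single = is_shock("crash", ["calm"])
# ... but not if the 'volatile' context was being maintained as well.
shock_multi = is_shock("crash", ["calm", "volatile"])

# After a shock, re-evaluate all the old data against the new context.
woe_calm = total_woe(history, contexts["calm"])
woe_volatile = total_woe(history, contexts["volatile"])
woe_volatile_reordered = total_woe(list(reversed(history)), contexts["volatile"])
```

Maintaining several contexts in parallel costs more bookkeeping, but it turns some would-be shocks into ordinary updates.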