Good’s Good Thinking

I.J. Good, Good Thinking: The Foundations of Probability and Its Applications, University of Minnesota Press, 1983

This builds on his Probability and the Weighing of Evidence, 1950, and on Keynes. It is essential for anyone interested in inference under uncertainty, the foundations and limitations of probability theories, and data and information fusion: it covers their potential pitfalls, how to recognize when the going is difficult, and some tips on what to do about it. It is of interest partly because it gives some insight into the ‘modern Bayesian’ approach developed and applied to such great effect at Bletchley Park (where Good was Turing’s statistical assistant), and partly because some contemporary Bayesians claim Good as one of their leading lights.

The book is built around previously published papers: a fresh start might have been more accessible. What follows is my attempt at a ‘big picture’ with highlights.

Introduction

The aim is to support ‘clear reasoning’ about probability, which is used in the Keynesian sense rather than necessarily being linked to a theory of utility maximization.

“… I consider (i) that, even if physical probability [presumably Popper’s propensity] exists, which I think is probable, it can be measured only with aid of subjective (personal) probability [cf Ramsey] and (ii) that it is not always possible to judge whether one subjective probability is greater than another, in other words, that subjective probabilities are only ‘partially ordered’ [cf Keynes]. This approach comes to the same as assuming ‘upper and lower’ subjective probabilities, that is, assuming that a probability is interval valued.”

“… the concepts of probability and weight of evidence … helped greatly in work that eventually led to the destruction of Hitler.”

“… . Turing [had] suggested in 1940 or 1941 that certain kinds of experiments could be evaluated by their expected weights of evidence (per observation) ∑ p_i log(p_i/q_i) … . … The concept of weight of evidence completely captures that of the degree to which evidence corroborates a hypothesis. I think it is almost as much an intelligence amplifier as the concept of probability itself … .”
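The expected weight of evidence in that quotation can be computed directly. The following is a minimal sketch, assuming two fully specified hypotheses that assign probabilities p_i and q_i to the same finite set of outcomes; the function name and the coin example are illustrative assumptions, not taken from the book.

```python
import math

def expected_weight_of_evidence(p, q):
    """Expected weight of evidence per observation, sum_i p_i * log(p_i / q_i),
    in favour of the hypothesis with probabilities p over the one with
    probabilities q, when p is true. Natural logarithms are used.
    Outcomes with p_i == 0 contribute nothing; q_i == 0 with p_i > 0
    would make the sum infinite and raises ZeroDivisionError here."""
    if len(p) != len(q):
        raise ValueError("p and q must list the same outcomes")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: a biased coin versus a fair coin.
biased = [0.7, 0.3]
fair = [0.5, 0.5]
print(expected_weight_of_evidence(biased, fair))  # positive: each toss helps on average
print(expected_weight_of_evidence(fair, fair))    # identical hypotheses: nothing to gain
```

The sum is never negative, which is one way of seeing why an experiment can be rated by how large it is: the larger the value, the fewer observations are needed on average to tell the two hypotheses apart.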

“[A good description of Good’s position] would be ‘centrist’ or ‘eclectic,’ or perhaps ‘left center’ or advocating a Bayes/non-Bayes compromise.”

Part I. Bayesian Rationality

Ch. 1, Rational Decisions

This links probability to rationality.

“It is worth looking for unity in the methods of statistics, science and rational thought and behaviour … .”

[On Jeffreys’ argument for universal credibilities, i.e. that the probability P(E|H) always exists.] “The only way to assess the cogency of this argument, if it can be assessed at all, is by the methods of experimental science whose justification is by means of a subjective theory.”

“… ‘science deals only with what is objective’ … is false.”

“The principle of maximizing the expected utility can be made to look fairly reasonable in terms of the law of large numbers, provided that none of the utilities are very large. …”

Good provides an initial axiomatization in which probabilities are numbers, but we may have only imprecise estimates of them.

Good highlights developments in scoring and hierarchies of probabilities, and argues that managers of large firms should understand the concept of Bayesian rationality. Sometimes it is rational to make a quick rough and ready decision rather than expend time and effort on determining the ideal decision. Gambling may also be rational, if it gives us pleasure. However, the conventional theory looks less reasonable for critical decisions, where utilities are large.

Ch. 2, Twenty-seven principles of Rationality

Good’s ‘priggish principles’ include:

“8. … state what judgments you have used and which parts of your argument depend on which judgements. The advantage of likelihood is its mathematical independence of initial distributions (priors), and similarly the advantage of weight of evidence is its mathematical independence of the initial odds of the null hypothesis. The subjectivist states his judgements whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.”

“10. … evolving or sliding [or dynamic] probabilities … can change in the light of thinking only, without change of empirical evidence [as in a game of chess or cosmology]. … Evolving probabilities are essential for the refutation of Popper’s views on simplicity.”

Ch. 3, 46656 Varieties of Bayesians

A classification of variants of Bayesianism, based around 11 factors, including factor 11, ‘Probability types’, where the ‘priors’ could be parametrized [or conditional]. This leads to non-Bayesian statistics, such as likelihood-ratio tests.

Ch. 4, The Bayesian Influence …

“… or How to Sweep Subjectivism under the Carpet”. A wide-ranging survey of the topic, distinguishing his ‘Doogian’ view from that of other Bayesians.

“In some philosophies of rationality, a rational man is defined as one whose judgment of probabilities, utilities … are all both consistent and sharp or precise. Rational men do not exist … [but it] can be regarded as an ideal. …”

“In practice one’s judgments are not so sharp, so that to use the most familiar axioms it is necessary to work with judgements of inequalities. …  We thus arrive at a theory that can be regarded as a combination of the theories espoused by F.P. Ramsey … , and of J.M. Keynes, who emphasised the importance of inequalities (partial ordering) but regarded logical probability as the fundamental concept, … .”

“Since the degrees of belief, concerning events over which he has no control … should surely not depend on whether he intends to use his beliefs in any specific manner, it seems desirable to have justifications that do not mention preferences or utilities. But utilities necessarily come in whenever the beliefs are to be used in a practical problem involving action.”

“… you should make your assumptions clear and you should try to sort out the part that is disputable from the part that is less so. …. the initial probability of the null hypothesis is often disputable. … There is much less dispute about likelihoods. … the bad subjectivists are the people with bad or dishonest judgment and also the people who do not make their assumptions clear … . But, on the other hand there are no good 100% (hidebound) objectivists; they are all bad because they sweep their judgments [under the carpet].”

The weight of evidence in favour of one hypothesis, H, over another, H’, due to evidence, E, is defined as the log of the likelihood ratio, log(P(E|H)/P(E|H’)). Good argues that the likelihoods tend to be better defined than the initial probabilities even for composite hypotheses.
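A small sketch may make the definition concrete. It assumes natural logarithms and, for the additive update, that the pieces of evidence are independent given each hypothesis; the function names and the numbers are illustrative assumptions.

```python
import math

def weight_of_evidence(p_e_given_h, p_e_given_alt):
    """log(P(E|H) / P(E|H')): the weight of evidence for H over H' due to E."""
    return math.log(p_e_given_h / p_e_given_alt)

def posterior_odds(prior_odds, weights):
    """Odds of H against H' after adding the given weights of evidence.
    Adding weights assumes the pieces of evidence are independent
    given H and given H' (an assumption of this sketch)."""
    return prior_odds * math.exp(sum(weights))

# Hypothetical numbers: each observation is four times as probable under H.
w = weight_of_evidence(0.8, 0.2)
print(w)                             # log 4
print(posterior_odds(1.0, [w, w]))   # even initial odds become about 16 to 1
print(posterior_odds(0.01, [w, w]))  # the same weights applied to sceptical initial odds
```

The last two lines illustrate the separation Good stresses: the weights depend only on the likelihoods, while the disputable initial odds enter as a separate, explicit input.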

Comment

Good rightly points out that probability estimates depend on theories whose validity depends on subjective assumptions. Hence probability is never just a number, but should always refer to assumptions. In Keynes’ terms, all probabilities are conditional and the conditions ought to be explicit. Good has noted that the assumptions of probability theory make most sense for routine, not critical, decisions, and for decisions that cannot influence what actually happens. Otherwise, Good’s treatment largely mirrors the conventions of probability theories. For example, he supposes that conventional rationality is ‘ideal’. But as Whitehead and Binmore have pointed out, this may not be reasonable in the long run. Most of Good’s exposition makes most sense as being about extrapolation or projection in the short run.

Part II. Probability

Ch. 6, Kinds of Probability

“… James Bernoulli [in] his ‘law of large numbers’ states that in n ‘trials’, each with probability p of success, the number of successes will probably be close to pn if n is large. … he tried to apply the theorem to social affairs in which this definition is hardly appropriate. Worse yet: the probability is likely to be variable.”

“… Laplace’s … ‘law of succession’ … leads, for example, to the conclusion that anything that has been going on for a given length of time has probability close to 1/2 of going on for the same length of time again. This does not seem to me too bad a rule of thumb if it is applied with common sense.”
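The ‘probability close to 1/2’ can be checked with a short calculation. This sketch assumes the textbook form of Laplace’s rule (a uniform prior over the unknown success probability, discrete trials), under which n successes in n trials give probability (n+1)/(n+m+1) that the next m trials all succeed; the function names are illustrative.

```python
from fractions import Fraction

def prob_continues(n, m):
    """P(next m trials all succeed | first n trials all succeeded),
    assuming a uniform prior on the success probability."""
    return Fraction(n + 1, n + m + 1)

def prob_continues_stepwise(n, m):
    """The same quantity built up one trial at a time from the rule of
    succession: after k successes in k trials, P(next succeeds) = (k+1)/(k+2)."""
    prob = Fraction(1)
    for k in range(n, n + m):
        prob *= Fraction(k + 1, k + 2)
    return prob

for n in (1, 10, 1000):
    print(n, prob_continues(n, n), float(prob_continues(n, n)))
```

Setting m = n gives (n+1)/(2n+1), which approaches 1/2 from above as n grows.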

Comment

Good recognizes that statistics drawn from life need not be stable / stationary, but suggests that any stability has roughly an even chance of lasting as long again as it has so far. This does not seem reasonable when applied to situations in which there are antagonistic learning or evolutionary processes, as in conflict or economics, although it may reflect how people actually think.

Ch. 9, … Hierarchical Bayesian Methodology

“If you can confidently say that a logical probability lies in an interval [a,b] it is natural to think it is more likely to be near the middle of this interval than to the end of it; or perhaps one should convert to log-odds to express a clear preference for the middle. [Taking the middle of the log-odds interval is an invariant rule under addition of the weight of evidence.]”
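The invariance claim can be checked numerically. In this sketch the interval endpoints and the weight of evidence are arbitrary illustrative values, and an ‘update’ means adding a weight of evidence (in natural-log units) to the log-odds.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def expit(x):
    """Probability corresponding to log-odds x."""
    return 1 / (1 + math.exp(-x))

def update(p, w):
    """Probability after adding weight of evidence w to the log-odds of p."""
    return expit(logit(p) + w)

def log_odds_midpoint(a, b):
    """Probability at the middle of the interval [a, b] on the log-odds scale."""
    return expit((logit(a) + logit(b)) / 2)

a, b, w = 0.2, 0.6, 1.5  # hypothetical interval and weight of evidence

# Midpoint then update, versus update the endpoints then take the midpoint.
print(update(log_odds_midpoint(a, b), w), log_odds_midpoint(update(a, w), update(b, w)))

# The plain arithmetic midpoint does not commute with updating in this way.
print(update((a + b) / 2, w), (update(a, w) + update(b, w)) / 2)
```

The first pair agrees because evidence shifts both endpoints by the same amount on the log-odds scale; the second pair differs, which is presumably why the log-odds scale is preferred.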

Comment

Applying this rule minimises the impact of using an interval representation. It seems reasonable for natural science, but less so in a social setting. For example, if a competitor has been behaving in a way characterised by a proportion p and we know that they have changed, we may have a bi-modal subjective distribution for the behaviour, centred on p but zero there. It is thus important to distinguish between Good’s mathematics and his tips.

Ch. 14, … Hypothesis Testing

3. The Testing of Hypotheses

Good discusses composite hypotheses:

“H is the logical disjunction ‘H1 or H2 or H3 or … ‘. … The probabilities P(Hi|H) and P(E|Hi) are assumed to exist and P(E|H) can then ‘in principle’ be calculated from the formula

P(E|H) = P(E|H1).P(H1|H) + P(E|H2).P(H2|H) + …

Then the Bayes factor … can in principle be calculated, and its logarithm is the weight of evidence … . … The main objection to this Bayesian approach is of course that it is usually difficult to specify the probabilities P(Hi|H) with much precision. The Doogian reply is that we cannot dispense with judgments of probabilities and of probability inequalities; also that non-Bayesian methods also need Bayesian judgments and merely sweep them under the carpet.”

“The likelihood-ratio statistic for testing a hypothesis H “within” a hypothesis H’ (where H’ at least is composite) is defined as max_i P(E|Hi) / max_j P(E|Hj) … .
Notice how the likelihood ratio is analogous to a Bayes factor which would be defined as a ratio of weighted averages of P(E|Hi) and of P(E|Hj). The weights … are verboten by the non-Bayesian.  … one hopes that the Bayesian averages are roughly related to the maximum likelihoods. …  the … likelihood ratio is the non-Bayesian’s way of paying homage to a Bayesian procedure.”
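The analogy between the two ratios can be seen in a toy calculation. This is only a sketch: the evidence is taken to be k heads in n tosses, each component hypothesis fixes the probability of heads, and the equal weights on the components of H’ are an arbitrary assumption of exactly the kind the non-Bayesian declines to make.

```python
from math import comb

def likelihood(k, n, p):
    """P(E|Hi): probability of k heads in n tosses when P(head) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def averaged(k, n, components):
    """Bayesian likelihood of a composite hypothesis: the weighted average
    of the component likelihoods. components = [(p_i, weight_i), ...]."""
    return sum(weight * likelihood(k, n, p) for p, weight in components)

def maximized(k, n, components):
    """Maximum component likelihood, as used in the likelihood-ratio statistic."""
    return max(likelihood(k, n, p) for p, _ in components)

def bayes_factor(k, n, h, h_prime):
    return averaged(k, n, h) / averaged(k, n, h_prime)

def likelihood_ratio(k, n, h, h_prime):
    return maximized(k, n, h) / maximized(k, n, h_prime)

# Hypothetical hypotheses: H says the coin is fair; H' allows P(head) = 0.1, ..., 0.9.
H = [(0.5, 1.0)]
H_PRIME = [(i / 10, 1 / 9) for i in range(1, 10)]

for k in (5, 7, 9):
    print(k, bayes_factor(k, 10, H, H_PRIME), likelihood_ratio(k, 10, H, H_PRIME))
```

Because H lies within H’, the likelihood ratio can never exceed 1, whereas the Bayes factor can favour H when the data are unremarkable. The two statistics are related but not interchangeable.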

Comment

Good leaves open, without appearing to have given it any consideration, the question of which approach is valid. If a hypothesis is ‘not too composite’ then both will approximate to a statistical hypothesis, and hence the results shouldn’t be too far in error. However, consider the hypotheses ‘fair coin’ and ‘double-sided coin’. For a set, E, of n coin tosses, P(E|Fair) = 0.5^n. According to the Bayesian approach, if we regard double Heads as much more probable than double Tails then for short runs P(Head run|Double) ≈ 1, P(Tail run|Double) ≈ 0, P(Mixed|Double) = 0. Hence a short run of tails will be taken as evidence for ‘Fair’. On the other hand, the likelihood ratio approach suggests a ‘generalized likelihood’, max_i P(E|Hi). One then has P(Head run|Double) = 1, P(Tail run|Double) = 1, P(Mixed|Double) = 0. Thus any run is evidence of Double, which is correct.

Now, the key feature of the Keynes/Turing/Good approach is that (for non-composite hypotheses) the expected weight of evidence for the correct hypothesis is never exceeded by that for any other hypothesis. Any errors are due to statistical variability arising from lack of evidence. If one uses the generalized likelihood then one has the same assurance even for composite hypotheses. This is not the case for the Bayesian likelihood that Good recommends. We can illustrate this by imagining two urns. One contains fair coins, the other double-sided coins. An urn is selected. Repeatedly, a coin is selected from that urn, tossed twice, and returned. We want to determine which urn was selected. If we thought the double coins were very likely double heads but they were actually an even mix of double heads and double tails, then using the Bayesian approach we would become increasingly convinced that the urn of fair coins had been selected. Thus, unlike the generalized likelihood approach, the Bayesian approach can tend to be wrong even with an arbitrarily large number of samples.

To be more definite, in case the example is not yet clear, let u1 = {Fair} be an urn containing a fair coin, u2 = {HH,TT} be an urn containing both a double-headed and a double-tailed coin, let H1 = {Fair, …} be the hypothesis that the selected urn contains only fair coins, and – alternatively – let H2 = {HH, … , TT …} be the hypothesis that the urn contains a mix of double-headed coins and double-tailed coins.

Using generalized likelihood, P(HH|H1)=P(TT|H1)=0.25 and P(HH|H2)=P(TT|H2) = 1, so the weight of evidence always supports the true hypothesis, H2, for urn u2.

The contemporary Bayesian approach gives the same result for H1, since it is a ‘statistical hypothesis’. But H2 is not. The approach requires an estimate of the relative proportion of HH and TT in the urn, so that the likelihoods are P(HH|H2) = e, P(TT|H2) = 1-e for some definite e. If the evidence, E, has h double-heads and t double-tails then the Bayesian likelihood is P(E|H2) = e^h.(1-e)^t, whereas P(E|H1) = 0.25^(h+t). Since there is actually one coin of each type, h and t will be roughly equal, and P(E|H2) then falls below P(E|H1) whenever e(1-e) < 1/16. So if e=0.01 (say) then for most sequences longer than 10 (say), the likelihood P(E|H2) will be much smaller than P(E|H1), so that the Bayesian posterior probability is most often greatest for the wrong hypothesis. The problem is the reliance on the estimation of e. Once a Bayesian has made such an estimate of ‘the prior’, they don’t ever change it: a serious limitation. A milder version of the problem arises if TT is actually very infrequent but thought to be as common as HH: the evidence for H2 is then understated.
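The comparison can be made concrete with a few lines of arithmetic. This is a sketch under the assumptions of the urn example: each observation is a pair of tosses that comes out HH or TT, H1 gives each such pair probability 0.25, and the Bayesian treatment of H2 uses a fixed judged proportion e of double-headed coins. The function names and the trial values of e are illustrative.

```python
import math

P_PAIR_FAIR = 0.25  # P(HH|H1) = P(TT|H1) for two tosses of a fair coin

def log_lik_h1(h, t):
    """Log-likelihood of h double-heads and t double-tails under H1."""
    return (h + t) * math.log(P_PAIR_FAIR)

def log_lik_h2_bayes(h, t, e):
    """Bayesian log-likelihood under H2 with judged proportion e: log(e^h (1-e)^t)."""
    return h * math.log(e) + t * math.log(1 - e)

def log_lik_h2_generalized(h, t):
    """Generalized log-likelihood under H2: the best-fitting coin type gives
    each HH or TT outcome probability 1, so the log-likelihood is 0."""
    return 0.0

def weight_for_h2_bayes(h, t, e):
    """Weight of evidence for H2 over H1 using the Bayesian likelihood."""
    return log_lik_h2_bayes(h, t, e) - log_lik_h1(h, t)

def weight_for_h2_generalized(h, t):
    """Weight of evidence for H2 over H1 using the generalized likelihood."""
    return log_lik_h2_generalized(h, t) - log_lik_h1(h, t)

# A balanced sample, as expected from an urn with one coin of each type.
h, t = 10, 10
for e in (0.5, 0.1, 0.01):
    print(e, weight_for_h2_bayes(h, t, e))
print("generalized", weight_for_h2_generalized(h, t))
```

For a balanced sample the Bayesian weight per pair of observations is log(16 e(1-e)), so it turns against the true hypothesis H2 only when e(1-e) < 1/16, that is when e is below roughly 0.067 or above roughly 0.933. The generalized weight is positive whatever the sample.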

Ch. 15, Explicativity …

VI. Induction and the testing of theories

This cites Keynes’ results on induction:

If P(H|G) > 0, and if E_1, E_2, … are all implied by H·G,

1) P(E_{n+1} E_{n+2} … E_{n+m} | E_1 E_2 … E_n G) → 1 as m and n tend to infinity in any manner.

2) If moreover P(E_1 … E_n | Hc·G) → 0 as n → ∞ (where Hc is the complement of H) then

P(H | E_1 E_2 … E_n G) → 1
[The first probability is a generalized likelihood, so the assumption is that all possibilities other than H tend to be ruled out.]
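A toy case shows both results at work. This sketch assumes a background theory G under which a coin is either double-headed (H) or fair (the complement of H), and takes each observation to be ‘the toss came up heads’, which H·G implies. The prior of 1/1000 is an arbitrary illustrative choice.

```python
from fractions import Fraction

def prob_evidence(n, prior_h):
    """P(E_1 ... E_n | G): n heads in a row. H gives this probability 1,
    the fair-coin alternative gives it 2^-n."""
    return prior_h + (1 - prior_h) * Fraction(1, 2**n)

def predictive(n, m, prior_h):
    """First result: P(next m tosses all heads | first n were heads, G)."""
    return prob_evidence(n + m, prior_h) / prob_evidence(n, prior_h)

def posterior_h(n, prior_h):
    """Second result: P(H | first n tosses were heads, G)."""
    return prior_h / prob_evidence(n, prior_h)

prior = Fraction(1, 1000)  # hypothetical initial probability P(H|G) > 0
for n in (5, 10, 20, 40):
    print(n, float(predictive(n, 1000, prior)), float(posterior_h(n, prior)))
```

Both columns approach 1 as n grows. The second does so only because the fair-coin alternative makes a long run of heads vanishingly improbable; with an alternative that also predicted heads, only the first column would converge.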

Good shows that, consequently, a printed language cannot be a finite order Markov process. He goes on:

“Provided that we are prepared to estimate probabilities … conditional on … reasonably well specified theories, then the Second Theorem of Induction can be used to achieve near certainty. But … this is conditional .. . [Often] we cannot be sure that we have overlooked some other theories, so that the First Theorem of Induction must be used … it refers to the near certainty of observational results deduced from theories and not to the theories themselves.”

These are special cases of estimating probabilities in multinomial distributions.

Comment

Here induction is conditional on a theory, G. One needs to consider separately whether G will continue to hold. To be able to induce G one would need some other theory, and so on, as in Whitehead. It is implicit that there are some ‘rules of the game’ that will continue to hold.

This notion of induction does not apply to emergent properties, since then the evidence is not implied by the hypotheses.

Part IV. Information and Surprise

Ch. 16, … Describing and Measuring Uncertainty

2. Scientific Theories

“… The function of the theory is to introduce a certain amount of objectivity into your subjective body of judgements, to act as shackles on it, to detect inconsistencies in it, and to increase its size by the addition of discernments.”

Comment

The scientist, as scientist, is interested in both checking theory and in extending it ‘by the addition of discernments’. An engineer, in contrast, is more interested in applications, which may involve extensions. The theory is, at least by default, taken ‘as read’. Sometimes a scientist takes on the role of engineer, as in many companies. In this case they may be scientistic: appearing to be scientific but with engineering motivations. It is important not to be misled about the extent to which a theory gains credence by such applications.

(5). Justification of the principle of rational behavior

“The simplest attempt to justify the principle of maximizing expected utilities is in terms of some form of the law of large numbers. This law is inapplicable if some of the utilities are very large. … Nevertheless … the principle of rational behavior is consistent with the theory of probability. .. and then the theory can be regarded as containing an implicit definition of probability and utility … .”

Comment

This echoes Shackle’s concerns about critical decisions.  Rationality is credible, but not reasonable.

Part V. Causality and Explanation

Ch. 21, A Causal Calculus

“… starting from very reasonable desiderata, there is a unique meaning … that can be attached to ‘the tendency of one event [F] to cause another one [E].’ … ”

“[Namely,] the weight of evidence against F if E does not happen.”
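Read through the weight-of-evidence formula log(P(E|H)/P(E|H’)), the quoted definition becomes log(P(not E|not F)/P(not E|F)). The sketch below computes it; the function name and the numbers are illustrative assumptions, and all probabilities are understood as conditional on the prevailing background theory.

```python
import math

def causal_tendency(p_e_given_f, p_e_given_not_f):
    """Weight of evidence against F provided by the non-occurrence of E:
    log(P(not E | not F) / P(not E | F)).
    Undefined (division by zero) if F makes E certain."""
    return math.log((1 - p_e_given_not_f) / (1 - p_e_given_f))

# Hypothetical numbers.
print(causal_tendency(0.9, 0.1))  # F strongly promotes E: large positive tendency
print(causal_tendency(0.3, 0.3))  # E is independent of F: zero tendency
print(causal_tendency(0.1, 0.9))  # F inhibits E: negative tendency
```

The measure is zero when F makes no difference to E and grows without bound as F comes close to guaranteeing E.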

Comment

As before, the weight of evidence depends on the prevailing theory, G. Thus in practice we may cite either F or G as the cause.

Ch. 23, Explicativity

Good introduces a notation for dynamic probability:

“Here P0(E) is the initial probability of E, judged before H is brought to your attention, whereas P1(E|H) is the conditional probability of E given [H, judged] after H is brought to your attention.”

“… it can easily happen that P1(E|H) … is much larger than P0(E).”

This is used in determining explicativity, relating to weights of evidence etc. Some common-sense equalities and inequalities do not hold for such dynamic probabilities.

Good also introduces the notion of a factor γ used to discount prior probabilities. The extremes (0, 1) are familiar, but:

“Neither of these two philosophies is tenable … although reasonably judged to be good enough in most circumstances.”

“… the size of γ depends on how objectionable we regard it to have clutter, or to ‘multiply entities without necessity’.”

This is a reference to Ockham’s razor. Later:

“The sharpened razor is the recommendation to choose the hypothesis that maximizes the explicativity with respect to E, for all known evidence.”

If one is taking repeated trials, as in a typical experiment, it turns out that γ plays no role, as long as it is not 1 (no discounting). But the value of γ makes a difference more generally.

Overall Comment

Good has linked the mathematics of Keynes’ Treatise to the metaphysics of Popper, revealing some important implicit assumptions. He differs from Keynes in considering subjective probabilities, but they are tempered by evidence and a concern for hidden assumptions, and hence not so very far in spirit from Keynes’ logical probabilities. Good notes that the commonplace notion of rationality links to the commonplace notion of probability and to commonplace notions of science, and that while such commonplace notions often seem appropriate, there are exceptions.

Keynesian induction is key. If the world is considered to be law-like, with some set of general rules G, and there is a refinement, H, with non-zero probability, then a continuing set of observations implied by H tends to be almost certain. The problem is to find a set of rules (G’, H’) that agree on extrapolations of observations. Thus, if we suppose that the system being observed is rational, it is rational to apply conventional, rational, science. This is the situation in traditional natural science but not, for example, the social sciences. A more specialised case is engineering, where the laws, G, are supposed to be known and the refinement, H, is some set of parameters. In this case the parameters can be taken to be ‘confirmed’ in so far as the observations that are implied by them are confirmed. We need not worry about whether the theory is ‘correct’, but only whether it gives the correct answers.

A stronger form of induction is where alternative theories are increasingly falsified. In this case our theory becomes increasingly probable. But such situations are not to be found, except where there is an overarching supra-theory that constrains the choice of  sub-theory, so the correctness of the sub-theory is conditional on the supra-theory. There is no way to justify an unconditional probability for an empirical supra-theory: one must have recourse to ideology.

The book is a series of papers, which Good never quite pulls together. He notes that the law of large numbers in its usual form is not valid. But the mathematical version, concerned with samples from probability distributions, is valid, as is the maximality of the expectation of the generalized likelihood for a true hypothesis. Thus we can use the theory to identify the best model out of a set of models, even if we can’t be sure that reality resembles our model. We can also check that reality is consistent with that model.

We can extrapolate from the past, but we cannot be sure that the current situation will endure. Often we will want nested models, where our current hypothesis H is dependent on some overarching regularity, G. Then we need to consider both the extrapolation of H, assuming G does not change, and the potential for G to change. Probability, information, value and entropy are all defined relative to H, and hence only relevant so long as G does not change. Hence a situation needs to be considered on a number of levels. If G might well change then a utility maximization approach is no longer appropriate. Instead one may want to, for example, minimize the probability of failure. But this takes us beyond Good.
