Statistical Inference: Keynes’ Treatise
Keynes’ Treatise on Probability discusses statistical inference, including ‘the law of large numbers’, at some length, cautioning against the unprincipled application of techniques with hidden assumptions. But at the end he seems to recant, noting that science often finds great homogeneity and regularity, and thus tends to justify the assumptions implicit in the statistical methods that it relies upon. Thus while the conventional theory is ill-founded in principle, Keynes invites us to suppose that in practice the methods that it advocates work well enough. The Treatise was largely written before his Great War experience, where he found things considerably less homogeneous and regular than had been the case in peace-time. In such cases, his detailed argumentation on complexity and uncertainty seems far from academic: for those faced with challenging real-world situations, Keynes’ reasoning is more important than his conclusions.
Part V The Foundations of Statistical Inference
Ch. XXVII The Nature of Statistical Inference
1. The Theory of Statistics, as it is now understood, can be divided into two parts which are for many purposes better kept distinct. The first function of the theory is purely descriptive. It devises numerical and diagrammatic methods by which certain salient characteristics of large groups of phenomena can be briefly described; and it provides formulae by the aid of which we can measure or summarise the variations in some particular character which we have observed over a long series of events or instances. The second function of the theory is inductive. It seeks to extend its description of certain characteristics of observed events to the corresponding characteristics of other events which have not been observed. This part of the subject may be called the Theory of Statistical Inference ; and it is this which is closely bound up with the theory of probability.
2. … The statistician, who is mainly interested in the technical methods of his science, is less concerned to discover the precise conditions in which a description can be legitimately extended by induction. He slips somewhat easily from one to the other, and having found a complete and satisfactory mode of description he may take less pains over the transitional argument, which is to permit him to use this description for the purposes of generalisation.
One or two examples will show how easy it is to slip from description into generalisation. Suppose that we have a series of similar objects one of the characteristics of which is under observation; a number of persons, for example; whose age at death has been recorded. We note the proportion who die at each age, and plot a diagram which displays these facts graphically. We then determine by some method of curve fitting a mathematical frequency curve which passes with close approximation through the points of our diagram. If we are given the equation to this curve, the number of persons who are comprised in the statistical series, and the degree of approximation (whether to the nearest year or month) with which the actual age has been recorded, we have a very complete and succinct account of one particular characteristic of what may constitute a very large mass of individual records. In providing this comprehensive description the statistician has fulfilled his first function. But in determining the accuracy with which this frequency curve can be employed to determine the probability of death at a given age in the population at large, he must pay attention to a new class of considerations and must display a different kind of capacity. He must take account of whatever extraneous knowledge may be available regarding the sample of the population which came under observation, and of the mode and conditions of the observations themselves. Much of this may be of a vague kind, and most of it will be necessarily incapable of exact, numerical, or statistical treatment. He is faced, in fact, with the normal problems of inductive science, one of the data, which must be taken into account, being given in a convenient and manageable form by the methods of descriptive statistics.
The truth of this is obvious; yet, not unnaturally, the more complicated and technical the preliminary statistical investigations become, the more prone inquirers are to mistake the statistical description for an inductive generalisation. This tendency, which has existed in some degree, as, I think, the whole history of the subject shows, from the eighteenth century down to the present time, has been further encouraged by the terminology in ordinary use. For several statistical coefficients are given the same name when they are used for purely descriptive purposes, as when corresponding coefficients are used to measure the force or the precision of an induction. The term ‘probable error’ for example, is used both for the purpose of supplementing and improving a statistical description, and for the purpose of indicating the precision of some generalisation. The term ‘correlation’ itself is used both to describe an observed characteristic of particular phenomena and, in the enunciation of an inductive law which relates to phenomena in general.
Here Keynes quotes Whitehead:
“There is no more common error than to assume that, because prolonged and accurate mathematical calculations have been made, the application of the result to some fact of nature is absolutely certain.”
Keynes’ conception of probability is relative to what you know. Suppose that you know that 10% of the British population have red hair, and have no other statistical data. What is the probability that Luigi, a British citizen of Italian stock, has red hair? A common practice is to ignore the factors for which you have no statistics, and to use the data for the smallest population that includes the subject, in this case 10%. Keynes notes that this would only be valid if we knew that the factors that have been ignored are irrelevant. In this case, I do not think they are.
A solution to this would be to say that the probability is 10%, conditional on the assumption that the ignored factors are irrelevant. That leaves open the probability if we do not believe the assumption. But in some cases we could go further. For example, if the 10% figure is based on a random sample and 20% of the population are of Italian stock, then the probability is at most 50% (50% of 20% is 10%). Even then do we know if Luigi is Italian for red-head?
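The 50% ceiling can be checked with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and the 10% and 20% figures are the made-up numbers from the example above, not data. It computes the widest range for the share of a group having a trait when only the two overall proportions are known.

```python
from fractions import Fraction

def conditional_bounds(p_trait, p_group):
    """Bounds on P(trait | group) given only P(trait) and P(group).

    The upper bound is reached when everyone with the trait is in the group;
    the lower bound when as many as possible are outside it.
    """
    if not (0 <= p_trait <= 1 and 0 < p_group <= 1):
        raise ValueError("need 0 <= p_trait <= 1 and 0 < p_group <= 1")
    lower = max(Fraction(0), (p_trait + p_group - 1) / p_group)
    upper = min(Fraction(1), p_trait / p_group)
    return lower, upper

# Illustrative figures from the text: 10% red-haired, 20% of Italian stock.
lower, upper = conditional_bounds(Fraction(1, 10), Fraction(1, 5))
print(f"between {lower} and {upper}")  # prints: between 0 and 1/2
```

On these figures the data alone leave the answer anywhere between 0 and 50%, which is why the 10% figure only follows from the extra assumption of irrelevance.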
Ch. XXVIII The Law of Great Numbers
Keynes says of Poisson’s introduction of the ‘law’, now called ‘the law of large numbers’:
1. … This is the language of exaggeration; it is also extremely vague. But it is exciting; it seems to open up a whole new field to scientific investigation; and it has had a great influence on subsequent thought. Poisson seems to claim that, in the whole field of chance and variable occurrence, there really exists, amidst the apparent disorder, a discoverable system. Constant causes are always at work and assert themselves in the long run, so that each class of event does eventually occur in a definite proportion of cases. It is not clear how far Poisson’s result is due to à priori reasoning, and how far it is a natural law based on experience; but it is represented as displaying a certain harmony between natural law and the à priori reasoning of probabilities.
On applications of the supposed law, Keynes notes:
2. The existence of numerous instances of the Law of Great Numbers, or of something of the kind, is absolutely essential for the importance of Statistical Induction. Apart from this the more precise parts of statistics, the collection of facts for the prediction of future frequencies and associations, would be nearly useless. But the ‘Law of Great Numbers’ is not at all a good name for the principle which underlies Statistical Induction. The ‘Stability of Statistical Frequencies’ would be a much better name for it. The former suggests, as perhaps Poisson intended to suggest, but what is certainly false, that every class of event shows statistical regularity of occurrence if only one takes a sufficient number of instances of it. It also encourages the method of procedure, by which it is thought legitimate to take any observed degree of frequency or association, which is shown in a fairly numerous set of statistics, and to assume with insufficient investigation that, because the statistics are numerous, the observed degree of frequency is therefore stable. Observation shows that some statistical frequencies are, within narrower or wider limits, stable. But stable frequencies are not very common, and cannot be assumed lightly.
Later Keynes studied economics, in which one could have long stabilities punctuated by shocks, in contrast to this supposed law. But the law can be saved by restricting it to random variables drawn from (Bayesian) probabilistic mechanisms, noting that although economics is not deterministic, it appears not to be probabilistic either: it is uncertain.
Ch. XXIX The Use of A Priori Probabilities for the Prediction of Statistical Frequency …
1. Bernoulli’s Theorem [concerning the variability of sample proportions] is generally regarded as the central theorem of statistical probability. It embodies the first attempt to deduce the measures of statistical frequencies from the measures of individual probabilities, and …out of it the conception first arose of general laws amongst masses of phenomena, in spite of the uncertainty of each particular case. But, as we shall see, the theorem is only valid subject to stricter qualifications, than have always been remembered, and in conditions which are the exception, not the rule.
5. … Thus Bernoulli’s Theorem is only valid if our initial data are of such a character that additional knowledge, as to the proportion of failures and successes in one part of a series of cases is altogether irrelevant to our expectation as to the proportion in another part. …
Such a condition is very seldom fulfilled. If our initial probability is partly founded upon experience, it is clear that it is liable to modification in the light of further experience. It is, in fact, difficult to give a concrete instance of a case in which the conditions for the application of Bernoulli’s Theorem are completely fulfilled.
7. It seldom happens, therefore, that we can apply Bernoulli’s Theorem with reference to a long series of natural events. For in such cases we seldom possess the exhaustive knowledge which is necessary. Even where the series is short, the perfectly rigorous application of the Theorem is not likely to be legitimate, and some degree of approximation will be involved in utilising its results.
Adherents of the Frequency Theory of Probability, who use the principal conclusion of Bernoulli’s Theorem as the defining property of all probabilities, sometimes seem to mean no more than that, relative to given evidence, every proposition belongs to some series, to the members of which Bernoulli’s Theorem is rigorously applicable. But the natural series, the series, for example, in which we are most often interested, … is not, as a rule, rigorously subject to the Theorem.
9. If, for instance, balls are drawn from a bag, which is one, but it is not certainly known which, out of a number of bags containing black and white balls in differing proportions, the knowledge of the colour of the first ball drawn affects the probabilities at the second drawing, because it throws some light upon the question as to which bag is being drawn from.
This last type is that to which most instances conform which are drawn from the real world. A knowledge of the characteristics of some members of a population may give us a clue to the general character of the population in question. Yet it is this type, where there is a change in knowledge but no change in the material conditions from one instance to the next, which is most frequently overlooked.
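The bag example can be made concrete. The following is a minimal sketch under assumptions that are mine, not Keynes’: two bags with 1/4 and 3/4 white balls, equal prior credence in each, and drawing with replacement so that the material conditions do not change between draws. All names are hypothetical.

```python
from fractions import Fraction

# Hypothetical set-up: share of white balls in each bag, and prior credence in each.
WHITE_SHARE = [Fraction(1, 4), Fraction(3, 4)]
PRIOR = [Fraction(1, 2), Fraction(1, 2)]

def prob_white(credence):
    """Probability that the next ball is white, given credences over the bags."""
    return sum(c * w for c, w in zip(credence, WHITE_SHARE))

def update_on_white(credence):
    """Revised credences over the bags after a white ball has been drawn."""
    joint = [c * w for c, w in zip(credence, WHITE_SHARE)]
    total = sum(joint)
    return [j / total for j in joint]

p_first = prob_white(PRIOR)
p_second_after_white = prob_white(update_on_white(PRIOR))
print(p_first, p_second_after_white)  # prints: 1/2 5/8
```

Nothing about the bag changes between the draws; only our knowledge does, and that is enough to move the probability from 1/2 to 5/8 and so to break the independence that Bernoulli’s Theorem requires.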
Keynes gives the following examples:
5. … For consider the case of a coin of which it is given that the two faces are either both heads or both tails: at every toss, provided that the results of the other tosses are unknown, the probability of heads is 1/2 and the probability of tails is 1/2; yet the probability of m heads and m tails in 2m tosses is zero, and it is certain à priori that there will be either 2m heads or none. Clearly Bernoulli’s Theorem is inapplicable to such a case. And this is but an extreme case of a normal condition.
6. … If we are given a penny of which we have no reason to doubt the regularity, the probability of heads at the first toss is 1/2; but if heads fall at every one of the first 999 tosses, it becomes reasonable to estimate the probability of heads at the thousandth toss at much more than 1/2. For the à priori probability of its being a conjurer’s penny, or otherwise biassed so as to fall heads almost invariably, is not usually so infinitesimally small as (1/2)^1000. We can only apply Bernoulli’s Theorem with rigour for a prediction as to the penny’s behaviour over a series of a thousand tosses, if we have à priori such exhaustive knowledge of the penny’s constitution and of the other conditions of the problem that 999 heads running would not cause us to modify in any respect our prediction à priori.
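Both coin examples can be worked through by treating the coin as one of several possible coins. This is a sketch only: the helper names are hypothetical, and the one-in-a-million prior for a conjurer’s penny is an assumed figure chosen purely for illustration.

```python
from fractions import Fraction
from math import comb

def prob_k_heads(coins, n, k):
    """P(k heads in n tosses) when the coin is one of `coins`.

    `coins` is a list of (prior, chance_of_heads) pairs. Tosses are
    independent only once it is known which coin is in hand.
    """
    return sum(prior * comb(n, k) * p**k * (1 - p)**(n - k) for prior, p in coins)

def prob_next_head(coins, heads_seen):
    """P(heads on the next toss) after `heads_seen` heads in a row."""
    joint = [prior * p**heads_seen for prior, p in coins]
    return sum(j * p for j, (_, p) in zip(joint, coins)) / sum(joint)

# Keynes' first example: the coin is either two-headed or two-tailed.
double_sided = [(Fraction(1, 2), Fraction(1)), (Fraction(1, 2), Fraction(0))]
print(prob_k_heads(double_sided, 1, 1))   # prints: 1/2
print(prob_k_heads(double_sided, 10, 5))  # prints: 0

# Keynes' second example, with an ASSUMED prior of one in a million
# that the penny always falls heads.
eps = Fraction(1, 10**6)
penny = [(1 - eps, Fraction(1, 2)), (eps, Fraction(1))]
p_thousandth = prob_next_head(penny, 999)
print(float(p_thousandth))
```

Even so small a prior swamps (1/2)^999, so after 999 heads the probability of heads at the thousandth toss is practically 1, as Keynes argues.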
Ch. XXX The Mathematical use of Statistical Frequencies for the Determination of Probabilities A Posteriori – The Methods of Laplace
1. … I do not myself believe that there is any direct and simple method by which we can make the transition from an observed numerical frequency to a numerical measure of probability. The problem, as I view it, is part of the general problem of founding judgments of probability upon experience, and can only be dealt with by the general methods of induction expounded in Part III. The nature of the problem precludes any other method, and direct mathematical devices can all be shown to depend upon insupportable assumptions. In the next chapters we will consider the applicability of general inductive methods to this problem, and in this we will endeavour to discredit the mathematical charlatanry by which, for a hundred years past, the basis of theoretical statistics has been greatly undermined.
2. … Leibniz … goes to the root of the difficulty. The calculation of probabilities is of the utmost value, he says, but in statistical inquiries there is need not so much of mathematical subtlety as of a precise statement of all the circumstances. The possible contingencies are too numerous to be covered by a finite number of experiments, and exact calculation is, therefore, out of the question. Although nature has her habits, due to the recurrence of causes, they are general, not invariable. Yet empirical calculation, although it is inexact, may be adequate in affairs of practice.
4. … We have seen in the preceding chapter that Bernoulli’s Theorem itself cannot be applied to all kinds of data indiscriminately, but only when certain rather stringent conditions are fulfilled. Corresponding conditions are required equally for the inversion of the theorem, and it cannot possibly be inferred from a statement of the number of trials and the frequency of occurrence merely, that these have been satisfied. We must know, for instance, that the examined instances are similar in the main relevant particulars, both to one another and to the unexamined instances to which we intend our conclusion to be applicable. An unanalysed statement of frequency cannot tell us this.
Keynes is critical of the Rule of Succession, which is due to Laplace and was defended by Pearson, and of the Principle of Indifference upon which it is based.
12. … [Pearson’s challenge is this:] “Those who do not accept the hypothesis of the equal distribution of ignorance and its justification in observation are compelled to produce definite evidence of the clustering of chances, or to drop all application of past experience to the judgment of probable future statistical ratios … .”
13. … The challenge is easily met. … [E]xperience, if it shows anything, shows that there is a very marked clustering of statistical ratios in the neighbourhoods of zero and unity, of those for positive theories and for correlations between positive qualities in the neighbourhood of zero, and of those for negative theories and for correlations between negative qualities in the neighbourhood of unity. Moreover, we are seldom in so complete a state of ignorance regarding the nature of the theory or correlation under investigation as not to know whether or not it is a positive theory or a correlation between positive qualities. In general, therefore, whenever our investigation is a practical one, experience, if it tells us anything, tells us not only that the statistical ratios cluster in the neighbourhood of zero and unity, but in which of these two neighbourhoods the ratio in this particular case is most likely a priori to be found. If we seek to discover what proportion of the population suffer from a certain disease, or have red hair, or are called Jones, it is preposterous to suppose that the proportion is as likely a priori to exceed as to fall short of (say) fifty per cent.
As Professor Pearson applies this method to investigations where it is plain that the qualities involved are positive, he seems to maintain that experience shows that there are as many positive attributes which are shared by more than half of any population as there are which are shared by less than half.
It may be added that the conclusions, which Professor Pearson himself derives from this method, provide a reductio ad absurdum of the arguments upon which they rest. He considers, for example, the following problem : A sample of 100 of a population shows 10 per cent affected with a certain disease. What percentage may be reasonably expected in a second sample of 100 ? By approximation he reaches the conclusion that the percentage of the character in the second sample is as likely to fall inside as outside the limits, 7.85 and 13.71. Apart from the preceding criticisms of the reasoning upon which this depends, it does not seem reasonable upon general grounds that we should be able on so little evidence to reach so certain a conclusion. The argument does not require, for example, that we have any knowledge of the manner in which the samples are chosen, of the positive and negative analogies between the individuals, or indeed anything at all beyond what is given in the above statement.
The method is, in fact, much too powerful. It invests any positive conclusion, which it is employed to support, with far too high a degree of probability. Indeed this is so foolish a theorem that to entertain it is discreditable.
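To see how much the method extracts from so little, the kind of calculation involved can be reconstructed. This is a sketch under stated assumptions, not Pearson’s own approximation: it assumes a uniform prior on the unknown ratio (the ‘equal distribution of ignorance’), which gives a beta-binomial distribution for the second sample. The function names are hypothetical.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def predictive_pmf(successes, trials, next_trials):
    """P(k successes in the next sample), k = 0..next_trials, assuming a uniform prior."""
    a = successes + 1
    b = trials - successes + 1
    return [
        exp(log_comb(next_trials, k) + log_beta(a + k, b + next_trials - k) - log_beta(a, b))
        for k in range(next_trials + 1)
    ]

def quantile(pmf, q):
    """Smallest k whose cumulative probability reaches q."""
    total = 0.0
    for k, p in enumerate(pmf):
        total += p
        if total >= q:
            return k
    return len(pmf) - 1

# Pearson's problem as quoted: 10 affected in a first sample of 100; second sample of 100.
pmf = predictive_pmf(10, 100, 100)
mean = sum(k * p for k, p in enumerate(pmf))
print(mean, quantile(pmf, 0.25), quantile(pmf, 0.75))
```

The inputs are nothing but the two counts. Nothing about how the samples were chosen, or how alike the individuals are, enters the calculation, which is exactly Keynes’ complaint.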
Keynes is critical of the notion that factors should be discounted and distributions should be considered uniform unless one has a specific reason to do otherwise. He goes on, in effect, to suggest that factors should be considered possibly relevant and distributions probably non-uniform, unless one has evidence that they are actually irrelevant and the distributions actually uniform.
Ch. XXXI The Inversion of Bernoulli’s Theorem
1. I CONCLUDE, then, that the application of the mathematical methods, discussed in the preceding chapter, to the general problem of statistical inference is invalid. Our state of knowledge about our material must be positive, not negative, before we can proceed to such definite conclusions as they purport to justify. To apply these methods to material, unanalysed in respect of the circumstances of its origin, and without reference to our general body of knowledge, merely on the basis of arithmetic and of those of the characteristics of our material with which the methods of descriptive statistics are competent to deal, can only lead to error and to delusion.
But I go further than this in my opposition to them. Not only are they the children of loose thinking, and the parents of charlatanry. Even when they are employed by wise and competent hands, I doubt whether they represent the most fruitful form in which to apply technical and mathematical methods to statistical problems, except in a limited class of special cases. …
2. … If we knew that our material could be likened to a game of chance, we might expect to infer chances from frequencies, with the same sort of confidence as that with which we infer frequencies from chances. This part of our inquiry will not be complete, therefore, until we have endeavoured to elucidate the conditions for the validity of an Inversion of Bernoulli’s Theorem.
5. … Nobody supposes that we can measure exactly the probability of an induction. Yet many persons seem to believe that in the weaker and much more difficult type of argument, where the association under examination has been in our experience, not invariable, but merely in a certain proportion, we can attribute a definite measure to our future expectations and can claim practical certainty for the results of predictions which lie within relatively narrow limits. Coolly considered, this is a preposterous claim, which would have been universally rejected long ago, if those who made it had not so successfully concealed themselves from the eyes of common sense in a maze of [pseudo-]mathematics.
6. … The first assumption amounts, in the language of statisticians, to an assumption of random sampling from amongst the A’s. The second assumption corresponds precisely to the similar condition which we discussed fully in connection with inductive generalisation. The instances of A(x) may be the result of random sampling, and yet it may still be the case that there are material circumstances, common to all the examined instances of B(x), yet not covered by the statement A(x)B(x). In so far as these two assumptions are not justified, an element of doubt and vagueness, which is not easily measured, assails the argument. It is an element of doubt precisely similar to that which exists in the case of generalisation. But we are most likely to forget it. For having overcome the difficulties peculiar to correlation, it is, possibly, not unnatural for a statistician to feel as if he had overcome all the difficulties.
In practice, however, our knowledge, in cases of correlation just as in cases of generalisation, will seldom justify the assumption of perfect analogy between the B’s ; and we shall be faced by precisely the same problems of analysing and improving our knowledge of the instances, as in the general case of induction already examined. If B has invariably accompanied A in 100 cases, we have all kinds of difficulties about the exact character of our evidence before we can found on this experience a valid generalisation. If B has accompanied A, not invariably, but only 50 times in the 100 cases, clearly we have just the same kind of difficulties to face, and more too, before we can announce a valid correlation. Out of the mere unanalysed statement that B has accompanied A as often as not in 100 cases, without precise particulars of the cases, or even if there were 1,000,000 cases instead of 100, we can conclude very little indeed.
For Keynes, it is not reasonable to assume that a process is random just because you know nothing about it. In other words, contrary to what Bayesians suppose, it is not always reasonable to treat epistemic uncertainty (lack of knowledge) as if it were aleatoric (due to chance).
Ch. XXXII … The Methods of Lexis
1. No one supposes that a good induction can be arrived at merely by counting cases. The business of strengthening the argument chiefly consists in determining whether the alleged association is stable, when the accompanying conditions are varied. This process of improving the Analogy, as I have called it in Part III., is, both logically and practically, of the essence of the argument.
Now in statistical reasoning (or inductive correlation) that part of the argument, which corresponds to counting the cases in inductive generalisation, may present considerable technical difficulty. This is especially so in the particularly complex cases of what in the next chapter (§ 9) I shall term Quantitative Correlation, which have greatly occupied the attention of English statisticians in recent years. But clearly it would be an error to suppose that, when we have successfully overcome the mathematical or other technical difficulties, we have made any greater progress towards establishing our conclusion than when, in the case of inductive generalisation, we have counted the cases but have not yet analysed or compared the descriptive and non-numerical differences and resemblances. In order to get a good scientific argument we still have to pursue precisely the same scientific methods of experiment, analysis, comparison, and differentiation as are recognised to be necessary to establish any scientific generalisation. These methods are not reducible to a precise mathematical form for the reasons examined in Part III. of this treatise. But that is no reason for ignoring them, or for pretending that the calculation of a probability, which takes into account nothing whatever except the numbers of the instances, is a rational proceeding. …
Generally speaking, therefore, I think that the business of statistical technique ought to be regarded as strictly limited to preparing the numerical aspects of our material in an intelligible form, so as to be ready for the application of the usual inductive methods. Statistical technique tells us how to ‘count the cases’ when we are presented with complex material. It must not proceed also, except in the exceptional case where our evidence furnishes us from the outset with data of a particular kind, to turn its results into probabilities; not, at any rate, if we mean by probability a measure of rational belief.
2. There is, however, one type of technical, statistical investigation not yet discussed, which seems to me to be a valuable aid to inductive correlation. This method consists in breaking up a statistical series, according to appropriate principles, into a number of sub-series, with a view to analysing and measuring, not merely the frequency of a given character over the aggregate series, but the stability of this frequency amongst the sub-series ; that is to say, the series as a whole is divided up by some principle of classification into a set of sub-series, and the fluctuation of the statistical frequency under examination between the various sub-series is then examined. It is, in fact, a technical method of increasing the Analogy between the instances, in the sense given to this process in Part III.
5. …. Suppose that one is endeavouring to establish an inductive correlation, e.g. that the chance of a male birth is m. The conclusion, which we are seeking to establish, takes no account of the place or date of birth or the race of the parents, and assumes that these influences are irrelevant. Now, if we had statistics of birth ratios for all parts of the world throughout the nineteenth century, and added them all up and found that the average frequency of male births was m, we should not be justified in arguing from this that the frequency of male births in England next year is very unlikely to diverge widely from m. For this would involve the unwarranted assumption, in Bortkiewicz’s terminology, that the empirical probability m is elementary for any resolution dependent on time or place, and is not an average probability compounded out of a series of groups, relating to different times or places, to each of which a distinct special probability is applicable. And, in my terminology, it would assume that variations of time and place were irrelevant to the correlation, without any attempt having been made to employ the methods of positive and negative Analogy to establish this.
We must, therefore, break up our statistical material into groups by date, place, and any other characteristic which our generalisation proposes to treat as irrelevant. By this means we shall obtain a number of frequencies m1′, m2′, m3′, … m1″, m2″, m3″, … etc., which are distributed round the average frequency m. For simplicity let us consider the series of frequencies m1′, m2′, m3′, … obtained by breaking up our material according to the date of the birth. If the observed divergences of these frequencies from their mean are not significant, we have the beginnings of an inductive argument for regarding date as being in this connection irrelevant.
6. … It may be the case that some of the sub-frequencies show such wide and discordant variations from the mean as to suggest that some significant Analogy has been overlooked. In this event the lack of symmetry, which characterises the oscillations, may be taken to indicate that some of the sub-groups are subject to a relevant influence, of which we must take account in our generalisation, to which some of the other sub-groups are not subject.
… undue regularity is as fatal to the assumption of Bernoullian conditions as is undue dispersion.
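The check described here, comparing the scatter of the sub-series frequencies with what Bernoullian conditions would produce, is commonly summarised as a dispersion ratio. The sketch below is one simple version, assuming equal-sized sub-series. The function name and the example counts are hypothetical, not taken from Keynes or Lexis.

```python
from math import sqrt

def lexis_ratio(successes, group_size):
    """Observed standard deviation of sub-series frequencies over the binomial one.

    `successes` holds the count of the character in each sub-series, all of
    size `group_size`. Values near 1 are consistent with Bernoullian
    conditions; well above 1 is supra-normal dispersion; well below 1 is
    sub-normal dispersion.
    """
    k = len(successes)
    if k < 2:
        raise ValueError("need at least two sub-series")
    freqs = [s / group_size for s in successes]
    mean = sum(freqs) / k
    observed_var = sum((f - mean) ** 2 for f in freqs) / (k - 1)
    binomial_var = mean * (1 - mean) / group_size
    if binomial_var == 0:
        raise ValueError("overall frequency is 0 or 1; the ratio is undefined")
    return sqrt(observed_var / binomial_var)

# Made-up counts out of 100 in each of four sub-series.
too_regular = lexis_ratio([50, 50, 50, 50], 100)  # no fluctuation at all
too_scattered = lexis_ratio([10, 90, 10, 90], 100)  # wild fluctuation
print(too_regular, too_scattered)
```

Both made-up series have the same aggregate frequency of one half, yet neither is compatible with a single stable chance: the first is too regular and the second too dispersed.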
Keynes (7.) describes Lexis’ concept of supra- and sub-normal dispersion, relative to what would be expected in the homogeneous case. Sub-normal dispersion indicates ‘connexité’, organic connection between terms. Using poetry as an example, Keynes notes that:
… Lexis is wrong if he supposes that a super-normal dispersion cannot also arise out of connexité or organic connection between the successive terms. It might have been the case that the appearance of a dactyl in one foot increased the probability of another dactyl in that line. …
Supranormal dispersion increases the residual risk after aggregation. Consequently:
9. … [T]he actuary does not like an undue proportion of his cases to be drawn from a group which may be subject to a common relevant influence for which he has not allowed. If the a priori calculations are based on the average over a field which is not homogeneous in all its parts, greater stability of result will be obtained if the instances are drawn from all parts of the non-homogeneous total field, than if they are drawn now from one homogeneous sub-field and now from another. This is not at all paradoxical… .
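The actuary’s preference can be illustrated by comparing variances. This is a sketch under assumptions of my own: a field made up of two equally likely homogeneous sub-fields with claim chances of 1/10 and 3/10, and 100 cases. The function names are hypothetical.

```python
from fractions import Fraction

def variance_spread_over_field(n, chances):
    """Variance of the claim count when each case comes from a sub-field picked at random."""
    p_bar = sum(chances) / len(chances)
    return n * p_bar * (1 - p_bar)

def variance_one_subfield(n, chances):
    """Variance of the claim count when all n cases come from one randomly picked sub-field."""
    k = len(chances)
    p_bar = sum(chances) / k
    within = sum(n * p * (1 - p) for p in chances) / k          # chance variation inside a sub-field
    between = sum((n * p - n * p_bar) ** 2 for p in chances) / k  # variation due to which sub-field
    return within + between

# ASSUMED figures for illustration only.
chances = [Fraction(1, 10), Fraction(3, 10)]
print(variance_spread_over_field(100, chances))  # prints: 16
print(variance_one_subfield(100, chances))       # prints: 115
```

The expected number of claims is 20 either way, but drawing every case from a single sub-field leaves the common influence undiversified, so the result is far less stable.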
Keynes follows Lexis, and the scientific approach, in seeking to identify all significant relevant factors. He advocates checking the sample variability for consistency with the assumption of homogeneity and stability. [Scientists might also advocate always seeking to identify further factors, rather than resting on Lexis’ check.]
Ch. XXXIII Outline of a Constructive Theory
Next, Keynes gives his constructive theory. In essence, he supposes that the world really does satisfy the assumptions identified above. (We may not agree.)
In Keynes’ later work on economics, his critique of conventional statistical inference, above, seems much more important than his constructive theory, which would otherwise seem to form the ‘findings’ of his Treatise.