Mayo & Spanos Error Statistics

Deborah G. Mayo and Aris Spanos, “Error Statistics”, in Philosophy of Statistics, Handbook of the Philosophy of Science, Volume 7, 2011 (General editors: Dov M. Gabbay, Paul Thagard and John Woods; Volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster). Elsevier: 1-46.

Approach

Error statistics is a completely different approach from the Bayesian one. It concerns the probability that a test would have falsified a hypothesis if it were false, not the probability that a hypothesis is true. The theory develops as follows:

Error probabilities are computed from the distribution of d(X), the sampling distribution, evaluated under various hypothesized values of θ.
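
Comment: As a minimal sketch of what this means in practice, assume a one-sided test of a normal mean with known σ, where d(X) = √n(X̄ − μ0)/σ and the test rejects when d(X) exceeds a cut-off. These specifics, the function name and the numbers are my own illustrative assumptions, not taken from the paper. The probability of rejection can then be evaluated under any hypothesized value of the mean:

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # d(X) has unit variance under every mu


def rejection_probability(mu, mu0, sigma, n, alpha=0.05):
    """P(d(X) > cutoff; mu) for the assumed one-sided z-test of mu <= mu0."""
    cutoff = STANDARD_NORMAL.inv_cdf(1 - alpha)  # fixes the type I error at alpha
    shift = sqrt(n) * (mu - mu0) / sigma         # mean of d(X) when the true mean is mu
    return 1 - STANDARD_NORMAL.cdf(cutoff - shift)


if __name__ == "__main__":
    # Hypothetical numbers: mu0 = 0, sigma = 2, n = 100.
    for mu in (0.0, 0.2, 0.4, 0.6):
        p = rejection_probability(mu, 0.0, 2.0, 100)
        print(f"mu = {mu:.1f}: P(reject) = {p:.3f}")
```

At mu = mu0 this gives the type I error probability; at larger values of mu it gives the power of the test.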

Severity Principle (weak). Data x0 (produced by process G) do not provide good evidence for hypothesis H if x0 results from a test procedure with a very low probability or capacity of having uncovered the falsity of H, even if H is incorrect.

Severity Principle (full). Data x0 (produced by process G) provides good evidence for hypothesis H (just) to the extent that test T severely passes H with x0.

Example

Suppose we are testing whether and how much weight George has gained between now and the time he left for Paris, and do so by checking if any difference shows up on a series of well-calibrated and stable weighing methods, both before his leaving and upon his return. If no change on any of these scales is registered, even though, say, they easily detect a difference when he lifts a .1-pound potato, then this may be regarded as grounds for inferring that George’s weight gain is negligible within limits set by the sensitivity of the scales.…

A behavioristic rationale might go as follows: If one always follows the rule going from failure to detect a weight gain after stringent probing to inferring weight gain no greater than δ, then one would rarely be wrong in the long run of repetitions. While true, this is not the rationale we give in making inferences about George.

We may describe this as the notion that the long run error probability ‘rubs off’ on each application. What we wish to sustain is this kind of counterfactual statistical claim: that were George to have gained more than δ pounds, at least one of the scales would have registered an increase. This is an example of what philosophers often call an argument from coincidence: it would be a preposterous coincidence if all the scales easily registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading us when applied to an object of unknown weight.

Comment: The notion that this particular finding is reliable, rather than that the method is reliable ‘on average’, seems to me crucial. For example, a study might find that a medicine is effective with few side-effects for the population as a whole, from which it is inferred that the medicine will probably be good for me. But that claim about me has not been ‘severely tested’.

On the other hand, being a little pedantic, I am suspicious of the George example:

• Most people’s weight fluctuates measurably throughout the day, generally being greater just after they have eaten or drunk. Hence George weighing just the same seems an unlikely coincidence.
• If George knew he was going to be weighed, might he not have cheated, perhaps by varying the number of coins in his pocket to equalise the weight?

In the kind of survey results that the authors envisage this may not be a problem, but it does seem that the notion of severity is not completely general.

Severity

Severity is characterised by:

Passing a Severe Test.

We can encapsulate this as follows:

A hypothesis H passes a severe test T with data x0 if

• (S-1) x0 accords with H, (for a suitable notion of accordance) and
• (S-2) with very high probability, test T would have produced a result that accords less well with H than x0 does, if H were false or incorrect.

Equivalently, (S-2) can be stated:

• (S-2)*: with very low probability, test T would have produced a result that accords as well as or better with H than x0 does, if H were false or incorrect.
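
Comment: For a concrete reading of (S-2), here is a minimal sketch under assumptions of my own: a one-sided test of a normal mean with known σ, where the observed sample mean is taken to accord with the claim μ > μ1, and ‘accords less well’ is taken to mean a smaller sample mean. The severity of the claim is then the probability of a sample mean no larger than the observed one, computed as if μ were equal to μ1. The function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater_than(mu1, xbar_obs, sigma, n):
    """Severity of the claim 'mu > mu1' given an observed sample mean.

    Assumed reading of (S-2): P(sample mean <= xbar_obs; mu = mu1), i.e. the
    probability of a result according less well with the claim, were it
    (just) false.
    """
    standard_error = sigma / sqrt(n)
    return NormalDist(mu1, standard_error).cdf(xbar_obs)


if __name__ == "__main__":
    # Hypothetical numbers: sigma = 2, n = 100, observed mean 0.4.
    for mu1 in (0.0, 0.2, 0.4, 0.6):
        sev = severity_mu_greater_than(mu1, 0.4, 2.0, 100)
        print(f"claim mu > {mu1:.1f}: severity = {sev:.3f}")
```

The same data pass weak claims (small μ1) severely and strong claims (μ1 near or above the observed mean) poorly, which is the sense in which severity is attached to a particular inference rather than to the test as a whole.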

Comment: The probabilities in S-2 depend on the sampling distribution, which is assumed known in the cases of interest to the authors. But otherwise, as with George, it might not be.

Likelihoods

The paper is critical of the likelihood principle (LP: that only the likelihood matters).

Maximally Likely alternatives. H0 might be that a coin is fair, and x0 the result of n flips of the coin. For each of the 2^n possible outcomes there is a hypothesis H∗i that makes the data xi maximally likely. For an extreme case, H∗i can assert that the probability of heads is 1 just on those tosses that yield heads, 0 otherwise. For any xi, P(xi; H0) is very low and P(xi; H∗i) is high: one need only choose for (a) the statistical hypothesis that renders the data maximally likely, i.e., H∗i. So the fair coin hypothesis is always rejected in favour of H∗i, even when the coin is fair. This violates the severity requirement since it is guaranteed to infer evidence of discrepancy from the null hypothesis even if it is true. The severity of ‘passing’ H∗i is minimal or 0.
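
Comment: The guarantee can be checked with a small simulation. This is only a sketch: the function names, the seed and the decision rule ‘prefer whichever hypothesis has the higher likelihood’ are my own illustrative assumptions. Every sequence is generated by a fair coin, yet the tailored hypothesis always wins.

```python
import random


def tailored_hypothesis(flips):
    """Build H* after seeing the data: P(heads) = 1 on tosses that were heads, else 0."""
    return [1.0 if heads else 0.0 for heads in flips]


def likelihood(flips, heads_probs):
    """Probability of the observed sequence given a per-toss probability of heads."""
    result = 1.0
    for heads, p in zip(flips, heads_probs):
        result *= p if heads else 1.0 - p
    return result


def rate_fair_coin_rejected(n_flips, n_trials, seed=0):
    """Fraction of fair-coin samples on which H* is more likely than H0."""
    rng = random.Random(seed)
    fair = [0.5] * n_flips
    rejected = 0
    for _ in range(n_trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        if likelihood(flips, tailored_hypothesis(flips)) > likelihood(flips, fair):
            rejected += 1
    return rejected / n_trials


if __name__ == "__main__":
    print(rate_fair_coin_rejected(n_flips=10, n_trials=1000))
```

The likelihood under H∗i is 1 for whatever was observed, against 0.5 to the power n under H0, so the procedure had no chance of ever finding in favour of the fair coin.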

Comment. This is true, but isn’t the likelihood principle about S-1, not S-2? The expected likelihood function peaks at the true value, but the sample likelihood function has some additional ‘noise’, so the peak may be offset. In assessing the bias (or otherwise) of a coin one should take account not only where the likelihood has its peak but how broad that peak is.

Alternatively, maybe the force of this example is that alternative methods just seek to identify the most likely value of some parameter, whereas in this case the authors want to give the hypothesis that the coin is fair a special role. I imagine that either approach can be justified, for different cases.

The paper continues:

Holding the LP runs counter to distinguishing data on grounds of error probabilities of procedures. “According to Bayes’s theorem, P(x|μ)…constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|μ) and P(y|μ) are proportional functions of μ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of μ. . . ”. [Savage, 1962, p. 17]

Comment: One needs to interpret Savage with care. He is saying that the estimates of μ should be the same in both cases. He is not saying that the amount of data considered is irrelevant.

The paper continues:

The holder of the LP considers the likelihood of the actual outcome, i.e., just d(x0), whereas the error statistician needs to consider, in addition, the sampling distribution of d(X) or other statistic being used in inference. In other words, an error statistician could use likelihoods in arriving at (S-1) the condition of accordance or fit with the data, but (S-2) additionally requires considering the probability of outcomes x that accord less well with a hypotheses of interest H, were H false.

Comment: It is not sufficient to consider just the likelihoods if one wants to consider such things as the probability that an experimental test would have falsified a hypothesis if false.
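
A standard textbook illustration of the difference (not taken from the paper) is a coin that shows 9 heads and 3 tails. If the 12 flips were fixed in advance the data are binomial; if flipping stopped at the third tail they are negative binomial. The sketch below, with hypothetical function names and p standing for the probability of heads, shows that the two likelihoods are constant multiples of each other while the tail probabilities under a fair coin differ.

```python
from math import comb


def binomial_likelihood(p, heads=9, tails=3):
    """Likelihood when the number of flips (heads + tails) was fixed in advance."""
    return comb(heads + tails, heads) * p**heads * (1 - p)**tails


def negative_binomial_likelihood(p, heads=9, tails=3):
    """Likelihood when flipping stopped at the final tail, so the last flip is a tail."""
    return comb(heads + tails - 1, heads) * p**heads * (1 - p)**tails


def binomial_tail(heads=9, tails=3):
    """P(at least `heads` heads in heads + tails flips; fair coin)."""
    n = heads + tails
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2**n


def negative_binomial_tail(heads=9, tails=3):
    """P(at least `heads` heads before the `tails`-th tail; fair coin).

    Equivalent to at most tails - 1 tails among the first heads + tails - 1 flips.
    """
    m = heads + tails - 1
    return sum(comb(m, t) for t in range(tails)) / 2**m


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.75):
        ratio = binomial_likelihood(p) / negative_binomial_likelihood(p)
        print(f"p = {p}: likelihood ratio = {ratio:.3f}")
    print(f"binomial tail probability:          {binomial_tail():.4f}")
    print(f"negative binomial tail probability: {negative_binomial_tail():.4f}")
```

On the LP the two experiments say the same thing about p; for the error statistician the differing sampling distributions make them different tests.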

Comment

The paper conflates two things:

1. The notion of severe testing, which shows how meaningful the failure to falsify a hypothesis is.
2. The assumption that there will be some natural ‘null hypothesis’ that should be given a special standing.

The first point is important. The second is important when true. For example, in medicine the ‘null effect’ hypothesis seems natural. But in other settings, such as measuring the strength of a new material, the Bayesian approach seems more natural, and one is interested in the probable error (or similar) of the estimate, not its falsifiability. It seems important, then, to be aware of the underlying theory and apply it as appropriate, rather than seek a universal method.

A supplement to severe testing would be to give a set of hypotheses, each of which could with some significant probability have given rise to data that accord with H at least as well as the actual data do, and which taken together are representative of the maximal set of such hypotheses. For example, one would expect that for a coin that was actually fair, this set would nearly always include the hypothesis of a fair coin. (Variations are possible.)
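
One possible reading of this for the coin, offered only as a sketch: take H to be ‘the coin is fair’, measure accordance by how close the number of heads is to n/2, and keep every candidate bias p under which data according with H at least as well as the observed data would arise with probability above some threshold. The grid, the threshold and the function names are all my own assumptions.

```python
from math import comb


def prob_accords_at_least_as_well(p, n, heads_obs):
    """P(|K - n/2| <= |heads_obs - n/2|; bias p), with K ~ Binomial(n, p)."""
    distance_obs = abs(heads_obs - n / 2)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - n / 2) <= distance_obs
    )


def compatible_biases(n, heads_obs, threshold=0.05, grid_points=101):
    """Grid of biases that could plausibly have produced data this close to fair."""
    grid = [i / (grid_points - 1) for i in range(grid_points)]
    return [
        p for p in grid
        if prob_accords_at_least_as_well(p, n, heads_obs) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical data: 11 heads in 20 flips.
    biases = compatible_biases(n=20, heads_obs=11)
    print(f"compatible biases range from {min(biases):.2f} to {max(biases):.2f}")
```

The width of the resulting set indicates how severely the fair-coin hypothesis has been probed: a wide set means many biased coins would have passed just as well.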

The notion of ‘severe testing’ could also do with generalising beyond the scope of this paper.

Dave Marsay