‘Understanding Uncertainty’ has a blog (‘uu blog’) on ESP and significance. The challenge for those not believing in ESP is an experiment which seems to show ‘statistically significant’ but mild ESP. This could be like a drug company that tests lots of drugs until it gets a ‘statistically significant’ result, but from the account it seems more significant than this.

The problem for an ESP atheist who is also a Bayesian is in trying to interpret the result of a significance test as a (subjective) probability that some ESP was present, as the above blog discusses. But from a sequential testing point of view (e.g. of Wald) we would simply take the significance as a threshold which stimulates us to test the conclusion. In typical science one would repeat the experiment and regard it as significant if the result was not repeated. But with ESP the ‘aura’ of the experimenter or place may be significant, so a failure by others to replicate a result may simply mean that only sometimes is ESP shown in the experimental set-up. So what is a ‘reasonable’ acceptance criterion?

Jack Good discussed the issues arising from ESP in some detail, including those above. He developed the notion of ‘weight of evidence’, which is the log of the appropriate likelihood ratio. There are some technical differences to the approach of the ‘uu blog’. They offer some advantages.

If e is the evidence/data obtained from an experiment and h is a hypothesis (e.g. the null hypothesis) then P(e|h) denotes the likelihood, where P() is the (Bayesian) probability. To be well-defined the likelihood should be entailed by the hypothesis.

One problem is that the likelihood depends on the granularity with which we measure the data, and so – on its own – is meaningless. In significance testing one defines E(e) to be the set of all data that is at least as ‘extreme’ as e, and uses the likelihood P(E(e)|h) to determine ‘1-significance’. But (as in ‘uu blog’) what one really wants is P(¬h|e).

In this experiment one is not comparing one theory or model with another, but a statistical ‘null hypothesis’ with its complement, which is very imprecise, so that it is not clear what the appropriate likelihood is. ‘uu blog’ describes the Bayesian approach, of having prior distributions as to how great an ESP effect might be, if there is one. To me this is rather like estimating how many angels one could get on a pin-head. An alternative is to use Jack Good’s ‘generalized likelihood’. In principal one considers all possible theories and takes the likelihood of the one that best explains the evidence. This is then used to form a likelihood ratio, as in ‘uu blog’, or the log likelihood is used as a ‘weight of evidence, as at Bletchley Park. In this ESP case one might consider subjects to have some probability of guessing correctly, varying the probability to get the best likelihood. (This seems to be about 52% as against the 50% of the null hypothesis.) Because the alternative to the null hypothesis includes biases that are arbitrarily close to the null hypothesis, one will ‘almost always’ find some positive or negative ESP effect. The interesting thing would be to consider the distribution of such apparent effects for the null hypothesis, and hence judge the significance of a result of 52%.

This seems a reasonable thing to do, even though there may be many hypotheses that we haven’t considered and so our test is quite weak. It is up to those claiming ESP to put forward hypotheses for testing.

A difficulty of the above procedure is that investigators and journals only tend to report positive results (‘uu blog’ hints at this). According to Bayesians one should estimate how many similar experiments have been done first and then accept ESP as ‘probable’ if a result appears sufficiently significant. I’m afraid I would rather work the other way: assess how many experiments there would have to be to make an apparently significant result really significant, and then judge whether it was credible that so many experiments had been done. Even if not, I would remain rather cynical unless and until the experiment could be refined to give a more definite and repeatable effect. Am I unscientific?

Dave Marsay