Pearl’s Simpson’s Paradox
Judea Pearl, Simpson’s Paradox: An Anatomy, Technical Report R-264, April 1999.
To me, this is the key to Judea’s rightly well-regarded book. It may be thought of as having two sides: a summary and critique of what went before and a candidate positive theory. The positive theory relies on one having definite knowledge of lack of causality, as when a person’s gender cannot be caused by taking aspirin. My focus here is on the critique.
Tale of a Non-Paradox
In Simpson’s Paradox one has C, E, F such that:
P(E|C) > P(E|¬C) yet
P(E|C,F) < P(E|¬C,F)
P(E|C,¬F) < P(E|¬C,¬F).
If C is regarded as a ‘cause’, E as an ‘effect’ and F as some other ‘factor’, then this says that the cause reduces the probability of the effect for either value of the ‘confounding’ factor, and hence (intuitively) one would expect the cause to reduce the probability of the effect for the combined situation (F or ¬F). But in the combined situation the probability of the effect is increased, reversing the inequality. The ‘paradox’ is simply that probabilistic logic conflicts with (intuitive) causal logic: ‘correlation is not causality’.
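The reversal is easy to exhibit with a small table of counts. The following sketch uses illustrative figures of the kind given in Pearl’s paper (a drug C, recovery E, and gender F); the counts themselves are for illustration only:

```python
from fractions import Fraction

# Illustrative counts: (recovered, total) for each (treatment, subgroup) cell.
data = {
    ("C", "F"):   (18, 30),   # drug, F
    ("C", "~F"):  (2, 10),    # drug, not-F
    ("~C", "F"):  (7, 10),    # no drug, F
    ("~C", "~F"): (9, 30),    # no drug, not-F
}

def p(treatment, group=None):
    """P(E|treatment) or P(E|treatment, group) as an exact proportion."""
    cells = [(t, g) for (t, g) in data if t == treatment and group in (None, g)]
    recovered = sum(data[c][0] for c in cells)
    total = sum(data[c][1] for c in cells)
    return Fraction(recovered, total)

# The 'cause' reduces the 'effect' in each subgroup...
assert p("C", "F") < p("~C", "F")       # 18/30 < 7/10
assert p("C", "~F") < p("~C", "~F")     # 2/10 < 9/30
# ...yet the inequality reverses when the subgroups are pooled.
assert p("C") > p("~C")                 # 20/40 > 16/40
```

The pooled proportions simply weight each subgroup by how often it occurs within each treatment arm, and here those weights differ sharply between arms, which is what drives the reversal.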
The ‘standard’ work-around is to (try to) identify all confounding factors, derive probabilities for them, and then average. (That is, one takes:
P(E|x) = P(E|x,F).P(F)+P(E|x,¬F).P(¬F).
But then one is no longer dealing with probability as such, but with a causal logic, probability being confined to the estimation of causality where there are (assumed to be) no confounding factors.)
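The averaging work-around can be sketched directly. Using the same illustrative subgroup figures as above (hypothetical numbers, not data from the paper), one weights each subgroup by the marginal P(F) rather than by P(F|x):

```python
from fractions import Fraction

# Hypothetical subgroup proportions:
#   P(E|C,F)=0.6, P(E|C,~F)=0.2, P(E|~C,F)=0.7, P(E|~C,~F)=0.3,
# with the confounding factor equally common overall: P(F)=P(~F)=1/2.
p_e = {("C", "F"): Fraction(3, 5), ("C", "~F"): Fraction(1, 5),
       ("~C", "F"): Fraction(7, 10), ("~C", "~F"): Fraction(3, 10)}
p_f = Fraction(1, 2)

def adjusted(x):
    # P(E|x) = P(E|x,F).P(F) + P(E|x,~F).P(~F)
    return p_e[(x, "F")] * p_f + p_e[(x, "~F")] * (1 - p_f)

# Averaging over the marginal P(F) restores the subgroup direction:
# the 'cause' now reduces the 'effect' in the combined population.
assert adjusted("C") < adjusted("~C")   # 2/5 < 1/2
```

Note that the formula is only a sensible thing to compute if F is genuinely the confounding factor, which is exactly the causal assumption the text is pointing at.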
A Tale of Statistical Agony
From a traditional statistical perspective causality is a mental construct and so cannot be the ’cause’ of the apparent paradox, since the effect really does show up in the data. The paradox is thus resolved by prohibiting any arguments based on notions of causality.
Causality versus Exchangeability
[There] is no statistical criterion that would warn the investigator against drawing the wrong conclusions or would indicate which table indicates the correct answer.
(The troubling inequalities are conventionally put into tables.)
De Finetti had proposed a principle of exchangeability, under which the joint probability distribution over factors is invariant under permutation. Pearl (and others) consider this to be non-statistical and only comprehensible if underpinned by causal notions. The justification for this view is given in terms of ‘experimental studies’.
Pearl argues that exchangeability is a circumlocution for causality, which only came about because causality was not then ‘well-defined’. He seeks to remedy this.
A Paradox Resolved: Or, what kind of machine is man?
To resolve the paradox we must either (a) show that our causal intuition is misleading or incoherent or (b) deny … that causal relationships are governed by the laws of standard probability calculus.
Judea introduces a do() operator. Thus P(A|B) means ‘the probability of A given B’, while P(A|do(B)) means ‘the probability of A given that we do B’. (The difference is that do(B) is under our control and typically ‘under experimental conditions’, whereas a ‘given’ B is not, and any mechanism may be misunderstood or unknown.)
Theorem 1: (The Sure-Thing Principle)
An action C that increases the probability of an event E in each subpopulation must also increase the probability of E in the population as a whole, provided that the action does not change the distribution of the populations.
(I.e., if, for each confounding factor F, P(E|do(C),F) > P(E|do(¬C),F) and P(F|do(C)) = P(F), then P(E|do(C)) > P(E|do(¬C)).)
This differs from Savage’s sure-thing principle in being conditional. It is not assumed that the process is either randomized (P(E|do(C))=P(E|C)) or balanced (P(C|F)=P(C|¬F)).
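The theorem’s proof is just the law of total probability, applied under the proviso that do(·) leaves the distribution of F alone. A minimal sketch, with hypothetical interventional probabilities (all figures illustrative):

```python
from fractions import Fraction

# Hypothetical interventional probabilities P(E|do(x),F), P(E|do(x),~F),
# together with P(F), assumed unchanged by the action: P(F|do(x)) = P(F).
p_f = Fraction(1, 4)
p_e_do = {("C", "F"): Fraction(4, 5), ("C", "~F"): Fraction(1, 2),
          ("~C", "F"): Fraction(3, 5), ("~C", "~F"): Fraction(2, 5)}

def p_e(action):
    # P(E|do(x)) = P(E|do(x),F).P(F) + P(E|do(x),~F).P(~F),
    # valid here precisely because do(x) does not move P(F).
    return p_e_do[(action, "F")] * p_f + p_e_do[(action, "~F")] * (1 - p_f)

# Premise: do(C) beats do(~C) in each subpopulation...
assert p_e_do[("C", "F")] > p_e_do[("~C", "F")]
assert p_e_do[("C", "~F")] > p_e_do[("~C", "~F")]
# ...conclusion: it beats it in the population as a whole.
assert p_e("C") > p_e("~C")
```

The mixture weight p_f is the same for both actions, so the pooled values cannot reverse; it is exactly the action-dependent weights P(F|C) ≠ P(F|¬C) of the observational case that make reversal possible there.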
Despite the admonitions of statisticians, people tend to think in terms of causal relations. (And so one should deal with this explicitly, as Pearl does.)
The positive theory presupposes that one has definite knowledge of lack of influence and is able to produce acyclic causal models. This may be appropriate for the controlled experiments that are Judea’s main concern, but my own experience has been very different. Either conventional approaches, combined with some common sense and experience, worked, or one had no grounds to apply Judea’s approach with confidence. Possibly this is because his work is so well known that it is in effect ‘the state of the art’, so that people no longer make the kinds of mistakes that he is trying to correct. But it seems to me that some problems remain, for which his critique is still relevant. Medicine and economics, for example, seem too complex for his methods, yet his paper still seems worth reading.
The crux of Pearl’s approach is that causal modelling is needed to underpin exchangeability, which he regards as extra-statistical. But from a logical perspective if one lacks well-founded causal models one could at least – in principle – test directly for exchangeability, or at least for the presence of confounding factors. (Although I have reservations, they are less than those for causality.)
Pearl clearly thinks of his sure-thing principle as something that one applies in advance, knowing that do(C) does not affect the distribution of F. But there seems no reason why one shouldn’t test for the effect of do(C) (statistically) and apply the theory retrospectively. There is also a subtlety in the textual version of the theorem. Suppose that if any given doctor over-prescribed antibiotics this would benefit his patients, but that if all doctors over-prescribed there would be a disbenefit: a ‘tragedy of the commons’. Thus we need to distinguish between do(C) and do(C|F), meaning do C restricted to F but not elsewhere. In particular, one cannot use Theorem 1 to extrapolate beyond the subject population. It is thus not as widely applicable as one might think.
Pearl briefly refers to the work of Good, who uses a notation P(A|B:C) to show that a conditional probability P(A|B) depends on the context, C. In this notation one has:
P(E|C:T) > P(E|¬C:T) (where T is the total context) yet one also has the reverse:
P(E|C:F) < P(E|¬C:F)
P(E|C:¬F) < P(E|¬C:¬F).
Good regards the axioms of probability theory as holding within a context, and the notation draws our attention to the fact that here we need to work across contexts.
If we regard F as a sub-context of T then the proof of theorem 1 hinges on the standard Bayesian equality
P(E|X:T) = P(E|X:F).P(F:T)+P(E|X:¬F).P(¬F:T) (B)
To apply this, as Pearl emphasises, it is vital that F forms a ‘context’ that is not altered by doing X. Thus one way of seeing the ‘paradox’ is that when people are told about P(A|B) they assume that it is in an appropriate context and in this sense is ‘causal’. But if the effect of B on A is indirect then ‘doing B’ may create a different context from B having happened for other reasons. Thus Theorem 1 seems to assume some causal model.
It should also be noted that:
- P(F:T)+P(¬F:T)=1, so if we are given P(E|X:T), P(E|X:F) and P(E|X:¬F) then we can use the above equation to solve for P(F:T).
- From (B), intuitions about P(E|X:T) correspond to intuitions about P(F:T), such as that it is ‘balanced’ (in this case, roughly 0.5).
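The first bullet is a one-line piece of algebra: (B) is linear in P(F:T), so given the three conditional values one can solve for the weight. A small sketch with illustrative figures:

```python
from fractions import Fraction

def p_f_from_b(p_t, p_in_f, p_in_not_f):
    """Solve (B) for P(F:T), where
    p_t = p_in_f * P(F:T) + p_in_not_f * (1 - P(F:T))."""
    return (p_t - p_in_not_f) / (p_in_f - p_in_not_f)

# Illustrative values: P(E|X:F)=0.6, P(E|X:~F)=0.2, P(E|X:T)=0.5.
w = p_f_from_b(Fraction(1, 2), Fraction(3, 5), Fraction(1, 5))
assert w == Fraction(3, 4)

# Sanity check: substituting back into (B) recovers P(E|X:T).
assert Fraction(3, 5) * w + Fraction(1, 5) * (1 - w) == Fraction(1, 2)
```

In other words, a stated P(E|X:T) implicitly commits one to a particular P(F:T), which is the point of the second bullet.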
The Principle of Indifference
In Theorem 1 P(F:T) is unknown. By the principle of indifference, if we are not given any P( | :T) then we ‘should’ assume P(F:T)=P(¬F:T)=0.5, so
P(E|X:T) = (P(E|X:F)+P(E|X:¬F))/2.
Thus we ‘should’ expect P(E|C) < P(E|¬C), and if not we should suspect that either P(F) or P(¬F) is small (as they are in the examples).
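This can be checked with the earlier illustrative subgroup figures (hypothetical numbers): under indifference the subgroup direction survives pooling, and a reversal forces the weights away from 0.5 in at least one arm.

```python
from fractions import Fraction

# Hypothetical subgroup figures: P(E|C:F)=0.6, P(E|C:~F)=0.2,
# P(E|~C:F)=0.7, P(E|~C:~F)=0.3.
p = {("C", "F"): Fraction(3, 5), ("C", "~F"): Fraction(1, 5),
     ("~C", "F"): Fraction(7, 10), ("~C", "~F"): Fraction(3, 10)}

def indifference(x):
    # P(E|x:T) = (P(E|x:F) + P(E|x:~F)) / 2 under P(F:T)=P(~F:T)=1/2.
    return (p[(x, "F")] + p[(x, "~F")]) / 2

# Under indifference, no reversal: the 'cause' still reduces the 'effect'.
assert indifference("C") < indifference("~C")   # 2/5 < 1/2

# A reversal needs skewed weights, e.g. F has weight 3/4 in the C arm
# but only 1/4 in the ~C arm:
p_c = p[("C", "F")] * Fraction(3, 4) + p[("C", "~F")] * Fraction(1, 4)
p_not_c = p[("~C", "F")] * Fraction(1, 4) + p[("~C", "~F")] * Fraction(3, 4)
assert p_c > p_not_c    # 1/2 > 2/5: Simpson reversal
```

So an observed reversal is itself evidence that the factor F is far from ‘balanced’ across the two arms.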
Resolving the paradox
Interpretation as proportion
The notation P(A|B) suggests, by default, context independence. But even P(A|B:C) is by no means straightforward to interpret. The mathematical approach to probability has axioms that are satisfied by any suitable measure, including proportions Pn(A|B:C): the proportion of those entities in context C that satisfy B that also satisfy A. Such a measure is independent of any notion of random sampling, double-blind trials, etc. With this interpretation Simpson’s paradox may seem less surprising.
Boole considers constraints on sets of possible probability assignments, rather than seeking precise probabilities.
From (B), if – for example – P(E|X:F) < P(E|X:¬F) then
P(E|X:F) <= P(E|X:T) <= P(E|X:¬F).
Thus if P(E|X:F) << P(E|X:¬F) then P(E|X:T) is relatively unconstrained, so the possibility of Simpson’s ‘reversal’ is perhaps not surprising. Maybe it is thinking of probabilities as always precise, disregarding their uncertainty, that is the problem. In Simpson’s paradox one might think in terms of illustrative mid-point precise probabilities, but be aware of the imprecision. Then the reversal would not be ‘expected’, but neither would it be surprising.
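The Boole-style constraint view is easy to make concrete: (B) says that P(E|X:T) is a convex mixture of the two subgroup values, so with P(F:T) unknown it can sit anywhere between them. A sketch with illustrative endpoints:

```python
from fractions import Fraction

# Illustrative subgroup values with a wide gap: P(E|X:F) << P(E|X:~F).
lo, hi = Fraction(1, 10), Fraction(9, 10)

# Sweep the unknown weight P(F:T) over [0, 1] in steps of 1/10;
# each mixture is a value (B) permits for the pooled P(E|X:T).
mixtures = [lo * Fraction(k, 10) + hi * (1 - Fraction(k, 10))
            for k in range(11)]

# The pooled value is constrained only to the interval [lo, hi]:
assert min(mixtures) == lo
assert max(mixtures) == hi
```

The wider the gap between the subgroup values, the less (B) constrains the pooled figure, so a ‘reversal’ is unsurprising unless P(F:T) is pinned down.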
Pearl provides no justification for the view that probability is just a number, but rather takes this as an assumption. He provides some important material, illustrating his causal approach as an important step beyond the standard approach, being both more powerful and less liable to misunderstanding. It may be that this is the best approach where:
- One has sufficient control over the situation, as in an idealised experiment, for the assumptions to be reasonable.
- In particular, one can be reasonably sure that one has identified all relevant factors, or at least is able to protect against indirect confounding factors.
- Considering precise probabilities is justified and adequate.
But otherwise I am unconvinced that the approach is always the best, or reliable. In particular, it may be preferable to caveat probability estimates, or to use some less misleading concept such as proportion or imprecise probability, thus acknowledging uncertainty.