# Possible Probabilities

**Initial Draft**

This is my living attempt to rationalise the confusing mess of current theories, which seem to me to assume too much to be applicable, or too little to justify their conclusions, or simply to be inaccessible or confusing.

## Aim

Here we seek a reference theory, one that does not rely on contentious assumptions and yet is not too complicated or incomprehensible. It could be used directly, or as something to be referred to in developing special cases for particular domains, or in assessing the applicability of, or in interpreting the findings of, other theories, particularly their ‘uncertainties’.

It is hoped that it sufficiently resembles existing theories that some understandings and results can be transferred, at least by analogy.

## Outline

The most familiar concept is where the probability of a thing, A, conditional on a thing, B – denoted P(A|B) – is considered to be a precise number, p, in the interval [0,1], thus:

P(A|B) = p.

*We assume a setting where such a precise probability exists, but we do not necessarily know precisely what it is*. Largely for technical convenience we consider the ‘things’ to be propositions rather than events. We develop the concept in two stages: Dirac’s functional spaces and Good’s context. This gives:

P< | :C> is a context-dependent set of possible conditional probability functions, with:

P<A|B:C> ≡ {p(A|B) | p( ) ∈ P< | :C>}.

We then ‘lift’ Kolmogorov’s axioms and theorems about precise probabilities to theorems about possible probabilities, P< | : >. This theory resembles Kolmogorov’s, but is more flexible and allows us to record dependencies and assumptions.

## Derivation

Here we simply state the definitions. Proofs that they ‘make sense’ are provided in the elementary theory.

Dirac’s bra-ket notation <a|b> usually denotes a product of function-space mappings. We adapt it to probability functions, which are normalised functions. If ‘1’ is the proposition that is always true, then for a fixed context, C, we can – for the purpose of this exposition – take:

Pu<A> ≡ P<A|‘1’:C>

and start by considering unconditional probability functions, p( ), on propositions, thus:

Pu< > is a context-dependent set of possible unconditional probability functions, with:

Pu<A> ≡ {p → p(A) | p( ) ∈ Pu< >}.

The key insight is that for any ‘reasonable’ *op* we can define:

Pu<A> *op* Pu<B> = {p → p(A) *op* p(B) | p( ) ∈ Pu< >}.

Hence we may define the conditional probability,

Pc<A|B> ≡ Pu<A∧B>/Pu<B>, when Pu<B> ≠ 0,

noting that:

Pc<A|B> = {p → p(A|B) | p( ) ∈ Pc< | >},

where Pc< | > is the usual extension to conditional probabilities of Pu< >.

But this Pc< | > assumes a fixed context. We may use the notation

P< | :C>

to record the dependency on C.
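
As a minimal sketch of these definitions, assume a finite set of 'worlds' and a small finite set of candidate probability functions standing in for Pu< >. Propositions are then sets of worlds, Pu<A> is a mapping from each candidate p to p(A), and an *op* is lifted by pairing values that come from the same p. The names `WORLDS`, `PU`, `pu`, `lift` and `pc` are illustrative assumptions, and the conditional value is taken to be the usual ratio p(A∧B)/p(B).

```python
from fractions import Fraction as F

# Hypothetical toy setting: three mutually exclusive 'worlds'.
WORLDS = ("w1", "w2", "w3")

# A stand-in for Pu< >: a finite set of possible probability functions,
# each given as world -> probability (an assumption for illustration).
PU = {
    "p1": {"w1": F(1, 2), "w2": F(1, 4), "w3": F(1, 4)},
    "p2": {"w1": F(1, 5), "w2": F(2, 5), "w3": F(2, 5)},
}

def pu(prop):
    """Pu<A>: the mapping p -> p(A), keyed by the name of p."""
    return {name: sum(p[w] for w in prop) for name, p in PU.items()}

def lift(op, x, y):
    """Apply op pointwise, pairing values that come from the same p."""
    return {name: op(x[name], y[name]) for name in x}

def pc(a, b):
    """Pc<A|B>: p -> p(A and B) / p(B), defined when no p(B) is zero."""
    pb = pu(b)
    if any(v == 0 for v in pb.values()):
        raise ZeroDivisionError("Pu<B> has a zero value")
    return lift(lambda x, y: x / y, pu(a & b), pb)

A = frozenset({"w1", "w2"})
B = frozenset({"w2", "w3"})
print(pu(A))     # p1 -> 3/4, p2 -> 3/5
print(pc(A, B))  # p1 -> 1/2, p2 -> 1/2
```

In this toy case the unconditional value of A differs between the two candidates, yet the conditional value agrees. That is the kind of dependency the functional form keeps track of.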

## Motivational Comments

It would be more conventional, and simpler, to start with something like

Pr{A} ≡ {p(A) | p( ) ∈ Pu< >}

or

Pi[A] = [min(Pr{A}),max(Pr{A})],

but these lose precision. One could try using simply a representative point, perhaps by taking the average with respect to some weighting distribution over Pu< :C>, but this would not always be correct. One could add in an indication of the ‘probable error’, but that would not be precise.
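
A small sketch of how the interval form loses precision, assuming three hypothetical possible values for p(A). The functional form evaluates p(A) − p(A) within each single p, so it is exactly zero. Naive interval arithmetic on Pi[A] forgets that both operands come from the same p.

```python
# Hypothetical candidates for Pu< >: each entry is one possible value of p(A).
possible_p_of_A = [0.2, 0.5, 0.8]

# Functional form: p -> p(A) - p(A), evaluated within each single p.
functional = [pa - pa for pa in possible_p_of_A]

# Interval form Pi[A] = [min, max], followed by naive interval subtraction,
# which treats the two operands as if they could vary independently.
lo, hi = min(possible_p_of_A), max(possible_p_of_A)
interval = (lo - hi, hi - lo)

print(functional)  # every entry is exactly 0.0
print(interval)    # a non-degenerate interval around 0
```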

Boole’s approach is to represent Pu< :C> by a parametrised function, depending on the application, and then to solve for the parameters. Where one has an appropriate setting this may be seen as a reasonable approach, relying on our reference theory for its correctness.

If one wishes to identify a precise probability using a heuristic such as Laplace’s Principle of Indifference, one can always do so and note the use in the context, C.

Alternatively, a more natural information-theoretic approach would be to find a precise p that minimises the worst-case cross-entropy for p against q in P< | :C> and then to report the worst-case relative entropy as an indicator of uncertainty. This would seem to be a principled, neutral summary of P< | :C>. A refinement is to restrict consideration to specific propositions. (The cross-entropy of p,q is the entropy of p plus the relative entropy of p,q.)
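
The information-theoretic summary can be sketched for a single proposition, assuming a finite set of hypothetical possible values q(A) and a plain grid search. Reading the cross-entropy as H(q, p), with q the possible function and p the reported one, is an assumption about the intended direction. The function names are illustrative.

```python
import math

def cross_entropy(q, p):
    """H(q, p) for Bernoulli parameters q (possible) and p (reported)."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def kl(q, p):
    """Relative entropy D_KL(q || p) for Bernoulli parameters in (0, 1)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

# Hypothetical possible values of q(A) (an assumption for illustration).
possible_q = [0.2, 0.5]

# Grid search over candidate precise values p in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_star = min(grid, key=lambda p: max(cross_entropy(q, p) for q in possible_q))

# Report the worst-case relative entropy as the indicator of uncertainty.
worst_kl = max(kl(q, p_star) for q in possible_q)
print(p_star, worst_kl)
```

With these two candidates the search settles on p = 0.5, and the reported uncertainty is the divergence of the q = 0.2 candidate from that choice.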

## Interpretation

The meanings of P< | :C> and P<A|B:C> are straightforward: probability functions and values are included whenever they are possible according to our assumptions, data and heuristics.

One might reasonably say

P(A|B) = p

when p is a possible probability value and all possible values are sufficiently close to p, depending on the context. Similarly, one might say P( ) = p( ) if p( ) is a possible probability function and, for all possible probability functions q( ), the relative entropy for q against p (i.e. the Kullback-Leibler divergence, D_{KL}(q||p) ) is sufficiently small, where the entropy is taken with respect to some basis determined by the context. But otherwise, it may be misleading to make such a statement.

For example, if P(Outcome|Treatment) is known precisely for the general population but you are interested in a specific Group, then it is not necessarily the case that P(Outcome|Treatment∧Group) = P(Outcome|Treatment). Rather,

If P[Group] ≤ P[Outcome|Treatment] then p(Outcome|Treatment∧Group) = 1 is a possible probability value (unless it conflicts with some other constraints).

If you are in a Group that is rarer than any outcome and the probability of outcome could depend on sub-groups (e.g. genetics) then – for you – P(Outcome|Treatment) could be anything. The general data is at best a crude indication.
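
The range of possible values can be sketched with the standard bounds on a conjunction. This assumes that the relevant size of the Group is g = P(Group|Treatment) and that no other constraints apply. The function name and parameters are illustrative.

```python
def conditional_bounds(r, g):
    """Bounds on p(Outcome | Treatment and Group).

    r: the known P(Outcome | Treatment)
    g: P(Group | Treatment), assumed known and positive
    Uses the fact that p(Outcome and Group | Treatment) must lie
    between max(0, r + g - 1) and min(r, g).
    """
    if not 0 < g <= 1 or not 0 <= r <= 1:
        raise ValueError("need 0 < g <= 1 and 0 <= r <= 1")
    lower = max(0.0, (r + g - 1) / g)
    upper = min(1.0, r / g)
    return lower, upper

print(conditional_bounds(0.5, 0.25))  # (0.0, 1.0)
print(conditional_bounds(0.75, 0.5))  # (0.5, 1.0)
```

In the first call the Group is rarer than both the outcome and its complement, so the general figure places no constraint at all on the Group's value.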

## Muddling

Paradigmatic examples of probability are urns, where balls are drawn out, with or without replacement. A variation is where one draws out coins or other randomising devices, which one then samples, with or without replacement. In either case one gets a stable probability distribution. But in some situations of interest the statistics are only piece-wise stable. It is as if the same randomising device is used for a while, until another is selected and used.

Simple cases can be modelled by a virtual randomising device. For example,

P(Heads|Heads) = P(Tails|Tails) ≈ 1

gives the effect of an urn with double headed and double tailed coins, occasionally swapped. Thus the model above can incorporate some muddled behaviour. The situation resembles consideration of outcomes for treatments for different groups, above, where we suppose that probabilities may depend on timing.
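
A minimal simulation of such a virtual device, assuming a hypothetical `stay` probability of 0.99 to stand in for ‘≈ 1’. Each toss repeats the previous one with that probability, which produces long runs of one face with occasional swaps.

```python
import random

def sticky_sequence(n, stay=0.99, seed=0):
    """Simulate n tosses; each repeats the previous toss with probability `stay`."""
    rng = random.Random(seed)
    tosses = [rng.choice("HT")]
    for _ in range(n - 1):
        if rng.random() < stay:
            tosses.append(tosses[-1])
        else:
            # Swap to the other face, as if a different coin were drawn.
            tosses.append("T" if tosses[-1] == "H" else "H")
    return tosses

tosses = sticky_sequence(20000)
# Fraction of tosses that match their predecessor: an estimate of P(Heads|Heads)
# and P(Tails|Tails) pooled together.
repeats = sum(a == b for a, b in zip(tosses, tosses[1:])) / (len(tosses) - 1)
print(repeats)
```

Over a short window the observed frequency of Heads is close to 0 or 1. Only the pooled conditional frequency is stable, which is the piece-wise behaviour described above.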

*I need to ponder some more on this, and maybe develop the theory some more. I certainly do not want to rule out more radical muddling, cf. Binmore. We seem to need a vague type of ‘group’, at least.*

## Conclusion

*It seems to me that this gives the core concepts, which can be adapted or extended for many applications where a more conventional approach could mislead.*

## See Also

For more on possible probabilities:

- Elementary Theory,
- Examples,
- Relationship to conventional theory.

More generally, my notes on reason and uncertainty, including:

- conventional precise probability (with critiques),
- other theories of broader (imprecise) probability,

and my notes on mathematics.

*Draft: More to come*.