Artificial Intelligence?

The subject of ‘Artificial Intelligence’ (AI) has long provided ample scope for inconclusive debates. Wikipedia seems to have settled on a view that we may take as a straw man:

Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it. [Dartmouth Conference, 1956] The appropriately programmed computer with the right inputs and outputs would thereby have a mind in exactly the same sense human beings have minds. [the ‘strong AI’ hypothesis, as stated by John Searle in order to argue against it]

Readers of my blog will realise that I agree with Searle that this hypothesis is wrong, but for different reasons. It seems to me that mainstream AI (mAI) is about being able to take instruction. This is a part of learning, but by no means all of it. Thus – I claim – mAI is about a sub-set of intelligence. In many organisational settings it may be that sub-set which the organisation values. It may even be that an AI that ‘thought for itself’ would be a danger. For example, in old discussions about whether or not some type of AI could ever act as a G.P. (General Practitioner – first-line doctor), the underlying issue has been whether G.P.s ‘should’ think for themselves, or just apply their trained responses. My own experience is that sometimes G.P.s doubt the applicability of what they have been taught, and that sometimes this is ‘a good thing’.

In effect, we sometimes want to train people, or otherwise arrange for them to react in predictable ways, as if they were machines. mAI can create better machines, and thus has many key roles to play. But between mAI and ‘superhuman intelligence’ there seems to be an important gap: the kind of intelligence that makes us human. Can machines display such intelligence? (Can people, in organisations that treat them like machines?)

One successful mainstream approach to AI is to work with probabilities, such as P(A|B) (‘the probability of A given B’), making extensive use of Bayes’ rule, and such an approach is sometimes thought to be ‘logical’, ‘mathematical’, ‘statistical’ and ‘scientific’. But, mathematically, we can generalise the approach by taking account of some context, C, using Jack Good’s notation P(A|B:C) (‘the probability of A given B, in the context C’). AI that is explicitly or implicitly statistical is more successful when it operates within a definite, fixed context, C, for which the appropriate probabilities are (at least approximately) well-defined and stable. For example, training within an organisation will typically seek to enable staff (or machines) to characterise their job sufficiently well for it to become routine. In practice, ‘AI’-based machines often show a little intelligence beyond that described above: they will monitor the situation and ‘raise an exception’ when it is too far outside what they ‘expect’. But this just points to the need for a superior intelligence to resolve the situation. Here I present some thoughts.
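
A toy sketch may make the point concrete. The example below is my own, not taken from any particular AI system: the figures and the two named contexts are invented, and the point is simply that a conditional probability learned within one fixed context C can be confidently wrong when the context silently changes.

    # Hypothetical frequencies: P(alarm is genuine | sensor triggered : C), for two contexts.
    p_given_context = {
        "normal_operations": 0.02,   # context C1: most triggers are false alarms
        "storm_conditions":  0.60,   # context C2: most triggers are genuine
    }

    def p_genuine(sensor_triggered, context="normal_operations"):
        # A statistically trained system typically bakes one context into its estimates.
        return p_given_context[context] if sensor_triggered else 0.0

    print(p_genuine(True))                        # 0.02 -- fine while the assumed context holds
    print(p_genuine(True, "storm_conditions"))    # 0.60 -- the same evidence, in another context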

When we state ‘P(A|B)=p’ we are often not just asserting the probability relationship: it is usually implicit that ‘B’ is the appropriate condition to consider if we are interested in ‘A’. Contemporary mAI usually takes the conditions as given, and computes ‘target’ probabilities from given probabilities. Whilst this requires a kind of intelligence, it seems to me that humans will sometimes also revise the conditions being considered, and this requires a different type of intelligence (not just the ability to apply Bayes’ rule). For example, astronomers who refine the values of relevant parameters are displaying some intelligence and are ‘doing science’, but those first in the field, who determined which parameters are relevant, employed a different kind of intelligence and were doing a different kind of science. What we need, at least, is an appropriate way of interpreting and computing ‘probability’ to support this enhanced intelligence.

The notions of Whitehead, Keynes, Russell, Turing and Good seem to me a good start, although they need to be explained better – hence this blog. Economics may provide an example: the notion of probability routinely used there would be appropriate if we were certain about some fundamental assumptions. But are we? At least we should realise that it is not logical to attempt to justify those assumptions by reasoning with concepts that implicitly rely on them.

Dave Marsay

Knightian uncertainty and epochs

Frank Knight was a contemporary of Keynes who emphasised and popularised the importance of ‘Knightian uncertainty’, meaning uncertainty other than Bayesian probability. That is, it is concerned with events that are not deterministic but also not ‘random’ in the same sense as idealised gambling mechanisms.

Whitehead’s epochs, when stable and ergodic in the short term, tend to satisfy the Bayesian assumptions, whereas future epochs do not. Thus within the current epoch one has (Bayesian) probabilities, while in the longer term one has Knightian uncertainty.

Example

Consider a casino with imperfect roulette wheels whose biases are unknown. It might reasonably change the wheel (or other gambling mechanism) whenever a punter seems to be doing infeasibly well. Even if we assume that the current wheel has associated unknown probabilities that can be estimated from the observed outcomes, the longer-term uncertainty seems qualitatively different. If there are two wheels with known biases that might be used next, and if we don’t know which is to be used, then, as Keynes shows, one needs to represent the ambiguity rather than being able to fuse the two probability distributions into one. (If the two wheels have equal and opposite biases then by the principle of indifference a fused version would have the same probability distribution as an idealised wheel, yet the two situations are quite different.)
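
To see how the two situations differ in practice, here is a small simulation of my own (the 0.6/0.4 biases are invented for illustration). Under the ‘fused’ model every spin is independently fair, whereas under the ambiguous model one biased wheel – we don’t know which – is used throughout, so the spread of outcomes over many spins is far larger.

    import random

    # Two wheels with equal and opposite (hypothetical) biases towards 'red'.
    # Fused model: average the biases and treat each spin as an independent fair bet.
    # Ambiguous model: one of the two wheels is in use for every spin, but we don't know which.

    def reds_fused(n_spins):
        return sum(random.random() < 0.5 for _ in range(n_spins))

    def reds_ambiguous(n_spins):
        p_red = random.choice([0.6, 0.4])          # the unknown wheel, fixed for the whole session
        return sum(random.random() < p_red for _ in range(n_spins))

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    fused = [reds_fused(100) for _ in range(10000)]
    ambiguous = [reds_ambiguous(100) for _ in range(10000)]
    print(round(variance(fused)))       # about 25: the binomial spread of a fair wheel
    print(round(variance(ambiguous)))   # about 124: same average, very different uncertainty

The two models agree on the probability of red for any single spin, but they disagree about almost everything else – which is just Keynes’ point about ambiguity.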

Bayesian probability in context

In practice, the key to Bayesian probability is Bayes’ rule, by which the estimated probability of hypotheses is updated according to the likelihood of new evidence under those hypotheses. (P(H|E)/P(H’|E) = {P(E|H)/P(E|H’)}•{P(H)/P(H’)}.) But estimated probabilities depend on the ‘context’ or epoch, which may change without our receiving any data. Thus, as Keynes and Jack Good point out, the probabilities should really be qualified by context, as in:

P(H|E:C)/P(H’|E:C) = {P(E|H:C)/P(E|H’:C)}•{P(H:C)/P(H’:C)}.

That is, the result of applying Bayes’ rule is conditional on our being in the same epoch. Whenever we consider data, we should not only consider what it means within our assumed context, but also whether it has implications for our context.
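
For concreteness, here is a minimal sketch of the odds form of the rule, written with an explicit context in the spirit of Good’s P(·|·:C) notation; the function and the numbers are mine, purely for illustration.

    def posterior_odds(prior_odds, p_E_given_H, p_E_given_H_alt):
        # Odds form of Bayes' rule within a single context C:
        # P(H|E:C)/P(H'|E:C) = {P(E|H:C)/P(E|H':C)} * {P(H:C)/P(H':C)}.
        # Every argument must be assessed within the SAME context C.
        return (p_E_given_H / p_E_given_H_alt) * prior_odds

    # Hypothetical figures: within C the evidence is three times as likely under H as under H'.
    print(posterior_odds(prior_odds=1.0, p_E_given_H=0.6, p_E_given_H_alt=0.2))   # 3.0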

Taking a Bayesian approach, if G is a global, certain context and C is a sub-context that is believed but not certain, then taking

P(E|H:G) = P(E|H:C), etc.,

is only valid when P(C:G) ≈ 1, both before and after obtaining E. But by Bayes’ rule, for an alternative C’:

P(C|E:G)/P(C’|E:G) = {P(E|C:G)/P(E|C’:G)}•{P(C:G)/P(C’:G)}.

Thus, unless the evidence, E, is at least as likely under C as under any other possible sub-context, one needs to check that P(C|E:G) ≈ 1. If it is not, then one may need to change the apparent context, C, and compute P(H|E:C’)/P(H’|E:C’) from scratch: Bayes’ rule in its common – simple – form does not apply. (There are also other technical problems, but this piece is about epochs.) If one is not certain about the epoch then for each possible epoch one has a different possible Bayesian probability, a form of Knightian uncertainty.
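
The check itself is easy to state in code. The sketch below is my own rendering of the paragraph above, with invented numbers and a single alternative C’ standing in for ‘any other possible sub-context’: after seeing E we ask whether the assumed context is still nearly certain, and if not we flag that the simple form of Bayes’ rule should not be trusted.

    def context_check(p_C, p_E_given_C, p_E_given_C_alt, threshold=0.95):
        # p_C = P(C:G); the alternative C' gets the remaining belief 1 - p_C.
        # Returns P(C|E:G) and whether it is still close enough to 1 to carry on within C.
        joint_C = p_E_given_C * p_C
        joint_alt = p_E_given_C_alt * (1.0 - p_C)
        p_C_given_E = joint_C / (joint_C + joint_alt)
        return p_C_given_E, p_C_given_E >= threshold

    # Hypothetical figures: C was believed (0.99), but the evidence is 50 times likelier under C'.
    print(context_check(0.99, 0.01, 0.5))   # (≈0.66, False): re-examine the context before updating H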

Representing uncertainty across epochs

Sometimes a change of context means that everything changes. At other times, some things can be ‘read across’ between epochs. For example, suppose that the hypotheses, H, are the same but the likelihoods P(E|H:C) change. Then one can use Bayes’ rule to maintain likelihoods conditioned on possible contexts.
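
One way of doing the book-keeping – my own sketch, with invented likelihood tables – is to keep a weight for every (context, hypothesis) pair, so that the shared hypotheses are ‘read across’ while each candidate context supplies its own likelihoods.

    # Hypothetical likelihoods P(E|H:C) for two candidate contexts and two shared hypotheses.
    likelihoods = {
        ("C1", "H"):  {"e": 0.7, "not_e": 0.3},
        ("C1", "H'"): {"e": 0.4, "not_e": 0.6},
        ("C2", "H"):  {"e": 0.2, "not_e": 0.8},
        ("C2", "H'"): {"e": 0.5, "not_e": 0.5},
    }

    # Uniform prior weight over every (context, hypothesis) pair.
    weights = {key: 0.25 for key in likelihoods}

    def update(evidence):
        # Bayes' rule applied pair by pair: multiply by the likelihood in that context, then renormalise.
        for key in weights:
            weights[key] *= likelihoods[key][evidence]
        total = sum(weights.values())
        for key in weights:
            weights[key] /= total

    update("e")
    print(weights)   # hypotheses can now be compared within each candidate context separately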

Information 

Shannon’s mathematical information theory builds on the notion of probability. A probability distribution determines an entropy, whilst information is measured by the change in entropy. The familiar case of Shannon’s theory (which is what people normally mean when they refer to ‘information theory’) makes assumptions that imply Bayesian probability and a single epoch. But in the real world one often has multiple or ambiguous epochs. Thus conventional notions of information are conditional on the assumed context. In practice, the component of ‘information’ measured by the conventional approach may be much less important than that which is neglected.
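
To make the dependence on context concrete, here is a small illustration of my own (the distributions are invented): the same observation produces a large reduction in entropy under one assumed context and a negligible one under another, so the ‘information’ it carries is not context-free.

    import math

    def entropy(dist):
        # Shannon entropy, in bits, of a discrete distribution.
        return -sum(p * math.log2(p) for p in dist if p > 0)

    # Hypothetical prior and posterior distributions over the same hypotheses,
    # as they would be assessed under two different assumed contexts.
    prior = [0.5, 0.5]
    posterior_in_C1 = [0.9, 0.1]
    posterior_in_C2 = [0.6, 0.4]

    print(round(entropy(prior) - entropy(posterior_in_C1), 3))   # ≈ 0.531 bits of 'information' in C1
    print(round(entropy(prior) - entropy(posterior_in_C2), 3))   # ≈ 0.029 bits of 'information' in C2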

Shocks

It is normally supposed that pieces of evidence are independent and so information commutes:

P(E1+E2|H:C) = P(E1|H:C).P(E2|H:C) = P(E2|H:C).P(E1|H:C) = P(E2+E1|H:C)

But if we need to re-evaluate the context, C, this is not the case unless we re-evaluate old data against the new context. Thus we may define a ‘shock’ as anything that requires us to change the likelihood functions for either past or future data. Such shocks occur when data arrive that had been considered unlikely under the assumed context. Alternatively, if we are maintaining likelihoods against many possible assumptions, a shock occurs when none of them remains probable.
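
A rough sketch of the second reading – my own construction, with invented likelihood tables and an arbitrary threshold – maintains the likelihood of the data so far under each candidate context and raises a ‘shock’ when none of them makes the data reasonably likely. Within any fixed context the per-observation likelihoods simply multiply, so the order of the evidence does not matter; the shock test is about the contexts, not the order.

    # Hypothetical per-observation likelihoods under two candidate contexts.
    likelihood = {
        "C1": {"a": 0.6, "b": 0.3, "z": 0.1},
        "C2": {"a": 0.2, "b": 0.7, "z": 0.1},
    }

    def shock_test(observations, threshold=1e-3):
        # Multiply the likelihoods of the observations within each context (order-independent),
        # then declare a shock if even the best-supported context leaves the data very unlikely.
        totals = {}
        for context, table in likelihood.items():
            product = 1.0
            for obs in observations:
                product *= table[obs]
            totals[context] = product
        return totals, max(totals.values()) < threshold

    print(shock_test(["a", "a", "b"]))              # no shock: C1 accounts for the data
    print(shock_test(["z", "z", "z", "z", "z"]))    # shock: no maintained context does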

See also

Complexity and epochs, Statistic and epochs, reasoning in a complex, dynamic, world.

Keynes, Turing and Good developed a theory of ‘weight of evidence’ for reasoning in complex worlds. TBB

David Marsay