Shannon’s Math. Th.

Claude Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1963.

A classic, based on Shannon’s classified Communication Theory of Secrecy Systems, 1946.

The key is to treat message generating processes as if they were stochastic and ergodic, so that ‘probability’ is well defined and measurable. Then entropy (based on probabilities) is the appropriate measure of ‘uncertainty’ and the difference in entropy is the appropriate measure of the ‘amount’ of information provided by a message. On average, this is never negative.

Weaver’s relatively metaphysical introduction precedes Shannon’s more engineering-oriented main body.

Recent Contributions to the Mathematical Theory of Communication

Warren Weaver

Abstract

Shannon has naturally been specially concerned to push the applications to engineering communication, while Wiener has been more concerned with biological application (central nervous system phenomena etc.).

I Introductory Note …

1.2 Three Levels of Communications Problems

These are:

A: Technical.
B: Semantic.
C: Effectiveness.

[I]f Mr. X is suspected not to understand what Mr. Y says, then it is theoretically not possible, by having Mr. Y do nothing but talk further with Mr. X, completely to clarify the situation in finite time.

So stated, one would be inclined to think that Level A is a relatively superficial one, involving only the engineering details of good design of a communication system; while B and C seem to contain most if not all of the philosophical content of the general problem of communication.

The mathematical theory of the engineering aspects of communication, as developed chiefly by Claude Shannon at the Bell Telephone Laboratories, admittedly applies in the first instance only to problem A, namely, the technical problem of accuracy of transference of various types of signals from sender to receiver. But the theory has, I think, a deep significance which proves that the preceding paragraph is seriously inaccurate. Part of the significance of the new theory comes from the fact that levels B and C, above, can make use only of those signal accuracies which turn out to be possible when analyzed at Level A. Thus any limitations discovered in the theory at Level A necessarily apply to levels B and C. But a larger part of the significance comes from the fact that the analysis at Level A discloses that this level overlaps the other levels more than one could possibly naively suspect. Thus the theory of Level A is, at least to a significant degree, also a theory of levels B and C. I hope that the succeeding parts of this memorandum will illuminate and justify these last remarks.

II Communication Problems at Level A [Technical]

2.2 Information

Weaver notes that Shannon assumes that the source is stochastic and even ergodic.

… Suppose that two persons choose [random] samples in different ways, and study what trends their statistical properties would show as the samples become larger. If the situation is ergodic, then those two persons, however they may have chosen their samples, agree in their estimates of the [statistical] properties of the whole. Ergodic systems, in other words, exhibit a particularly safe and comforting sort of statistical regularity.

III The Interrelationship of the Three Levels of Communication Problems

3.2 Generality of the Theory of Level A [Technical]

[W]hen one moves to levels B [semantic] and C [effectiveness], it may prove to be essential to take account of the statistical characteristics of the destination. One can imagine [a] “Semantic Receiver” interposed between the engineering receiver (which changes signals to messages) and the destination. This semantic receiver subjects the message to a second decoding, the demand on this one being that it must match the statistical semantic capacities of the totality of receivers, or of that subset of receivers which constitute the audience one wishes to affect.

Similarly one can imagine … , inserted between the information source and the transmitter … “semantic noise”. From this source is imposed into the signal the perturbations or distortions of meaning which are not intended by the source but which inescapably affect the destination. It is also possible to think of an adjustment of the original message so that the sum of message meaning plus semantic noise is equal to the desired total message meaning at the destination.

[A] general theory at all levels will surely have to take into account not only the capacity of the channel but also … the capacity of the audience.

[L]anguage must be designed (or developed) with a view to the totality of the things that man [sic] may wish to say; but not being able to accomplish everything, it should do as well as possible as often as possible. That is to say, it should deal with its task statistically.

The concept of the information to be associated with a source leads directly, as we have seen, to a study of the statistical structure of language; and this study reveals about the English language, as an example, information which seems surely significant to students of every phase of language and communication. The idea of utilizing the powerful body of theory concerning Markoff processes seems particularly promising for semantic studies, since this theory is specifically adapted to handle one of the most significant but difficult aspects of meaning, namely the influence of context. One has the vague feeling that information and meaning may prove to be something like a pair of canonically conjugate variables in quantum theory, they being subject to some joint restriction that condemns a person to the sacrifice of the one as he insists on having much of the other.

Weaver notes, but does not motivate, the ergodic hypothesis at the technical level and warns against too simplistic an interpretation of Shannon’s mathematics, particularly at the higher levels.

Note that the recommendation that “language must be designed [to] do as well as possible as often as possible. That is to say, it should deal with its task statistically” is only sensible for ergodic sources, where the future is a variation on the past. If one has occasional crises (e.g., financial crashes) then one may want a language that is adequate (in anticipating crises) especially just before potential crises, rather than a language that is more precise almost all of the time but cannot represent crises.

The Mathematical Theory of Communication

Claude E. Shannon

Almost identical to Shannon’s 1948 A Mathematical Theory of Communication.

I Discrete Noiseless Systems

2. Discrete Source of Information

We can think of a discrete source as generating the message, symbol by symbol. It will choose successive symbols according to certain probabilities depending, in general, on preceding choices as well as the particular symbols in question. A physical system, or a mathematical model of a system which produces such a sequence of symbols governed by a set of probabilities, is known as a stochastic process. We may consider a discrete source, therefore, to be represented by a stochastic process. Conversely, any stochastic process which produces a discrete sequence of symbols chosen from a finite set may be considered a discrete source. This will include such cases as:

1. Natural written languages such as English, German, Chinese.

3. The Series of Approximations to English

Shannon considers sequences of letters generated by Markoff processes of increasing sophistication, looking at character and word frequencies singly, in pairs and so on. He shows that these increasingly resemble English.
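As a rough illustration of this series of approximations, here is a minimal Python sketch of a character-level Markoff generator. It is not Shannon’s procedure (he worked from published frequency tables and by hand). The function names, the tiny sample text and the restart rule for unseen contexts are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def build_model(text, order):
    """Map each run of `order` characters to counts of the character that follows it."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, order, length, seed=0):
    """Emit `length` characters, each drawn according to the preceding `order` characters."""
    rng = random.Random(seed)
    contexts = sorted(model)
    context = rng.choice(contexts)
    out = list(context)
    while len(out) < length:
        counts = model.get(context)
        if not counts:
            # Assumption: on an unseen context, restart from a random known one.
            context = rng.choice(contexts)
            counts = model[context]
        symbols, weights = zip(*sorted(counts.items()))
        out.append(rng.choices(symbols, weights=weights)[0])
        context = "".join(out[-order:]) if order else ""
    return "".join(out)

# Hypothetical sample text; a real experiment would use a large English corpus.
sample = "the theory of the thing that they then thought through"
for order in (0, 1, 2):
    print(order, generate(build_model(sample, order), order, 40))
```

With order 0 only single-letter frequencies are matched. Higher orders also match pair and triple frequencies, so the output should look progressively more like the source text.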

5. Ergodic and Mixed Sources

As we have indicated above a discrete source for our purposes can be considered to be represented by a Markoff process. Among the possible discrete Markoff processes there is a group with special properties of significance in communication theory. This special class consists of the “ergodic” processes and we shall call the corresponding sources ergodic sources. Although a rigorous definition of an ergodic process is somewhat involved, the general idea is simple. In an ergodic process every sequence produced by the process is the same in statistical properties. Thus the letter frequencies, digram frequencies, etc., obtained from particular sequences, will, as the lengths of the sequences increase, approach definite limits independent of the particular sequence. Actually this is not true of every sequence but the set for which it is false has probability zero. Roughly the ergodic property means statistical homogeneity.

Except when the contrary is stated we shall assume a source to be ergodic. This assumption enables one to identify averages along a sequence with averages over the ensemble of possible sequences (the probability of a discrepancy being zero). For example the relative frequency of the letter A in a particular infinite sequence will be, with probability one, equal to its relative frequency in the ensemble of sequences.
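The identification of sequence averages with ensemble averages can be checked numerically. The Python sketch below assumes a two-state Markoff chain and, for contrast, a ‘mixed’ source that picks one of two biased coins once and then keeps it. The transition probabilities, the coin biases and the function names are illustrative assumptions, not taken from Shannon.

```python
import random

def time_average_markov(p_stay_a, p_stay_b, steps, seed):
    """Fraction of steps that a two-state Markoff chain spends in state 'A'."""
    rng = random.Random(seed)
    state, count_a = "A", 0
    for _ in range(steps):
        count_a += state == "A"
        stay = p_stay_a if state == "A" else p_stay_b
        if rng.random() >= stay:
            state = "B" if state == "A" else "A"
    return count_a / steps

def stationary_prob_a(p_stay_a, p_stay_b):
    """Ensemble probability of 'A', from pi_A * P(leave A) = pi_B * P(leave B)."""
    leave_a, leave_b = 1 - p_stay_a, 1 - p_stay_b
    return leave_b / (leave_a + leave_b)

def time_average_mixed(steps, seed):
    """Mixed (non-ergodic) source: choose one biased coin at the start and keep it."""
    rng = random.Random(seed)
    p_a = rng.choice([0.9, 0.1])
    return sum(rng.random() < p_a for _ in range(steps)) / steps

print("ensemble:", stationary_prob_a(0.9, 0.6))
print("ergodic runs:", [round(time_average_markov(0.9, 0.6, 100_000, s), 3) for s in range(3)])
print("mixed runs:", [round(time_average_mixed(20_000, s), 3) for s in range(5)])
```

In the ergodic case every long run should settle near the same value as the ensemble figure. In the mixed case each run settles near 0.9 or 0.1, never near the ensemble average of 0.5, so a single sequence tells one about only its own component.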

6. Choice, Uncertainty and Entropy

We have represented a discrete information source as a Markoff process. Can we define a quantity which will measure, in some sense, how much information is “produced” by such a process, or better, at what rate information is produced? Suppose we have a set of possible events whose probabilities of occurrence are p_1, p_2, …, p_n. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome?

If there is such a measure, say H(p_1, p_2, …, p_n), it is reasonable to require of it the following properties:

1. H should be continuous in the p_i.

2. If all the p_i are equal, p_i = 1/n, then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events.

3. If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. …
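Shannon goes on to show that, up to a constant factor, the only function with these properties is H = −Σ p_i log p_i. The Python sketch below takes logarithms to base 2 (so H is in bits) and checks properties 2 and 3 numerically, using Shannon’s own example of splitting a choice among probabilities 1/2, 1/3, 1/6 into two stages. The function name and layout are illustrative assumptions.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Property 2: for n equally likely events, H grows with n.
uniform = [entropy([1 / n] * n) for n in range(1, 6)]

# Property 3: choose among (1/2, 1/3, 1/6) directly, or first make a
# 50/50 choice and then, half the time, a (2/3, 1/3) choice.
direct = entropy([1 / 2, 1 / 3, 1 / 6])
staged = entropy([1 / 2, 1 / 2]) + 0.5 * entropy([2 / 3, 1 / 3])

print(uniform)
print(direct, staged, isclose(direct, staged))
```

The weight 0.5 on the second-stage term is the probability that the second choice is made at all, which is what ‘weighted sum’ means in property 3.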

The uncertainty of y is never increased by knowledge of x. It will be decreased unless x and y are independent events, in which case it is not changed.
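This claim can be checked on small joint distributions. The Python sketch below uses the identity H(y|x) = H(x, y) − H(x) and two hypothetical joint distributions, one dependent and one independent. The weather labels and probabilities are invented purely for illustration.

```python
from math import log2
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal distribution of x (axis 0) or y (axis 1) from {(x, y): p}."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_entropy(joint):
    """H(y|x) = H(x, y) - H(x): average uncertainty of y once x is known."""
    return entropy(joint.values()) - entropy(marginal(joint, 0).values())

# Hypothetical joint distributions over (x, y).
dependent = {("rain", "wet"): 0.4, ("rain", "dry"): 0.1,
             ("sun", "wet"): 0.1, ("sun", "dry"): 0.4}
independent = {(x, y): 0.25 for x in "ab" for y in "cd"}

for name, joint in (("dependent", dependent), ("independent", independent)):
    h_y = entropy(marginal(joint, 1).values())
    print(name, "H(y) =", round(h_y, 4), "H(y|x) =", round(conditional_entropy(joint), 4))
```

Note that the inequality concerns the average over x. It does not say that every individual observation reduces uncertainty.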

Taken literally Shannon seems to suppose that, at least in his writing, he was ‘governed by a set of probabilities’. But all he really needs to assume is that the hypothesis that the messages under consideration were so generated will not be rejected by relevant statistical tests. In particular, he shows that a relatively crude Markoff model, based on ‘short-range’ statistics, supports a satisfactory encoding. Using any longer-range correlations would not yield a significant benefit. Thus it is not that authors are necessarily Markoff, but that the Markoff approximation is a good one for his purposes.

The theory presupposes ‘a particularly safe and comforting sort of statistical regularity’. More generally, we can think of communication as having regular and exceptional components, with the regular component being capable of being expressed in a compressed form, leaving the exceptional to consider.

Suppose that ‘all swans are white’, so that P(white|swan) = 1.0. Then we need only communicate ‘swan’. On the other hand if, for the first time, we see a black swan and just say ‘swan’ the recipient will be misled, expecting there to be a white bird. We need to not only communicate ‘swan’ but also ‘some swans are black’. This latter is changing the recipient’s ‘model’ of the world, in addition to the normal type of communication.
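The swan example can be put in terms of surprisal, −log2 p, the number of bits an ideal code assigns to an event of probability p. The Python sketch below is only an illustration. The revised probability of 0.01 for a black swan is an assumed figure, and the names are hypothetical.

```python
from math import log2, inf

def surprisal(p):
    """Bits an ideal code spends on an event of probability p; infinite when p == 0."""
    return inf if p == 0 else log2(1 / p)

old_model = {"white": 1.0, "black": 0.0}    # 'all swans are white'
new_model = {"white": 0.99, "black": 0.01}  # assumed revision after the first black swan

for name, model in (("old", old_model), ("new", new_model)):
    print(name, {colour: surprisal(p) for colour, p in model.items()})
```

Under the old model the colour of a swan costs nothing to send, but a black swan has no finite encoding at all. It can be reported only after the model itself has been revised, which is the extra communication described above.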

We can think of this in terms of two channels. One is the normal one, assuming regularity, the other is used to promulgate changes to the model used in the normal channel. This resembles the approach of Whitehead for situations in which ‘normality’ is epochal. From a ‘higher’ perspective it may be that changes to the ‘lower’ model are normal at that level. In this case, if ‘information’ is relative to a known context and ‘meaning’ is about things that change the context then, as Weaver says, they are conjugate, as in quantum theory.

From time to time one has ‘market shocks’. Mild changes can be seen as part of the previously assumed probability distributions, and straightforward information theory applies. But some market events change one’s perception of the market and what is probable. These are not events ‘within’ the assumed market, but higher-level events that tell us about the nature of the market. They need to be considered differently, especially if there are insufficient samples to inform a reliable probability estimate.

Where the higher level is probabilistic one can form an overall entropy (as in section 6, above). This overall entropy is then dominated by the higher-order contribution, whereas any statistical estimates will concern the lower-order behaviours. More generally, we should be careful in our interpretation of Weaver’s:

[L]anguage … should do as well as possible as often as possible. That is to say, it should deal with its task statistically.

Sometimes it is the infrequent big changes that matter more than the most sophisticated statistical analysis of that data which seems most relevant.