F C T S S T R N G R T H N F C T N
The fact that we can reconstruct the meaning of the message from these symbols alone shows that what was left out did not convey any information that was essential to the communication, i.e., the vowels and spaces were redundant for this message. (Don't push this example too far ... one could get through a first grade reader without vowels or spaces, but I doubt whether one could handle such an abridged version of Finnegans Wake [James Joyce]). If redundancy is something that exists and can be compared [ first grade reader > Finnegans Wake ], then we should be able to precisely define it and then measure it.
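The abridgement itself is easy to reproduce. The sketch below assumes the intended message is "FACT IS STRANGER THAN FICTION" (our reading of the symbols, not something the text states) and uses a hypothetical helper, `strip_vowels_and_spaces`, to drop the redundant characters.

```python
VOWELS = set("AEIOU")

def strip_vowels_and_spaces(message: str) -> str:
    """Drop vowels and spaces, keeping the remaining characters in order."""
    return "".join(ch for ch in message.upper() if ch not in VOWELS and ch != " ")

# The full message is an assumption: one reading that fits the symbols shown.
abridged = strip_vowels_and_spaces("FACT IS STRANGER THAN FICTION")
print(" ".join(abridged))  # prints F C T S S T R N G R T H N F C T N
```

Going the other way, from the consonants back to the message, is the hard direction, and it is the reader's knowledge of English that makes it possible.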
As with any other mathematical treatment of a real world concept, we will create a mathematical model of the situation and make our definitions and take our measurements with respect to that model. How well this corresponds to the real world is then a question of how well the model fits, and if it is a good model we can tinker with it until we get whatever fit we need.
Rather than dealing with redundancy directly, let's consider the other side of the coin: information. Rather than attempting a definition, consider what our intuition tells us about information. Consider a horse race. If our friendly neighborhood bookie gives us a tip - say, Finnegans Wake is a sure thing in the 5th at Pimlico - then we would say that we have some information about that race compared to not having this tip. Without the tip we have no information concerning the race: our uncertainty about the outcome is maximal, and the most rational thing we can say about the outcome of the race is that each horse has the same chance of winning. With the information (if we trust the source), our uncertainty is diminished and the outcome no longer has an equiprobable distribution. If we had received the tip after the race was over, then it would have had no informational content because we would already be certain about the outcome. What we see here is a reciprocal relationship between information about the outcome and our uncertainty of the outcome. The more information we have, the less uncertain we are. Uncertainty is a concept that we can handle mathematically with the theory of probabilities, so in the model we are creating we will formally identify information with the reciprocal of uncertainty. This identification pares away much of the semantic content of the concept of information but leaves us with a quantifiable aspect of that concept. It leaves us open to claims that we are tossing out the baby with the bathwater, but the vindication of this identification will come with the usability of the model we create.
Uncertainty in a physical system is a well-known concept. The measurement of this uncertainty or randomness is called entropy by the physical scientists. Entropy is the subject of one of the most fundamental of physical laws, the 2nd Law of Thermodynamics. Claude Shannon, with brilliant insight, saw this connection with information theory and called the measure of information entropy also. Before defining this measure, we need to make precise the idea of what messages we are going to try to measure for information content.
We think of the source of our messages as a process that emits consecutive symbols from a finite alphabet. Each symbol has a particular probability of being emitted at any precise time. These probabilities depend upon what has already been emitted. For instance, if our source is producing English and the last two letters emitted were a "t" and an "h," then the probability of the next letter being a "p" is very low while that for an "e" is much higher, but if the last two letters were "o" and "o" then the probability of a "p" is higher than that of an "e." Such a process is called a Markov process and may be classified by how much of the previous history is needed to determine the probabilities of the next symbol to be emitted. Thus, a 4th order Markov process requires knowing the last 4 symbols before the probability of the next symbol can be calculated. As a special case, a 0th order Markov process assigns the probabilities without reference to what has gone before. A property that we shall require of our Markov process source is that it be ergodic. Ergodicity has a difficult technical definition, but its meaning can be made clear. A process is said to be ergodic if almost all of its output strings eventually have the same statistical properties. That is, after the process has run for a while, any output string will have the same frequency counts and distribution patterns as any other (with exceptions being so rare as to be disregarded). This assumption makes the computational aspects of the Markov process tractable and there is some evidence from cryptology that natural languages come close to being ergodic in nature. To build a source for a natural language such as English we proceed as follows: We consider a series of ergodic Markov sources of increasing order. As a 0th order source we take as the probabilities for the symbols the relative frequency of the letters in the language. 
For a 1st order source we use the relative frequency of letter pairs (digrams) together with the probabilities of the 0th order source to calculate the conditional probabilities (i.e., the probability that the next letter is a "k" if the first letter is a "c" for instance) used in the 1st order process. Then using the relative frequency of trigrams we can construct a 2nd order Markov process. Theoretically we can use the statistics of the language to create higher and higher order Markov processes. Now, passing to the limit as the order goes to infinity gives us an ergodic Markov process for our natural language. It has been estimated that the limit is practically achieved around the 32nd order process (i.e., letters more than 32 positions away have no discernible effect on the choice of the next letter) for an English source.
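The step from digram frequencies to conditional probabilities can be sketched in a few lines. This is only an illustration: the function name `conditional_probabilities` and the sample sentence are assumptions, and a real table would be built from a large body of English text. Dividing each digram count by the count of its first letter is the same as dividing the digram's relative frequency by the 0th order probability of that letter.

```python
from collections import Counter, defaultdict

def conditional_probabilities(text):
    """Estimate P(next symbol | previous symbol) from digram counts in `text`."""
    pair_counts = Counter(zip(text, text[1:]))   # counts of each digram
    first_counts = Counter(text[:-1])            # how often each symbol starts a digram
    table = defaultdict(dict)
    for (prev, nxt), n in pair_counts.items():
        table[prev][nxt] = n / first_counts[prev]
    return dict(table)

# Toy sample (hypothetical); far too short to say anything about English.
sample = "the cat sat on the mat"
probs = conditional_probabilities(sample)
print(probs["t"])  # distribution of the symbol following a "t" in the sample
```

Each row of the table is a probability distribution over the next symbol, which is what a 1st order source needs. The same counting with longer contexts gives the 2nd and higher order tables.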
To make this discussion a little more concrete consider the following "approximations" to the English language generated by Markov processes. In these examples we use a 27-letter alphabet, the 26 English letters and a space. A 0th order process with the outcomes equiprobable (i.e., the probability of any letter appearing is 1/27) would give output like this:
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD ZEWRTZYNSADXESYJRQY WGECIJJ OBVKRBQPOZBYMBUAWVLBTQCNIKFMP MKVUUGB M DM QASCJDGFOZYNX ZSDZLXIKUD
A 0th order process with probabilities assigned to letters as the relative frequency they have in the English language results in:
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYATH EEI ALHENHTTPA OOBTTVA NAH BRL OR L RW NILI E NNSBATEI AI NGAE ITF NNR ASAEV OIE BAINTHA HYROO POER SETRYGAIETRWCO EHDUARU EU C FT NSREM DIY EESE F O SRIS R UNNASHOR
Notice how the "words" are about the right length and the proportion of vowels to consonants is more realistic. A first order process with the probabilities calculated from the relative frequency of digrams would give:
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONSIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
And here is a 2nd order process based on the relative frequency of trigrams:
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE RETAGIN IS REGOACTIONA OF CRE
While it is possible to continue in this vein to get higher order processes, the computational problem of determining the relative frequencies in English suffers from combinatorial explosion and becomes impractical. We can however get a glimpse of the higher order processes by using words instead of letters as the symbols for the process. Based on the relative frequencies of words in the English language we can get from a word 0th order process:
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE HAD MESSAGES BE THESE
And from a word 1st order process:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHOEVER TOLD THE PROBLEM FOR AN UNEXPECTED
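Approximations of this kind can be produced mechanically. The following is a minimal sketch, not the procedure used for the samples above: it learns the follower statistics from a tiny made-up corpus (an assumption; the published samples rest on tabulated English frequencies), and the names `build_model` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

# Toy corpus (an assumption for illustration; real tables come from large samples).
CORPUS = ("the theory of information treats the source of the message "
          "as a process that emits the symbols of the alphabet ")

def build_model(text, order):
    """Map each run of `order` symbols to the list of symbols seen after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, order, length, start, rng):
    """Emit symbols one at a time, conditioning on the last `order` symbols."""
    out = start
    while len(out) < length:
        context = out[len(out) - order:]   # empty string when order is 0
        followers = model.get(context)
        if not followers:                  # context never seen with a follower
            break
        out += rng.choice(followers)       # repeats in the list act as frequencies
    return out

rng = random.Random(0)  # fixed seed so runs are repeatable
for order in (0, 1, 2):
    model = build_model(CORPUS, order)
    print(order, generate(model, order, 60, CORPUS[:order], rng).upper())
```

Storing every observed follower in a list, duplicates included, means a uniform choice from the list reproduces the relative frequencies without computing them explicitly. Replacing the characters with a list of words gives the word-level processes in the same way.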
The basic frequencies used in the above examples are found in the literature. Letter, digram
and trigram frequencies have been tabulated by cryptologists and can be found for example in
Secret and Urgent by Fletcher Pratt, Blue Ribbon Books, 1939. Word frequencies are
tabulated in Relative Frequency of English Speech Sounds, G. Dewey, Harvard University
Press, 1923. Because the calculation of higher order frequencies is so difficult, many Monte
Carlo methods have been suggested for obtaining higher order processes. Using such a
procedure, the following is obtained from a 3rd order word process:
THE BEST FILM ON TELEVISION TONIGHT IS THERE NO-ONE HERE WHO HAD A LITTLE BIT OF FLUFF
It is thus not a ridiculous approximation to regard a natural language, such as English, as a limit of some succession of Markov sources.
for some positive constant, and by adjusting this constant we may choose any base for the logarithms. Note that while these requirements seem reasonable, there are other sets of equally reasonable requirements that could give more flexibility in the form of this function, and other functions have been used in the literature.
We can now define the entropy of a 0th order Markov process where the probability of the appearance of the symbol i is p_i by:
![[Entropy Definition]](lc1eq3.gif)
Base 2 logarithms are fairly standard practice these days, but the choice is arbitrary. The units of this measure are called bits (not to be confused with the term bit as it is used by computer scientists - although, as we shall see below, in an important special case the two concepts coincide). If natural logarithms had been used we would call the unit a nat. For base 10 (common) logarithms the unit is a hartley (after R. V. L. Hartley, who in 1928 suggested the use of logarithms for the measure of information).
Consider some properties of this function. If one of the probabilities in the sum is 0 then we have introduced an indeterminate form (0 times infinity). This is dealt with either by taking the limit of the term (which is 0) or by restricting the sum to only those events that have positive probability. The function takes its maximum value (for fixed k) iff all the probabilities are equal (try a little calculus), in which case the value of the entropy is log k. The function is always nonnegative and equals zero only in the case that one probability is 1 and the remaining are 0 (the sum of the probabilities must be 1). This just reflects the fact that there is no uncertainty in a sure thing. In the special case that there are just two symbols (say 0 and 1), each with a probability of .5, the entropy of the process is 1 bit. Thus, a bit corresponds to the amount of information in a situation with two equally likely outcomes. It is here that the information theoretic bit and the computer scientist's bit coincide (when the need arises we can call the computer science term a binit), but if the probabilities are changed then a binit will contain less than a bit of information.
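These properties are easy to check numerically. The sketch below assumes the definition referred to above is the usual one, H = -(sum over i of p_i log p_i), and handles zero probabilities by the second convention (summing only over events with positive probability). The function name `entropy` and the sample distributions are illustrative choices.

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a 0th order source; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])              # two equally likely symbols
biased_coin = entropy([0.9, 0.1])            # a binit that carries less than a bit
sure_thing = entropy([1.0, 0.0])             # no uncertainty at all
uniform_27 = entropy([1 / 27] * 27)          # the maximum for k = 27, i.e. log 27
in_nats = entropy([0.5, 0.5], base=math.e)   # the fair coin measured in nats

print(fair_coin, biased_coin, sure_thing, uniform_27, in_nats)
```

Changing `base` only rescales the result, which is the sense in which the choice of logarithm is arbitrary.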
We can use property 4 to extend the definition of entropy to higher order Markov processes. For an mth order process, the probabilities can be computed if we know the previous m outputs. Thus we can calculate the entropy using the above formula for each string of m symbols and then sum these entropies weighted by the probability that that particular string of m symbols appears. This will give us the entropy of the mth order process. A numerical example should make this clear. Suppose that we have a two symbol alphabet (0 and 1) and a 1st order Markov process where the probability of a 0 following a 0 is 1/2 but following a 1 is 1/3. We can calculate from this that the probability of a 0 is 2/5 (and so, for a 1 would be 3/5). Given a 0, the entropy for the next symbol would be
H0 = -( .5 log(.5) + .5 log(.5)) = - ( .5(-1) + .5(-1)) = - (-1) = 1
and given a 1 we have:
H1 = - ( (1/3)log(1/3) + (2/3)log(2/3)) = - ((1/3)(-1.585) + (2/3)(-.585))
= - ( -.528 + -.390) = .918
The entropy for this 1st order process is thus
H = .4 H0 + .6 H1
H = (.4)(1) + (.6)(.918) = .951 bits/letter.
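The same calculation can be carried out in code. This is a sketch under the assumptions of the example only: the variable names are ours, and the probability of a 0 is obtained by solving the balance condition P(0) = P(0)P(0|0) + P(1)P(0|1) for a two symbol source. Evaluated with full floating-point precision, H1 comes out near .918 and H near .951.

```python
import math

def entropy(probs):
    """Entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p0_after_0 = 1 / 2   # probability of a 0 following a 0
p0_after_1 = 1 / 3   # probability of a 0 following a 1

# Long-run probability of a 0, from P(0) = P(0)*p0_after_0 + (1 - P(0))*p0_after_1.
p0 = p0_after_1 / (1 - p0_after_0 + p0_after_1)
p1 = 1 - p0

h0 = entropy([p0_after_0, 1 - p0_after_0])   # entropy of the next symbol given a 0
h1 = entropy([p0_after_1, 1 - p0_after_1])   # entropy of the next symbol given a 1
h = p0 * h0 + p1 * h1                        # weighted by how often each context occurs

print(p0, h0, h1, h)
```

For an mth order process the structure is the same: one entropy per context of m symbols, weighted by the probability of that context.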
For a fixed alphabet, the entropies of higher order processes form a nonincreasing sequence, which being bounded from below (by 0) has a limit. This limit would be the entropy of a natural language being modeled by the limit of Markov processes. Although clearly defined, there is no effective way to use this definition to compute the entropy of, say, English. Various attempts to approximate this entropy have placed its value at about 1 bit per letter.
It should be noted that entropy is not a measure that can be applied to individual messages; it
is a statement about the information rate of a source and so refers to all messages coming
from that source. Also, remember the reciprocal relationship between information and
uncertainty. The lower the entropy, the higher the informational content.
Redundancy = 1 - (H/log k).
With this measure we see that a 0th order process on two letters with equal probabilities (i.e., bit strings) has redundancy 0 (H = 1, log 2 = 1) as we mentioned earlier. English would have a redundancy of about .79 (taking H = 1 and log 27 ~ 4.75), or roughly 75 to 80%. A word of caution about this figure: while it is true that the language can be compressed to roughly 1/4 of its size without loss of meaning, this compression has to be done carefully because of the way redundancy has been built into the language. A simple random removal of 3/4 of a message will not generally leave enough to be comprehensible.
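The redundancy formula is a one-liner. In this sketch the function name `redundancy` is our own, logarithms are taken to base 2 so that H is in bits per letter, and the value H = 1 for English is the rough estimate quoted earlier rather than a computed quantity. With log 27 evaluated exactly (about 4.75), the figure for English comes out near .79.

```python
import math

def redundancy(h, k):
    """Redundancy = 1 - H / log k, with H in bits per symbol and k the alphabet size."""
    return 1 - h / math.log2(k)

bit_strings = redundancy(1.0, 2)    # equiprobable binary source
english = redundancy(1.0, 27)       # H = 1 bit/letter is an estimate, not a measurement

print(bit_strings, english)
```

Because the estimate of H for English is itself only approximate, the second figure should be read as "roughly three quarters" and not to two decimal places.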