6.4 Evaluating the information content of a text

Lecture



Consider the following experiment. Write each of the 32 letters of the Russian alphabet on its own card. After thoroughly shuffling the cards in a box, draw one at random, write down its letter, return the card to the box, shuffle again, draw another card, write down its letter, and so on. Repeating this procedure 30-40 times yields a sequence of letters. The mathematician R. Dobrushin carried out this experiment and obtained the sequence shown in the first row of Table 1.



Table 1

No. | Phrase | Conditions under which the phrase was obtained
1 | SUCHERROBYDA YAHVUI SAIGTLFWNZAGO ENVSHTUR PCHGBKUCHTZHU RYAMCHYHRYS | All letters of the alphabet and the space between words are equally probable
2 | ANNT UIYABA OERV ODG THE EVILTSHA | The probabilities of the individual letters and of the space between words are taken into account
3 | HAVE FUN ABOVE NOT DRY AND NEPO AND KORKO | The probabilities of 4-letter combinations are taken into account
4 | INFORMATION THEORY ALLOWS TO STUDY THIS PROPERTY OF REAL | The actual probabilities of all letter combinations are taken into account


In phrase 1 the letters alternate randomly and chaotically, and the entropy of such a text is high. With the procedure described, the probability of drawing any particular letter is the same, i.e.

W_А = W_Б = ... = W_Я = 1/32

The probability of drawing a blank card (the space between words) is also taken to be 1/32: one space per 32 letters.
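
To make the experiment concrete, here is a minimal Python sketch of it. It assumes one reading of the card set described above: 31 letter cards (with Е/Ё and Ь/Ъ each sharing a card) plus a blank card for the space, 32 equally probable symbols in all.

```python
import random

# 31 letter cards (Е/Ё and Ь/Ъ share a card) plus a blank card for the
# space: 32 equally probable symbols in all -- an assumption about how
# the card set in the text is composed.
ALPHABET = list("АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЫЬЭЮЯ") + [" "]

def draw_phrase(n_draws=40):
    """Draw n_draws cards with replacement; every card is equally likely."""
    return "".join(random.choice(ALPHABET) for _ in range(n_draws))

print(draw_phrase())  # chaotic output, similar to phrase 1 of Table 1
```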

The entropy of the appearance of each successive letter of the text is calculated by Shannon's formula

I = -Σ W_i log₂ W_i ,

where the sum runs over all the symbols (the letters and the space).

If the probabilities of appearance of the letters are all the same, W_А = W_Б = ... = W_Я = 1/32, then the entropy is I = log₂ 32 = 5 bits per character.
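
The 5-bit figure is easy to check directly from Shannon's formula:

```python
import math

# Entropy of one character when all 32 symbols are equally probable:
# I = -sum(W_i * log2(W_i)) over 32 terms of 1/32 each, i.e. log2(32).
n = 32
I = -sum((1 / n) * math.log2(1 / n) for _ in range(n))
print(I)  # 5.0 bits per character
```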

In real texts the letters and the space occur with different frequencies. Table 2 gives the frequencies W_i of the letters of the Russian language. Because different letters appear with different probabilities in real texts, their entropy is lower than in the first experiment. For the second experiment the box holds not 32 cards but more: the number of cards for each letter is proportional to its probability. For example, for every card with the letter Ф (W_Ф = 0.002) there are 45 cards with the letter О (W_О = 0.090). Then, as in the first experiment, cards are drawn and returned. The result is phrase 2 (Table 1), which is noticeably more ordered.



Table 2

Frequencies W_i of the letters of the Russian language

Space 0.175 | Р 0.040 | Я 0.018    | Х 0.009
О 0.090     | В 0.038 | Ы 0.016    | Ж 0.007
Е, Ё 0.072  | Л 0.035 | З 0.016    | Ш 0.006
А 0.062     | К 0.028 | Ь, Ъ 0.014 | Ю 0.006
И 0.062     | М 0.026 | Б 0.014    | Ц 0.003
Т 0.053     | Д 0.025 | Г 0.013    | Щ 0.003
Н 0.053     | П 0.023 | Ч 0.012    | Э 0.003
С 0.045     | У 0.021 | Й 0.010    | Ф 0.002
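
The second experiment can be simulated the same way by weighting each card by its Table 2 frequency. A minimal sketch (Е stands for the merged Е/Ё card and Ь for the merged Ь/Ъ card; the frequencies sum to 1.000):

```python
import random

# Letter frequencies from Table 2: the space plus 31 letter entries
# (Е covers Е/Ё, Ь covers Ь/Ъ).
FREQ = {
    " ": 0.175, "О": 0.090, "Е": 0.072, "А": 0.062, "И": 0.062,
    "Т": 0.053, "Н": 0.053, "С": 0.045, "Р": 0.040, "В": 0.038,
    "Л": 0.035, "К": 0.028, "М": 0.026, "Д": 0.025, "П": 0.023,
    "У": 0.021, "Я": 0.018, "Ы": 0.016, "З": 0.016, "Ь": 0.014,
    "Б": 0.014, "Г": 0.013, "Ч": 0.012, "Й": 0.010, "Х": 0.009,
    "Ж": 0.007, "Ш": 0.006, "Ю": 0.006, "Ц": 0.003, "Щ": 0.003,
    "Э": 0.003, "Ф": 0.002,
}

def draw_weighted_phrase(n_draws=40):
    """Draw characters with probabilities proportional to Table 2."""
    return "".join(random.choices(list(FREQ), weights=list(FREQ.values()),
                                  k=n_draws))

print(draw_weighted_phrase())  # more ordered, like phrase 2 of Table 1
```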


First, the absurdly long words have disappeared from the text.

Second, in phrase 2 vowels and consonants alternate more evenly; nevertheless, much of it cannot even be pronounced, let alone understood.

Substituting the probabilities of the individual letters into Shannon's formula gives

I₁ = -0.175 log₂ 0.175 - 0.090 log₂ 0.090 - ... - 0.002 log₂ 0.002 ≈ 4.35 bits.

The amount of information per letter of the message has decreased from 5 to 4.35 bits, because we now have prior knowledge of the frequencies with which the letters occur.
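
The 4.35-bit value can likewise be verified numerically from the Table 2 frequencies:

```python
import math

# The 32 frequencies from Table 2 (space plus 31 letter entries).
W = [0.175, 0.090, 0.072, 0.062, 0.062, 0.053, 0.053, 0.045,
     0.040, 0.038, 0.035, 0.028, 0.026, 0.025, 0.023, 0.021,
     0.018, 0.016, 0.016, 0.014, 0.014, 0.013, 0.012, 0.010,
     0.009, 0.007, 0.006, 0.006, 0.003, 0.003, 0.003, 0.002]

I1 = -sum(w * math.log2(w) for w in W)
print(round(I1, 2))  # 4.35 bits per character
```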

The language also has frequency dictionaries that record not only the frequencies of individual letters but also those of their combinations (pairs, triples, and so on). If the probabilities of 4-letter combinations in Russian text are taken into account, we obtain phrase 3 (Table 1).

As longer and longer correlations are taken into account, the resulting “texts” come to resemble Russian more and more closely, but they are still far from meaningful.
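
Dobrushin's later phrases relied on published statistics of letter combinations, which this text does not reproduce, but the same idea can be sketched by estimating those statistics from any long Russian text: sample each next letter from the letters observed after the preceding 3-letter context, which amounts to using 4-letter combination statistics. The file name corpus.txt below is a hypothetical placeholder.

```python
import random
from collections import defaultdict

def train(text, k=3):
    """Record, for every k-letter context, the letters observed after it."""
    followers = defaultdict(list)
    for i in range(len(text) - k):
        followers[text[i:i + k]].append(text[i + k])
    return followers

def generate(followers, seed, n_chars=60):
    """Extend the seed by sampling each next letter given its k-letter context."""
    k = len(seed)
    out = seed
    for _ in range(n_chars):
        candidates = followers.get(out[-k:])
        if not candidates:   # context never seen in the corpus: stop early
            break
        out += random.choice(candidates)
    return out

# corpus.txt is a placeholder for any sufficiently long Russian text.
corpus = open("corpus.txt", encoding="utf-8").read()
print(generate(train(corpus, k=3), seed=corpus[:3]))
```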


