Speech synthesis

Lecture



Speech synthesis is, in the broad sense, the restoration of the shape of a speech signal from its parameters [1]; in the narrow sense, the generation of a speech signal from printed text. It is a branch of artificial intelligence.

More generally, speech synthesis refers to everything related to the artificial production of human speech.

A speech synthesizer is a software or hardware system capable of converting text into speech.

The voice engine is the core that actually converts text or commands into speech; it can also exist independently of a computer.

Content

  • 1 Application of speech synthesis
  • 2 Methods of speech synthesis
    • 2.1 Parametric synthesis
    • 2.2 Compilation synthesis
    • 2.3 Full speech synthesis by the rules
    • 2.4 Subject-oriented synthesis
  • 3 History
  • 4 Present and Future
  • 5 See also
  • 6 Notes
  • 7 Literature
  • 8 References

Application of speech synthesis

Speech synthesis may be required in all cases where the recipient of information is a person. The quality of a speech synthesizer is judged primarily by its similarity to the human voice and by its intelligibility. These properties allow people with visual or reading impairments to listen to written text on a home computer. The simplest synthesized speech can be created by concatenating fragments of recorded speech stored in a database; we encounter this method of synthesis everywhere, often without even noticing it.

  • Synthesis of speech from text or message codes can be used in information and reference systems, as an aid for blind and mute people, and for voice communication from machine to human.
  • In announcements of train departures and the like.
  • For reporting information about technological processes: in military and aerospace equipment, in robotics, and in acoustic human-computer dialogue.
  • As a sound effect, often used in the creation of electronic music.

Speech synthesis methods

All methods of speech synthesis can be divided into the following groups: [2]

  • parametric synthesis;
  • concatenative, or compilation (compilative) synthesis;
  • synthesis by the rules;
  • subject-oriented synthesis.

Parametric synthesis

Parametric speech synthesis is the final operation in vocoder systems, where the speech signal is represented by a small number of continuously changing parameters. Parametric synthesis is advisable in cases where the set of messages is limited and does not change too often. An advantage of this method is that speech can be recorded for any language and any speaker. The quality of parametric synthesis can be very high (depending on the degree of compression in the parametric representation). However, parametric synthesis cannot be used for arbitrary, previously undefined messages.
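As a toy illustration of the vocoder idea, the sketch below reduces a signal to one (amplitude, dominant frequency) pair per frame and then rebuilds an audible signal from those parameters alone. The frame size, the one-sinusoid-per-frame model, and the helper names are illustrative assumptions; a real vocoder stores a much richer parameter set:

```python
import numpy as np

def analyze(signal, frame_len=200):
    """Reduce a signal to a compact parametric description:
    one (amplitude, dominant frequency bin) pair per frame."""
    params = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        amp = np.sqrt(np.mean(frame ** 2))           # RMS amplitude
        spectrum = np.abs(np.fft.rfft(frame))
        freq_bin = int(np.argmax(spectrum[1:])) + 1  # skip the DC bin
        params.append((amp, freq_bin))
    return params

def synthesize(params, frame_len=200, rate=8000):
    """Restore an audible signal from the stored parameters only."""
    out = []
    phase = 0.0
    for amp, freq_bin in params:
        freq = freq_bin * rate / frame_len           # bin -> Hz
        t = np.arange(frame_len)
        out.append(np.sqrt(2) * amp * np.sin(phase + 2 * np.pi * freq * t / rate))
        phase += 2 * np.pi * freq * frame_len / rate  # keep phase continuous
    return np.concatenate(out)

# A 440 Hz tone survives the round trip through its parameters.
rate, frame_len = 8000, 200
t = np.arange(rate) / rate
original = np.sin(2 * np.pi * 440 * t)
restored = synthesize(analyze(original, frame_len), frame_len, rate)
```

Only 40 (amplitude, bin) pairs are stored instead of 8000 samples, which is exactly the compression-versus-quality trade-off the paragraph above describes.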

Compilation synthesis

Compilation synthesis reduces to composing a message from a previously recorded dictionary of synthesis elements. The size of a synthesis element is no smaller than a word, so the content of synthesized messages is obviously fixed by the volume of the dictionary; as a rule, the dictionary does not exceed several hundred words. The main problem in compilation synthesis is the storage capacity required for the dictionary, so a variety of speech compression and coding methods are used. Compilative synthesis has wide practical application. In Western countries, a variety of devices (from military aircraft to home appliances) are equipped with voice response systems. Until recently, voice response systems in Russia were used mainly in military equipment; now they are increasingly used in everyday life, for example in the information services of cellular operators for reporting a subscriber's account balance.
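A minimal sketch of the idea, assuming a toy dictionary whose "recordings" are stand-in sample lists rather than real compressed audio:

```python
# Pre-recorded dictionary: each word maps to its stored waveform
# (tiny stand-in arrays here; a real system stores compressed audio).
dictionary = {
    "train": [0.1, 0.3, 0.2],
    "number": [0.4, 0.1],
    "five": [0.2, 0.5, 0.3],
    "departs": [0.6, 0.2, 0.1],
}

def compile_message(words, dictionary):
    """Compose a message by concatenating pre-recorded word units.
    Any word outside the dictionary simply cannot be synthesized."""
    samples = []
    for word in words:
        if word not in dictionary:
            raise KeyError(f"'{word}' is not in the recorded dictionary")
        samples.extend(dictionary[word])
    return samples

message = compile_message(["train", "number", "five", "departs"], dictionary)
```

The `KeyError` branch reflects the limitation stated above: the set of possible messages is fixed by the dictionary.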

Full speech synthesis by the rules

Full speech synthesis by rules (or synthesis from printed text) provides control of all parameters of the speech signal and can therefore generate speech from text that is not known in advance. In this case, the parameters obtained by analysis of the speech signal are stored in memory, as are the rules for joining sounds into words and phrases. Synthesis is implemented by modeling the vocal tract using analog or digital technology. In the process of synthesis, the parameter values and the rules for joining phonemes are applied sequentially at a fixed time step, for example every 5-10 ms. The method of synthesizing speech from printed text (synthesis by rules) is based on programmed knowledge of acoustic and linguistic constraints and does not directly use elements of human speech. Systems based on this method take one of two approaches. The first approach aims to build a model of the human speech-production system and is known as articulatory synthesis. The second approach is formant synthesis by rules. The intelligibility and naturalness of such synthesizers can be brought up to values comparable to the characteristics of natural speech.
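Formant synthesis by rules can be sketched as a cascade of second-order resonators, one per formant, excited by a glottal impulse train. The sampling rate, the resonator coefficients, and the formant table below are illustrative assumptions rather than values from the text:

```python
import numpy as np

RATE = 8000  # assumed sampling rate, Hz

def resonator(x, freq, bw):
    """One formant as a second-order IIR resonator:
    y[n] = g*x[n] + c1*y[n-1] + c2*y[n-2], normalized to unity gain at 0 Hz."""
    r = np.exp(-np.pi * bw / RATE)              # pole radius from bandwidth
    c1 = 2 * r * np.cos(2 * np.pi * freq / RATE)
    c2 = -r * r
    g = 1 - c1 - c2
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n, xn in enumerate(x):
        y[n] = g * xn + c1 * y1 + c2 * y2
        y2, y1 = y1, y[n]
    return y

def synthesize_vowel(formants, f0=120, duration=0.3):
    """Drive a cascade of formant resonators with an impulse train
    (a crude stand-in for the glottal source)."""
    n = int(RATE * duration)
    source = np.zeros(n)
    source[:: RATE // f0] = 1.0                 # one impulse per pitch period
    signal = source
    for freq, bw in formants:
        signal = resonator(signal, freq, bw)
    return signal

# Illustrative rule-table entry: first two formants of a vowel like /a/.
vowel_a = synthesize_vowel([(700, 110), (1220, 120)])
```

A full rule system would update the formant targets every 5-10 ms, as the paragraph above describes, instead of holding them fixed for the whole vowel.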

Speech synthesis by rules using pre-stored segments of natural speech is a type of synthesis by rules that became widespread with the emergence of the ability to manipulate the speech signal in digitized form. Depending on the size of the initial synthesis elements, the following types of synthesis are distinguished:

  • microsegment (micro-wave);
  • allophonic;
  • diphone;
  • semi-syllabic;
  • syllabic;
  • synthesis from units of arbitrary size.

Usually semi-syllables are used as such elements: segments containing half of a consonant and half of an adjacent vowel. This makes it possible to synthesize speech from arbitrary text, but it is difficult to control intonation characteristics. The quality of such synthesis does not match that of natural speech, since distortions often arise at the boundaries where diphones are stitched together. Compiling speech from pre-recorded word forms also does not solve the problem of high-quality synthesis of arbitrary messages, since the acoustic and prosodic (duration and intonation) characteristics of words vary depending on the type of phrase and the position of the word within it. This remains true even when large amounts of memory are used to store word forms.
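The stitching distortions mentioned above are commonly softened by cross-fading the units at each seam. The sketch below joins two hypothetical diphones this way; the tiny inventory and the four-sample overlap are illustrative assumptions, not real recorded audio:

```python
import numpy as np

def crossfade_join(a, b, overlap=4):
    """Join two recorded segments, cross-fading `overlap` samples
    at the seam to soften the boundary discontinuity."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    seam = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], seam, b[overlap:]])

# Hypothetical diphone inventory: each unit spans the transition
# between two sounds (tiny stand-in arrays, not real audio).
diphones = {
    ("k", "a"): np.array([0.0, 0.2, 0.5, 0.4, 0.3, 0.3]),
    ("a", "t"): np.array([0.3, 0.3, 0.2, 0.1, 0.0, 0.0]),
}

def synthesize(phones):
    """Concatenate the diphone for each adjacent pair of phones."""
    out = diphones[(phones[0], phones[1])]
    for left, right in zip(phones[1:], phones[2:]):
        out = crossfade_join(out, diphones[(left, right)])
    return out

word = synthesize(["k", "a", "t"])  # "cat" built from two diphones
```

Because each unit is cut mid-sound, the seams fall in the relatively steady centers of sounds rather than at the fast transitions between them, which is the main motivation for diphone inventories.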

Subject-oriented synthesis

Subject-oriented synthesis compiles pre-recorded words and phrases into complete voice messages. It is used in applications where the variety of texts is limited to a specific topic or domain, such as train departure announcements and weather forecasts. The technology is simple and has long been used commercially, for example in electronic devices such as talking watches and calculators. The naturalness of these systems can potentially be high, because the variety of sentence types is limited and closely matches the intonation of the original recordings. But since such systems are limited to the words and phrases in their database, they cannot be applied broadly: they can synthesize only the combinations of words and phrases for which they were programmed.
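Such a system can be sketched as template filling over a fixed phrase inventory. The file names and the announcement template below are hypothetical stand-ins for actual recordings:

```python
# Fixed phrase inventory recorded in advance for one narrow domain
# (train announcements); the .wav names are illustrative placeholders.
phrases = {
    "intro": "attention_please.wav",
    "train": "the_train_to.wav",
    "platform": "departs_from_platform.wav",
}
destinations = {"Boston": "boston.wav", "Chicago": "chicago.wav"}
platforms = {3: "three.wav", 7: "seven.wav"}

def announce(destination, platform):
    """Assemble one complete message from whole pre-recorded phrases.
    Anything outside the recorded inventory simply cannot be said."""
    return [
        phrases["intro"],
        phrases["train"],
        destinations[destination],
        phrases["platform"],
        platforms[platform],
    ]

# The playlist of recordings to play back, in order.
playlist = announce("Boston", 3)
```

Because whole phrases are recorded with their natural intonation, the result sounds fluent within the domain, which is exactly the trade-off described above.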

History

At the end of the 18th century, the Danish scientist Christian Kratzenstein, a full member of the Russian Academy of Sciences, created a model of the human vocal tract capable of producing five long vowel sounds (a, e, i, o, u). The model was a system of acoustic resonators of various shapes, which emitted vowel sounds with the help of vibrating reeds excited by an air flow. In 1778, the Austrian scientist Wolfgang von Kempelen supplemented Kratzenstein's model with models of the tongue and lips and introduced an acoustic-mechanical speaking machine capable of reproducing certain sounds and their combinations. Hissing and whistling sounds were produced with the help of a special hand-operated bellows. In 1837, Charles Wheatstone presented an improved version of the machine, capable of reproducing vowels and most consonant sounds. And in 1846, Joseph Faber demonstrated his speaking organ Euphonia, in which an attempt was made to synthesize not only speech but also singing.

At the end of the 19th century, Alexander Graham Bell created his own "talking" mechanical model, very similar in design to Wheatstone's machine. With the advent of the 20th century the era of electrical machines began, and scientists were able to use sound-wave generators and build algorithmic models on their basis.

In the 1930s, Bell Labs employee Homer Dudley, working on ways to reduce the bandwidth required in telephony in order to increase its transmission capacity, developed the VOCODER (short for "voice coder"), a keyboard-controlled electronic analyzer and synthesizer of speech. Dudley's idea was to analyze the voice signal, decompose it into parts, and re-synthesize it in a form requiring less bandwidth. An improved version of Dudley's vocoder, the VODER, was presented at the 1939 New York World's Fair [3].

The first speech synthesizers sounded rather unnatural, and the phrases they produced were often barely intelligible. However, the quality of synthesized speech has constantly improved, and speech generated by modern synthesis systems is sometimes indistinguishable from real human speech. Despite the successes of electronic synthesizers, research into mechanical speech synthesizers is still being conducted, for example for use in humanoid robots. [4]

The first speech synthesis systems based on computer technology began to appear in the late 1950s, and the first text-to-speech synthesizer was created in 1968.

Present and future

So far it is too early to speak of a promising future for synthesis by rules in the coming decades: the sound still resembles the speech of a robot more than anything else, and in places the speech is difficult to understand. What we can reliably determine is whether a synthesizer speaks in a male or a female voice, yet we still miss the subtleties inherent in the human voice. Development has therefore partially turned away from full construction of synthetic speech signals and continues to rely on the simplest segmentation of recorded voice.

created: 2014-08-25
updated: 2021-11-28


