Natural language understanding systems

Lecture



Introduction

For a long time, the process of communicating with a machine remained the preserve of specialists and was inaccessible to mere mortals. Those "mere mortals", who, strictly speaking, were the actual consumers of computer services, often never saw the computer itself but communicated with the machine through an intermediary, the programmer. At the early stages of the development of computing technology, the computer interface necessarily included a human specialist as a mandatory element (in our country this situation persisted in places until the early nineties, which is why many offices still have the habit of calling anyone able to tell a couple of keys apart on the keyboard a programmer). This, of course, did not suit consumers very much. Now, if only one could talk to the computer directly...

Prerequisites for the emergence of natural language understanding systems


Few people know how a person communicated with the first computers. It happened like this: using wires with connectors at the ends, the operator connected the flip-flops (of which, in fact, the machine was assembled) in such a way that the required sequence of commands was executed at startup. Outwardly it looked very much like the manipulations of telephone operators at the turn of the century, and in fact it was highly skilled work. One could say that programming was carried out not even in machine commands but at the hardware level. Then the task was simplified: the sequence of necessary commands was written directly into the machine's memory, and more productive devices came into use for entering information. At first these were banks of toggle switches; by flipping them, the operator (or the programmer, since at the time these terms meant the same thing) could type the necessary command and enter it into the machine's memory. Then came punched cards, and after them punched tape. The speed of communication with the machine increased, and the number of input errors dropped dramatically. But the essence and character of this communication did not change.

The opportunity to communicate with the machine directly first appeared on the so-called small machines. My acquaintance with an interactive interface left an indelible impression: it was a monstrous product of Soviet industry with the poetic name "Nairi". Back then, the outlandish ability to type a command addressed directly to the machine and get a meaningful response seemed a miracle, especially if, until then, the whole process of communicating with the machine amounted to handing a deck of punched cards to a laboratory technician, only to get the deck back a couple of days later with the comment: "You have an error here, the program did not run." To users of that kind, the meager dialogue mode of the command line seemed the height of perfection, and it is to this mode that first small computers, and then personal computers, owe much of their triumphant march. Any consumer of computer services could, without going into technical details and having learned only a couple of dozen operating system commands, communicate with a computer without intermediaries. It was then that the concept of the "user" first appeared, and history attributes the rise and flowering of many computer companies, such as DEC, to the emergence of the dialogue mode.

And then His Majesty the graphical interface appeared: no knowledge of any commands was needed at all, and the user began to communicate with his iron friend in an intuitive sign language. The specter of a voice interface loomed on the horizon...

Understanding in dialogue


Be that as it may, the search continues for an interface that would suit everyone, and the speech interface now lays claim to this role. As a matter of fact, this is exactly how humankind has always wanted to communicate with the computer: even in the era of punched cards, people in science fiction novels simply talked to the computer as an equal. It was then, in the era of punched cards, or even earlier, that the first steps toward implementing a speech interface were taken. Work in this direction was under way at a time when no one had even thought of a graphical interface. In a relatively short period an extensive theoretical basis was developed, and practical achievements were limited only by the performance of the computing equipment. Researchers have not made much progress over the past decades, which forces some experts to be extremely skeptical about the very possibility of implementing a speech interface in the near future. Others believe that the problem is almost solved. It all depends, however, on what should be considered a solution to this problem.

Constructing a speech interface breaks down into three components. The first task is to enable the computer to "understand" what a person says to it, that is, to extract useful information from a person's speech. At the current stage this task is reduced to extracting its semantic part, the text (understanding components such as, say, intonation is not considered at all). In other words, this task amounts to replacing the keyboard with a microphone.

The second task is to make the computer understand the meaning of what was said. As long as a voice message consists of a standard set of computer-readable commands (for example, duplicating menu items), there is nothing difficult about implementing it. However, such an approach is unlikely to be more convenient than entering the same commands from the keyboard or with the mouse. It is perhaps even more convenient simply to click on an application icon than to pronounce distinctly (disturbing others in the process): "Start! Main menu! Word!" Ideally, the computer should correctly interpret a person's natural speech and understand that, for example, the phrases "Enough!" and "Finish the job!" mean different things in one situation and the same thing in another.

The third task is to enable the computer to convert the information it operates with into a speech message understandable to a person. So far, a final solution exists only for this third task. In essence, speech synthesis is a purely mathematical problem that is currently solved at a fairly good level, and in the near future most likely only its technical implementation will be improved. There are already all sorts of programs for reading text files aloud and for voicing dialog boxes and menu items, and I can testify that they cope with generating legible spoken messages without problems.

For the first problem there is no final solution: no one yet knows how to segment our speech so as to extract the components that carry meaning. In the audio stream we produce during a conversation, neither individual letters nor syllables can be distinguished: even seemingly identical letters and syllables look different on spectrograms in different words. Nevertheless, many firms already have their own methods (alas, carefully concealed) that more or less allow this problem to be solved. In any case, after preliminary training, modern speech recognition systems work quite tolerably and make no more mistakes than optical character recognition systems for printed text did five or seven years ago.

As for the second task, according to most experts it cannot be solved without the help of artificial intelligence systems. High hopes are pinned on the emergence of so-called quantum computers; if such devices appear, it will mean a qualitative change in computing technologies. For now, therefore, the role of the speech interface is reduced to duplicating by voice the commands that can be entered from the keyboard or with the mouse, and here its advantages are doubtful. There is, however, one area that many people may find very attractive: speech input of text into a computer. Indeed, rather than tapping at the keyboard, it is much more convenient to dictate everything to the computer so that it records what it hears in a text file. This does not require the computer to comprehend what it has heard, and the problem of converting speech into text is more or less solved. Not without reason, most of the speech interface programs currently being released are focused on speech input. Although there is room for skepticism here too: if you read aloud, pronouncing the words clearly, with pauses, monotonously, as a speech recognition system requires, dictating a page will take about the same five minutes it would take me to type it.

It is difficult to write about the speech interface. On the one hand, the topic is anything but new; on the other, the active development and application of this technology is only just beginning (again). On the one hand, stable stereotypes and prejudices have formed; on the other, despite almost half a century of persistent effort, the conceptual issues that confronted the pioneers of speech input have still not been resolved. The first question, and perhaps the main one, concerns the scope of application. Finding applications where speech recognition could demonstrate all its merits is, contrary to established opinion, not a trivial task. The current practice of using computers does not at all favor the widespread introduction of a speech interface: the modern computer industry has developed under the banner of the graphical interface, to which, within the circle of tasks solved by computers today, there is no alternative.

Mass applications - CAD, office and publishing packages, DBMS - make up the bulk of the intellectual content of computers and, in their current form, leave very little room for alternative models of the user interface, including speech. For giving commands related to positioning in space, people have always used and will always use gestures, that is, the "hands-eyes" system, and it is on this principle that the modern graphical interface is built. There is absolutely no prospect of replacing the keyboard and mouse with a speech recognition unit. At the same time, the gain from handing over part of the control functions to speech is so small that it has not provided sufficient grounds even for trial deployment in mainstream computers in more than thirty years, which is how long commercially applicable speech recognition systems have existed.

Today it is not customary among the leading manufacturers of speech recognition systems to pay tribute to the achievements of researchers of past years. The reason is clear: doing so would not only noticeably diminish the apparent progress they have achieved, but would also raise well-founded doubts about the prospects of the approaches being pursued in general. To evaluate the progress of speech recognition technology objectively, compare the characteristics of the systems implemented within the research project by 1976 with those of the systems now being promoted to the market. Two questions arise: why did the developments of twenty years ago not find a worthy application, and why, over such a long period, has there been no visible qualitative shift in the characteristics of specific systems?

The answer to the first question has been partly stated above: the main problem lies in the area of application. One can add that, contrary to the opinion persistently imposed today for marketing purposes (in particular, to promote MMX processors), the high demands of this technology on computing resources were not the main obstacle to its widespread adoption. When graphics developers faced similar problems, the result was the creation and mass use of graphics hardware accelerators rather than the abandonment of the window interface; at the same time, the speech adapters that were developed do not exceed graphics adapters in cost.

The answer to the second question is directly related to the first. A technology that is not used cannot feed itself and ensure its own growth. In addition, it is possible that the orientation of most research centers toward increasing the recognizable vocabulary is mistaken both from the point of view of applicability and from the point of view of scientific prospects. Back in 1969, in his famous letter to the editor of the Journal of the Acoustical Society of America, J. Pierce, an employee of Bell Laboratories, pointed out the lack of clear progress at that time, and the impossibility of such progress in speech recognition technology in the near future, owing to the inability of computers to analyze the syntactic, semantic and pragmatic information contained in an utterance. The existing barrier can be overcome only with the development of artificial intelligence systems, a direction that ran into the barrier of complexity in the 1970s and is now largely in oblivion. It is difficult to hope for further improvement in the characteristics of speech input devices, given that already in the 1970s their ability to recognize speech sounds was superior to that of humans. This was confirmed by a series of experiments comparing how confidently humans and computers recognized words of a foreign language and meaningless chains of sounds: deprived of the possibility of connecting pragmatic, semantic and other analyzers, the human clearly loses.

To illustrate the above, perhaps somewhat controversial, statements, consider the prospects and main problems of the speech-based text input systems that have been actively promoted recently. For comparison: spontaneous speech is pronounced at an average rate of 2.5 words per second, professional typing runs at 2 words per second, and non-professional typing at 0.4. Thus, at first glance, speech input has a significant advantage in performance. However, the average dictation rate in real conditions drops to a fraction of a word per second because of the need to pronounce the words distinctly during speech input and the fairly high percentage of recognition errors that have to be corrected.

The speech interface is natural for humans and promises additional convenience when entering text. However, even a professional announcer would hardly be pleased by the prospect of dictating for several hours to a deaf and dumb (we will come back to this) computer. In addition, existing experience with operating such systems indicates a high probability of diseases of the operator's vocal cords, caused by the monotony of speech that is unavoidable when dictating to a computer.

The absence of any need for prior training is often listed among the advantages of speech input. However, one of the weakest points of modern speech recognition systems, their sensitivity to clarity of pronunciation, takes away this seemingly obvious advantage. An operator learns to type on the keyboard in one to two months on average; developing correct pronunciation can take several years. In addition, the extra strain resulting from conscious and subconscious efforts to achieve higher recognizability does not help maintain the normal working mode of the operator's speech apparatus and significantly increases the risk of specific diseases.

There is another unpleasant limitation on applicability, one deliberately not mentioned, in my opinion, by the creators of speech input systems. An operator who interacts with a computer through a speech interface is forced to work in a separate soundproofed room or to use a soundproof helmet. Otherwise he will interfere with the work of his office neighbors, who in turn, by creating additional background noise, will significantly complicate the work of the speech recognizer. Thus, the speech interface is in clear contradiction with the modern organizational structure of enterprises oriented toward collective work. The situation is somewhat mitigated by the development of remote forms of work, but for quite a long time yet this most natural, productive and potentially mass form of user interface for a person is doomed to a narrow range of applications. The limitations on the applicability of speech recognition systems in the most popular traditional applications lead to the conclusion that promising speech interface applications must be sought outside the traditional office environment, as evidenced by the commercial success of narrowly specialized speech systems.

Examples of natural language processing systems


The most successful commercial application of speech recognition to date is the AT&T telephone network. The client can request one of five categories of service using any words; he speaks until one of the five key words occurs in his utterance. This system currently serves about a billion calls a year.

This conclusion conflicts with well-established stereotypes and expectations. Instead of concentrating on applications such as aids for disabled people or telephone and information services, the leading developers of speech recognition are stepping up efforts toward universalization and increasing the size of the vocabulary, even at the expense of shortening the speaker pre-tuning procedure. Meanwhile, it is precisely these applications that place very low demands on the size of the recognized vocabulary while imposing severe restrictions on pre-tuning. Moreover, the recognition of spontaneous continuous speech has been practically marking time since the 1970s because of the computer's inability to analyze effectively the non-acoustic characteristics of speech.

Even Bill Gates, who is in a sense the ideal of pragmatism, turned out not to be free of historical stereotypes. Having started in 1995-96 the development of his own universal speech recognition system, he proclaimed the coming era of the widespread introduction of the speech interface. Speech tools are planned for inclusion in the standard delivery of the new version, a purely office operating system. At the same time, the head of Microsoft stubbornly repeats that it will soon be possible to forget about the keyboard and mouse. Does he perhaps plan to sell, along with the Windows NT box, acoustic helmets like those used by military pilots and Formula 1 drivers? And is Microsoft going to stop releasing Word, Excel and the like in the near future? It is more than difficult to control the graphic objects on the screen by voice without being able to help with your hands.

When talking about the speech interface, people often focus on speech recognition and forget about its other side, speech synthesis. The recent rapid development of event-driven systems has played a major role in this skew, largely suppressing the attitude toward the computer as an active side of the dialogue. Not so long ago (about thirty years ago), speech recognition and synthesis subsystems were considered parts of a single speech interface complex. However, interest in synthesis faded fairly quickly. Firstly, developers encountered not even a tenth of the difficulties there that they faced when creating recognition systems; secondly, unlike recognition, speech synthesis does not demonstrate significant advantages over other means of outputting information from a computer. Almost all of its value lies in complementing speech input, since for a person it is dialogue, not monologue, that is natural and familiar. As a result of underestimating the need for a voice response, we get increased operator fatigue, monotony of speech and limited applicability of the speech interface. How can a computer equipped with a speech recognizer help a blind person if it lacks a non-visual feedback device?

The fact that a person involuntarily adjusts his voice to the voice of his interlocutor is widely known. Why not use this human ability to increase the accuracy of computer speech recognition, correcting the operator's pronunciation by means of two-way dialogue? In addition, it is quite possible that properly organized and modulated synthesis can significantly reduce the operator's risk of the diseases associated with monotonous speech and additional strain. The ubiquitous penetration of the graphical user interface was ensured by the joint use of a graphic monitor as the means of displaying graphical information and a mouse as the means of input, and, not least, thanks to Xerox's ingenious conceptual discoveries in the field of the window interface.

Speech voicing methods
Now let us say a few words about the most common voicing methods, that is, about the ways of obtaining the information that controls the parameters of the generated sound signal and of forming the sound signal itself. The broadest division of the strategies used in speech voicing is into approaches aimed at building a working model of the human speech-producing system and approaches whose goal is to model the acoustic signal as such. The first approach is known as articulatory synthesis. The second approach looks simpler today, so it is much better studied and more successful in practice. Within it, two main directions are distinguished: formant synthesis by rule and compilation (concatenative) synthesis.

Formant synthesizers use an excitation signal that passes through a digital filter built on several resonances similar to the resonances of the vocal tract. The separation of the excitation signal and the transfer function of the vocal tract is the basis of the classical acoustic theory of speech production. Compilation synthesis is carried out by gluing together the required compilation units from an available inventory; many systems have been built on this principle, using different types of units and different methods of compiling the inventory. In such systems it is necessary to apply signal processing that brings the fundamental frequency, the energy and the duration of the units to the values that should characterize the synthesized speech, and the signal processing algorithm must also smooth out discontinuities in the formant (and, more generally, spectral) structure at segment boundaries.

Compilation synthesis systems use two different types of signal processing algorithms: LP (Linear Prediction) and PSOLA (Pitch Synchronous Overlap and Add). LP synthesis relies to a large extent on the acoustic theory of speech production, unlike PSOLA synthesis, which operates by simply splitting the sound wave that makes up a compilation unit into time windows and transforming them. PSOLA algorithms make it possible to preserve the naturalness of the sound well when the original sound wave is modified.
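
To make the window-and-recombine idea behind the PSOLA family more concrete, here is a deliberately simplified, hypothetical Python sketch of pitch-synchronous overlap-add pitch modification. It is not the algorithm of any product mentioned in this lecture; the handling of pitch marks and grain placement is reduced to the bare minimum (a single average period is used instead of local periods).

```python
import numpy as np

def psola_pitch_shift(x, pitch_marks, factor):
    """Toy PSOLA-style pitch shift over a voiced segment (sketch only).

    x           : 1-D float array with the speech samples
    pitch_marks : ascending sample indices of glottal pulses
    factor      : >1 raises pitch, <1 lowers it; duration stays roughly the same
    """
    pitch_marks = np.asarray(pitch_marks)
    mean_period = int(np.diff(pitch_marks).mean())      # average pitch period, in samples

    # synthesis marks cover the same time span, but with spacing divided by `factor`
    synth_marks = np.arange(pitch_marks[0], pitch_marks[-1],
                            mean_period / factor).astype(int)

    y = np.zeros(len(x))
    for sm in synth_marks:
        # the nearest analysis mark supplies a two-period, Hann-windowed grain
        am = int(pitch_marks[np.argmin(np.abs(pitch_marks - sm))])
        lo, hi = max(am - mean_period, 0), min(am + mean_period, len(x))
        grain = x[lo:hi] * np.hanning(hi - lo)
        start = max(sm - (am - lo), 0)                   # keep the grain roughly centred on sm
        y[start:start + len(grain)] += grain[:len(y) - start]
    return y
```

In a real compilation synthesizer the pitch marks come from the annotation of the allophone inventory, local periods are used instead of the mean, and the same mechanism also serves to stretch or compress durations.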

The most common speech synthesis systems


The most common speech synthesis systems today are obviously those supplied with sound cards. If your computer is equipped with one of them, there is a significant chance that a speech synthesis system is installed on it as well - alas, not for Russian but for English, or more precisely its American version. Most original Sound Blaster sound cards come with Creative's TextAssist system, and First Byte's Monologue software is often included with sound cards from other manufacturers.

TextAssist is an implementation of a formant synthesizer by rule and is based on the DECtalk system, developed by Digital Equipment Corporation with the participation of the famous American phonetician Dennis Klatt (who, unfortunately, passed away early). DECtalk is still something of a quality standard for the synthesis of American English speech. Creative Technology offers developers the use of TextAssist in their programs via a special TextAssist API. Supported operating systems are MS Windows and Windows 95; for Windows NT there is also a version of the DECtalk system, originally created for Digital Unix. A new version of TextAssist, announced by Associative Computing, Inc. and developed using DECtalk and Creative technologies, is at the same time a multilingual synthesis system supporting English, German, Spanish and French. This is ensured primarily by the use of the corresponding linguistic modules from Lernout & Hauspie Speech Products, a recognized leader in the support of multilingual speech technologies. The new version will have a built-in dictionary editor, as well as a specialized TextReader facility with push-button control of the synthesizer in different text-reading modes.

The Monologue program, designed to voice text placed on the MS Windows clipboard, uses the ProVoice engine. ProVoice is a compilation synthesizer that uses an optimal choice of speech compression mode and preserves the boundary sections between sounds, a variety of TD-PSOLA. It is designed for American and British English, German, French, Latin American Spanish and Italian. The inventory of compilation segments is of mixed dimension: the segments are phonemes or allophones. First Byte positions ProVoice and the software products based on it as applications with low consumption of CPU time. First Byte also offers a powerful articulatory synthesis system, PrimoVox, for use in telephony applications. For developers: Monologue for Windows supports the Microsoft SAPI specification.

The fashion for free products has not passed over the area of speech synthesis applications. MBROLA is a multilingual synthesis system that implements a special hybrid algorithm of compilation synthesis and runs under PC/Windows 3.1, PC/Windows 95 and Sun4. However, the system accepts as input a chain of phonemes rather than text, and is therefore not, strictly speaking, a text-to-speech synthesis system. The TruVoice formant synthesizer from Centigram Communication Corporation (USA) is close in capabilities to the systems described above, but it supports more languages: American English, Latin American Spanish, German, French and Italian. In addition, this synthesizer includes a special preprocessor that provides quick preparation for reading e-mail messages, faxes and databases.

Speech output


Speech output from a computer is a problem no less important than speech input. It is the second half of the speech interface, without which a conversation with a computer cannot take place. What is meant here is reading textual information aloud, not playing back pre-recorded sound files, that is, delivering previously unknown information in spoken form. In effect, text-to-speech synthesis opens one more channel for transferring data from the computer to a person, in addition to the one we have through the monitor. Of course, it would be difficult to convey a picture by voice, but in some cases it would be quite convenient to hear an e-mail message or the result of a database search, especially if one's eyes are busy with something else at the time.

From the user's point of view, the most reasonable solution to the speech synthesis problem is to include speech functions (in the future multilingual, with translation capabilities) in the operating system. Just as we use the PRINT command, we will use a TALK or SPEAK command. Such commands will appear in the menus of commonly used computer applications and in programming languages. Computers will voice menu navigation and read (duplicate by voice) screen messages, file directories and so on. An important note: the user must have sufficient control over the computer's voice, and in particular must be able to turn the voice off altogether if desired.
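
As a purely illustrative sketch of what such a SPEAK command might look like from application code, the snippet below uses the third-party pyttsx3 Python library (an assumption of this example, not something mentioned in the lecture) to voice a message, adjust the speaking rate, and mute the voice, the "off switch" demanded above.

```python
import pyttsx3  # offline TTS wrapper, standing in here for an OS-level SPEAK command

engine = pyttsx3.init()

# voice a screen message
engine.say("You have three new mail messages.")
engine.runAndWait()

# user-configurable voice: slow the speech down a little
engine.setProperty("rate", 150)        # approximate words per minute
engine.say("Reading the search results aloud.")
engine.runAndWait()

# the all-important ability to turn the voice off entirely
engine.setProperty("volume", 0.0)
```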

Even now, such functions would be far from superfluous for those who have vision problems. For everyone else they will create a new dimension of computer usability and significantly reduce the load on the nervous system and on the eyes. In our opinion, the question today is not whether speech synthesizers are needed in personal computers. The question is different: when will they be installed on every computer? It remains to wait, perhaps a year or two.

Automatic computer speech synthesis from text


Speech synthesis methods
Now, after this optimistic description of the near future, let us turn to the speech synthesis technology itself. Consider some, at least minimally meaningful, text. The text consists of words separated by spaces and punctuation marks. The pronunciation of words depends on their position in the sentence, and the intonation of a phrase depends on the punctuation marks and, quite often, on the type of grammatical construction used: in some cases a clear pause is made when the text is spoken even though there are no punctuation marks. Finally, pronunciation depends on the meaning of the word! Compare, for example, the choice between the readings "за́мок" (castle) and "замо́к" (lock) for one and the same written Russian word "замок".

Even the beginning of an analysis of the problem shows its complexity. In fact, dozens of monographs have been written on this subject, and a huge number of papers are published every month. We will therefore touch only on the most general points that are most important for understanding.

Generalized functional structure of the synthesizer

The structure of an idealized system of automatic speech synthesis consists of several blocks.

  • - Definition of the text language
  • - Normalization of the text
  • - Linguistic analysis: syntactic, morphemic analysis, etc.
  • - Formation of prosodic characteristics
  • - Phonemic transcription
  • - Formation of control information
  • - Generation of the sound signal

This scheme does not describe any particular existing system, but it contains components that can be found in many of them. The authors of specific systems, whether those systems are already commercial products or are still at the research stage, pay different amounts of attention to the individual blocks and implement them very differently, in accordance with practical requirements.
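
The hypothetical Python skeleton below simply mirrors this block diagram as a chain of function calls; every function is a trivial stub invented for illustration and does not correspond to any real synthesizer's code.

```python
import re
from typing import List, Tuple

# Stub pipeline mirroring the block list above; all logic is deliberately trivial.

def detect_language(text: str) -> str:
    return "en" if text.isascii() else "ru"               # definition of the text language

def normalize(text: str, lang: str) -> List[str]:
    return re.findall(r"\w+|[^\w\s]", text)                # words and punctuation tokens

def linguistic_analysis(tokens: List[str], lang: str) -> List[Tuple[str, str]]:
    return [(tok, "UNK") for tok in tokens]                # e.g. part-of-speech tagging

def assign_prosody(analysis) -> List[bool]:
    return [tok in ".,!?;:" for tok, _ in analysis]        # pause after punctuation only

def transcribe(analysis, prosody) -> List[List[str]]:
    return [list(tok.lower()) for tok, _ in analysis]      # pretend letter == phoneme

def build_control_info(phonemes, prosody):
    return [(ph, pause) for phs, pause in zip(phonemes, prosody) for ph in phs]

def generate_waveform(controls) -> bytes:
    return b""                                             # a real block returns PCM samples

def synthesize(text: str) -> bytes:
    lang = detect_language(text)
    tokens = normalize(text, lang)
    analysis = linguistic_analysis(tokens, lang)
    prosody = assign_prosody(analysis)
    phonemes = transcribe(analysis, prosody)
    controls = build_control_info(phonemes, prosody)
    return generate_waveform(controls)
```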

Linguistic processing module

First of all, the text to be read enters the linguistic processing module. There the language is determined (in a multilingual synthesis system) and non-pronounceable characters are filtered out; in some cases spell checkers (modules for correcting spelling and punctuation errors) are also used. The text is then normalized, that is, divided into words and other character sequences, the latter including, in particular, punctuation marks and paragraph markers. All punctuation marks are highly informative. Special sub-blocks are developed for voicing numbers. Converting numbers into word sequences is a relatively easy task if digits are read as digits rather than as numerals that must be grammatically correct, but numbers with different meanings and functions are pronounced differently: for many languages one can speak, for example, of a separate subsystem for pronouncing telephone numbers. Careful attention must be paid to correctly identifying and voicing numbers denoting dates, years, times of day, telephone numbers, sums of money and so on (the list may differ from language to language).
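
As a tiny illustration of one such normalization sub-block, the hypothetical sketch below expands bare integers up to 999 into English words and leaves everything else alone; a real normalizer would have separate rules for dates, times, phone numbers and currency, and for Russian would also have to inflect the numerals.

```python
import re

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def int_to_words(n: int) -> str:
    if n < 20:
        return UNITS[n]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else "-" + UNITS[n % 10])
    rest = n % 100
    return UNITS[n // 100] + " hundred" + ("" if rest == 0 else " " + int_to_words(rest))

def normalize_numbers(text: str) -> str:
    # only bare 1-3 digit integers; dates, phones, money need their own sub-blocks
    return re.sub(r"\b\d{1,3}\b", lambda m: int_to_words(int(m.group())), text)

print(normalize_numbers("Chapter 3 begins on page 142."))
# -> "Chapter three begins on page one hundred forty-two."
```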

Linguistic analysis

After the normalization procedure, every word of the text (every word form) must be assigned information about its pronunciation, that is, turned into a chain of phonemes; in other words, its phonemic transcription must be created. In many languages, including Russian, there are fairly regular reading rules, rules of correspondence between letters and phonemes (sounds), although applying them may require the prior placement of word stress. In English the reading rules are highly irregular, which makes this block more complicated for English synthesis. In any case, serious problems arise in determining the pronunciation of proper names, borrowings, new words, abbreviations and acronyms.

In addition, cases of graphic homonymy must be handled correctly: the same sequence of letters may, in different contexts, represent two different words or word forms and be read differently (cf. the example of the word "замок" above). The ambiguity of this kind can often be resolved by grammatical analysis, but sometimes only the use of broader semantic information helps.

For languages with fairly regular reading rules, one productive approach to converting words into phonemes is a system of contextual rules that translate each letter or letter combination into a particular phoneme, that is, an automatic phonemic transcriptor. However, the more exceptions to the reading rules there are in a language, the worse this method works. The standard way to improve the pronunciation of such a system is to add a few thousand of the most common exceptions to a dictionary. An alternative to the letter-to-phoneme solution involves morphemic analysis of the word and the translation of morphs (that is, the meaningful parts of the word: prefixes, roots, suffixes and endings) into phonemes. However, because of various boundary phenomena at the junctions of morphs, decomposition into these elements presents significant difficulties. At the same time, for languages with rich morphology, such as Russian, a dictionary of morphs would be more compact. Morphemic analysis is also convenient because it makes it possible to determine which part of speech a word belongs to, which is very important for the grammatical analysis of the text and for specifying its prosodic characteristics. In English synthesis systems, morphemic analysis was implemented in the MITalk system, for which the transcriptor error rate is 5%.
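
A hypothetical sketch of the "contextual rules plus exception dictionary" idea is given below; the rules, phoneme labels and exception entries are invented for illustration and are far simpler than anything a real transcriptor would use.

```python
import re

# invented exception dictionary: words that break the reading rules
EXCEPTIONS = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
}

# invented contextual rules: (letter, regex on the right context, phoneme); first match wins
CONTEXT_RULES = [
    ("c", r"^[eiy]", "s"),   # "c" before e, i, y  -> /s/
    ("c", r"",       "k"),   # "c" elsewhere       -> /k/
    ("a", r"",       "ae"),
    ("e", r"^$",     ""),    # word-final "e" is silent
]

def transcribe(word: str):
    word = word.lower()
    if word in EXCEPTIONS:                     # exceptions bypass the rules entirely
        return EXCEPTIONS[word]
    phones = []
    for i, ch in enumerate(word):
        right = word[i + 1:]
        for letter, ctx, phone in CONTEXT_RULES:
            if ch == letter and re.search(ctx, right):
                if phone:
                    phones.append(phone)
                break
        else:
            phones.append(ch)                  # default: the letter "reads as itself"
    return phones

print(transcribe("face"), transcribe("one"))   # ['f', 'ae', 's'] ['w', 'ah', 'n']
```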

A particular problem for this stage of text processing is formed by proper names.

Formation of prosodic characteristics

The prosodic characteristics of an utterance include its tonal, accentual and rhythmic characteristics; their physical correlates are the fundamental frequency, energy and duration. In live speech, the prosodic characteristics of an utterance are determined not only by the words it consists of, but also by the meaning it carries, by the listener it is addressed to, by the emotional and physical state of the speaker, and by many other factors. Many of these factors retain their significance when reading aloud, since a person usually interprets and comprehends a text in the process of reading it. Thus, one would expect roughly the same from a synthesis system, that is, that it should understand the text it receives as input, using artificial intelligence methods. However, this level of computer technology has not yet been reached, and most modern automatic synthesis systems try to synthesize speech correctly with an emotionally neutral intonation. Even this task is very difficult today.

The prosodic characteristics needed for voicing the text are formed by three main blocks: a block for placing syntagmatic boundaries (pauses), a block for assigning rhythmic and accentual characteristics (durations and energy), and a block for assigning tonal characteristics (the fundamental frequency). When syntagmatic boundaries are placed, the parts of the utterance (syntagmas) are identified within which the energy and tonal characteristics behave uniformly and which a person can pronounce in one breath. If the system does not pause at the boundaries of such units, a negative effect arises: the listener gets the impression that the speaker (in this case the system) is choking. The placement of syntagmatic boundaries is also essential for the phonemic transcription of the text. The simplest solution is to place boundaries where punctuation marks dictate. For the simplest cases without punctuation marks, a method based on function words can be applied. Such methods are used in the Prose-2000, Infovox SA-101 and DECtalk synthesis systems; in the latter, the prosodically oriented dictionary includes, in addition to function words, verb forms as well.
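
The toy sketch below shows the simplest strategy just described: break at punctuation, and additionally break before a function word once the current syntagma grows too long. The function-word list and length threshold are invented for illustration and are not taken from any of the systems named above.

```python
import re

FUNCTION_WORDS = {"and", "but", "or", "which", "that", "because", "when", "if"}
MAX_WORDS = 7    # rough "one breath" limit, an invented threshold

def split_into_syntagmas(text: str):
    tokens = re.findall(r"\w+|[.,!?;:]", text)
    syntagmas, current = [], []
    for tok in tokens:
        if tok in ".,!?;:":                      # punctuation always closes a syntagma
            if current:
                syntagmas.append(current)
            current = []
        elif tok.lower() in FUNCTION_WORDS and len(current) >= MAX_WORDS:
            syntagmas.append(current)            # break *before* the function word
            current = [tok]
        else:
            current.append(tok)
    if current:
        syntagmas.append(current)
    return [" ".join(s) for s in syntagmas]

print(split_into_syntagmas(
    "The system places boundaries where punctuation dictates them "
    "and it falls back on function words when a sentence runs on and on"))
```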

The task of attributing tonal characteristics is usually set fairly narrowly. In speech synthesis systems, neutral intonation is usually attributed to the sentence. No attempts have been made to model higher-level effects, such as emotional coloration of speech, since this information is difficult, and often impossible, to extract from the text.

A Russian-language synthesizer


As an example, consider the "Talking Mouse" development of the speech technology club at the MSU Science Park. (It is known that similar work is under way in several other Russian organizations and companies, but no details could be found in the press.)

The synthesis is based on the idea of combining the method of concatenation with synthesis by rule. The concatenation method, with an adequate set of basic compilation elements, provides high-quality reproduction of the spectral characteristics of the speech signal, while the set of rules makes it possible to form a natural intonational and prosodic shape for utterances. There are other synthesis methods, perhaps more flexible in the future but still giving a less natural sounding of the text. This is, first of all, parametric (formant) speech synthesis by rule or on the basis of compilation, developed for a number of languages by foreign researchers. However, implementing this method requires statistically representative acoustic-phonetic databases and the corresponding computer technologies.

A formal language for writing synthesis rules

To provide a convenient and fast way of changing and verifying the rules included in the different blocks of the synthesis system, a formalized and at the same time semantically transparent and understandable rule-writing language was developed that is easily compiled into program source code. At present the automatic transcriptor block contains about 1,000 lines written in this formalized rule-representation language.

Intonation support

The function of the developed rules is to determine the temporal and tonal characteristics of the basic compilation elements, which, when a syntagma is processed, are selected from the library in the required sequence by a special processor (coding block). The preliminary text-processing operations required for this purpose (singling out syntagmas, choosing the type of intonation, determining the degree of stress of vowels, i.e. stressed versus unstressed, and establishing the symbolic sound content of the syllable complexes) are performed by the automatic transcriptor block.

The time processor also includes rules that specify the length of the pause after the end of a syntagma (final or non-final), which are needed for synthesizing coherent text. Provision is also made for modifying the overall tempo of pronunciation of a syntagma and of the text as a whole, in two variants: a standard one, with a uniform change of all compilation units, and a special one, which allows changing the duration of only the vowels or only the consonants.
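
A minimal sketch of such tempo rules is given below, assuming the compilation units are simply (allophone, duration) pairs; the vowel test and the rate semantics are invented for illustration and do not reproduce the system's actual rule set.

```python
VOWEL_PREFIXES = ("a", "e", "i", "o", "u", "y")   # placeholder; the real system works with Russian allophones

def scale_durations(units, rate, vowels_only=False):
    """units: list of (allophone_label, duration_ms); rate < 1.0 speeds the speech up."""
    scaled = []
    for label, duration in units:
        if vowels_only and not label.lower().startswith(VOWEL_PREFIXES):
            scaled.append((label, duration))       # consonants keep their duration
        else:
            scaled.append((label, duration * rate))
    return scaled

units = [("k", 60), ("a1", 120), ("t", 70)]
print(scale_durations(units, 0.8))                    # uniform variant
print(scale_durations(units, 0.8, vowels_only=True))  # vowels-only variant
```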

The tone processor contains rules for forming eleven intonation patterns: neutral declarative intonation (full stop); the intonation typical of focused answers to questions; the intonation of sentences with contrastive emphasis on individual words; the intonations of the special and the general question; the intonation of special opposing or comparative questions; the intonation of forms of address and of some kinds of exclamations and commands; two kinds of incompleteness; enumerative intonation; and the intonation of parenthetical constructions.

Allophone database

The necessary speech material was recorded in the following digitization mode: a sampling rate of 22 kHz with 16-bit resolution.

Allophones, an optimal set of which constitutes the acoustic-phonetic base of the synthesis, were chosen as the basic compilation elements. The inventory of basic compilation units includes 1,200 items and occupies about 7 MB of memory. In most cases the compilation elements are segments of the speech wave of phoneme size. To obtain the necessary initial base of compilation units, a special dictionary was compiled containing words and phrases with allophones in all the contexts considered; it contains 1,130 word usages.
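
A quick back-of-the-envelope check (my own arithmetic, assuming mono 16-bit PCM at the 22 kHz rate quoted above) shows that these figures are mutually consistent: roughly two and a half minutes of audio in total, on the order of 140 ms per compilation unit, which is plausible for phoneme-sized segments.

```python
bytes_per_second = 22_000 * 2                          # mono, 16-bit PCM at 22 kHz
total_seconds = 7 * 1024 * 1024 / bytes_per_second     # ~167 s of audio in a 7 MB base
avg_unit_ms = total_seconds / 1200 * 1000              # ~139 ms per compilation unit
print(round(total_seconds), round(avg_unit_ms))
```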

Generation of the acoustic signal

Based on the data received from the other speech synthesis modules and on the allophone base, the acoustic signal generation program modifies the durations of consonants and vowels. It can modify the duration of individual periods of voiced sounds, using two or three pitch-control points on the allophone segment, modifies the energy characteristics of the segment, and joins the modified allophones into single fluent speech.

At the stage of the synthesis of the acoustic signal, the program allows you to get a variety of acoustic effects, such as reverb, echo, change in frequency coloration.

The finished acoustic signal is converted into a data format adopted for the output of audio information. Two formats are used: WAV (Waveform Audio File Format), which is one of the main ones, or VOX (Voice File Format), widely used in computer telephony. The output can also be carried out directly to the sound card.
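
For illustration only (this is not the system's own code), the sketch below writes an array of synthesized samples to a WAV file in the digitization mode quoted earlier, mono 16-bit PCM at about 22 kHz, using Python's standard wave module.

```python
import wave
import numpy as np

def write_wav(path, samples, rate=22_050):
    """samples: 1-D float array in [-1, 1]."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")  # 16-bit little-endian PCM
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 2 bytes = 16 bits
        wf.setframerate(rate)
        wf.writeframes(pcm.tobytes())

# example: one second of a 440 Hz tone standing in for a synthesized signal
t = np.arange(22_050) / 22_050
write_wav("out.wav", 0.3 * np.sin(2 * np.pi * 440 * t))
```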

Russian speech synthesis toolkit

The above-mentioned toolkit for synthesizing Russian speech from text makes it possible to read mixed Russian-English texts aloud. The toolkit is a set of dynamic libraries (DLLs) that includes modules for Russian and English synthesis, a dictionary of Russian word stresses, and a module for pronouncing English words. A word or a sentence to be pronounced is fed to the toolkit's input, and the output is an audio file in WAV or VOX format, which is written to memory or to the hard disk.

Speech recognition system
The speech recognition system consists of two parts. These parts can be further divided into blocks or subroutines, but for simplicity let us say that a speech recognition system consists of an acoustic part and a linguistic part. The linguistic part may include phonetic, phonological, morphological, syntactic and semantic models of the language.

The acoustic model is responsible for representing the speech signal. The linguistic model interprets the information received from the acoustic model and is responsible for presenting the recognition result to the consumer.

Acoustic model

There are two approaches to building an acoustic model: the inventive (engineering) approach and the bionic one. Both approaches have their advantages and disadvantages, and when technical systems are developed, the choice of approach is of paramount importance.

Linguistic model

The linguistic block is subdivided into the following tiers (layers, levels): phonetic, phonological, morphological, lexical, syntactic and semantic. There are six of them; the Russian language is taken as the basis. All the tiers constitute a priori information about the structure of the natural language, and, as is known, any a priori information about the subject of interest increases the chances of making the right decision; this is the basis of all statistical radio engineering. Natural language carries highly structured information, which, incidentally, implies that each natural language may require its own unique linguistic model (I foresee difficulties with the Russification of complex speech recognition systems).

In accordance with this model, at the first, phonetic, level the representation of speech that enters the linguistic block is transformed into a sequence of phonemes as the smallest units of the language. It is believed that in a real speech signal only allophones can be detected, variants of phonemes that depend on the surrounding sounds, but this does not change the essence. Note that the association of phonemes may thus migrate into the linguistic block. At the next, phonological, level restrictions are imposed on the combinatorics of phonemes (allophones). A restriction is a rule turned inside out, which means that once again we have useful a priori information: not all combinations of phonemes (allophones) occur, and those that do occur have different probabilities of occurrence, which also depend on the environment. To describe this situation, the mathematical apparatus of Markov chains is used.

Further, at the morphological level, the system operates with syllabic units of speech of a higher level than the phoneme, sometimes called morphemes; these impose restrictions on the structure of the word, obeying the laws of the natural language being modeled. The lexical tier covers the words and word forms of the particular natural language, that is, the dictionary of the language, again introducing important a priori information about which words are possible in that language. Semantics establishes the relations between the objects of reality and the words denoting them; it is the highest level of the language. With the help of semantic relations the human intellect compresses, as it were, a speech message into a system of images and concepts that represent its essence. Hence the conclusion that the system must be "smart".
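
To make the Markov-chain remark concrete, the toy Python sketch below scores candidate phoneme sequences with a bigram (first-order Markov) model; the transition probabilities are invented for illustration, whereas a real model would be estimated from a speech corpus.

```python
import math

# invented bigram probabilities over phoneme labels
TRANSITIONS = {
    ("k", "a"): 0.20, ("a", "t"): 0.15, ("t", "a"): 0.10,
    ("a", "k"): 0.05, ("k", "t"): 0.001,
}
FLOOR = 1e-4   # unseen transitions get a small floor probability

def sequence_log_prob(phonemes):
    return sum(math.log(TRANSITIONS.get(pair, FLOOR))
               for pair in zip(phonemes, phonemes[1:]))

# "k a t" is far more plausible than "k t a" under this toy model
print(sequence_log_prob(["k", "a", "t"]))   # about -3.5
print(sequence_log_prob(["k", "t", "a"]))   # about -9.2
```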

Classification of speech recognition systems
Classification by purpose:

  • - command systems
  • - text dictation systems

By consumer qualities:

  • - speaker-dependent (trained for a specific speaker)
  • - speaker-independent
  • - recognizing individual words
  • - recognizing continuous (connected) speech

According to the mechanisms of functioning:

  • - the simplest (correlation) detectors (a toy sketch follows this list)
  • - expert systems with different ways of forming and processing the knowledge base
  • - probabilistic network decision-making models, including neural networks
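
As an illustration of the first, simplest class, here is a toy correlation detector: it compares a feature sequence extracted from the input utterance against stored word templates by normalized correlation and picks the best match. The feature representation and the templates are placeholders invented for this sketch; real systems also time-align the sequences rather than truncating them.

```python
import numpy as np

def normalized_correlation(a, b):
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    n = min(len(a), len(b))                      # crude length handling
    return float(np.dot(a[:n], b[:n]) / n)

def recognize(features, templates):
    """templates: dict mapping a word to its 1-D feature array (e.g. short-time energy)."""
    return max(templates, key=lambda word: normalized_correlation(features, templates[word]))

# placeholder templates and input
templates = {"yes": np.array([0.1, 0.8, 0.9, 0.2]), "no": np.array([0.7, 0.3, 0.2, 0.6])}
print(recognize(np.array([0.2, 0.7, 0.8, 0.1]), templates))   # -> "yes"
```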

Conclusion


For a person it is dialogue, not monologue, that is natural and familiar. As a result of underestimating the need for a voice response, we get increased operator fatigue, monotony of speech and limited applicability of the speech interface. How can a computer equipped with a speech recognizer help a blind person if it lacks a non-visual feedback device?

The fact that a person involuntarily adjusts his voice to the voice of his interlocutor is widely known. Why not use this human ability to increase the accuracy of computer speech recognition, correcting the operator's pronunciation by means of two-way dialogue? In addition, it is quite possible that properly organized and modulated synthesis can significantly reduce the operator's risk of the diseases associated with monotonous speech and additional strain. The ubiquitous penetration of the graphical user interface was achieved through the combined use of a graphical monitor as the means of graphical output and a mouse as the means of input, and, not least, thanks to Xerox's ingenious conceptual discoveries in the field of the window interface.

The future of speech interface is no less dependent on the ability of modern developers not only to create the technological basis of speech input, but also to harmoniously merge technological findings into a single logically complete system of human-computer interaction. The main work is still ahead!

