How to write a chat bot?

Lecture

Conversation programs range from genuine artificial intelligences to mere emulators, and they differ in interface, in ability to learn, in the size of their phrase base, and so on. A program does not have to be complicated to please the user. The reverse also holds: you can spend a great deal of effort on a complex algorithm and still end up with a program that leaves everyone cold (apart from admiration for the programmer who managed to write such a monster) and that nobody uses.

The point is that developing such programs takes more than programming skill: you also need to know a little psychology and the principles by which phrases are built in a human language (in this case, Russian). This is perhaps even more important than programming skill, because for a simple emulator (an emulator, not a real AI) the knowledge gained in the first three or four school lessons on any available programming language is quite enough. It is also advisable to write the program's replies without gross spelling errors: they leave an unpleasant impression, much like rudeness, and about half of all AI emulators suffer from this.

I want to stress right away that this article is about writing an emulator, not a real AI. The algorithm is not too complicated, yet the results are quite acceptable...

How is the phrase that the program gives in response to the user's words chosen? There are several options, which can be used individually or in combination. I do not discuss ways of implementing them here: a great many can be devised, some more effective than others.

Analysis of the phrase entered by the user

The user's phrase is not analyzed at all; a random phrase is given in response. This method may seem primitive, but with a skillful selection of phrases and a large enough number of them, the user does not notice it immediately.
The user's phrase is searched for keywords; each stimulus word triggers a corresponding reaction. Previous phrases are not taken into account.

There are several variants here. They differ in how words are searched for in the phrase:

whole words are searched for in the initial form;
whole words are searched for in different grammatical forms (the reaction to different forms of the same word can be fundamentally different);
parts of words are searched for, for example, roots (in this case, different forms of one word cause the same reaction);
synonyms of words are searched for (in the initial form or in various forms; the synonyms can be set by the developer or determined by the program itself in the course of learning);
not only words but also punctuation marks, spaces, and the position of the word in the phrase are taken into account: the phrase begins with the word, ends with it, or the word stands somewhere in the middle;
not one word or combination of characters is searched for, but two, three, etc. (they must all be present in the phrase, but may come in any order);
the search is done by mask, that is, the phrase is searched for several combinations of characters that stand in a certain order and are separated by other sequences of characters.
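
Several of the search variants listed above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and treating a "word" as a run of letters or digits is an assumption of the sketch.

```python
import re

def words_of(phrase):
    """Split a phrase into uppercase words, dropping punctuation."""
    return re.findall(r"\w+", phrase.upper())

def has_whole_word(phrase, keyword):
    """Whole-word search: the keyword must equal one of the words."""
    return keyword.upper() in words_of(phrase)

def has_part(phrase, part):
    """Part-of-word search (for example, a root): plain substring test."""
    return part.upper() in phrase.upper()

def has_all(phrase, keywords):
    """Several keywords must all be present, in any order."""
    return all(has_whole_word(phrase, k) for k in keywords)

def matches_mask(phrase, parts):
    """Mask search: the parts must appear in the given order,
    separated by any other characters."""
    pattern = ".*".join(re.escape(p.upper()) for p in parts)
    return re.search(pattern, phrase.upper()) is not None
```

With these helpers, "know" is found as a whole word in "Do you know me?" but only as a part of a word in "I have knowledge", which is exactly the difference between the first and third variants.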

Special mention should be made of the differences in recognizing interrogative sentences, negations, and phrases consisting of several sentences. Ideally, each sentence should be considered separately (perhaps taking into account its position in the entered phrase: for example, the last sentence entered is considered first); the response to an interrogative sentence should differ from the response to the corresponding declarative one (that is, one differing only in the punctuation mark at the end); and the particles "not" and "neither" should be attached, during analysis, to the word they precede (even when written separately from it). Most programs that notice "not" and "neither" in the text do not determine which word these particles refer to.
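
A rough sketch of this preprocessing, assuming English input and handling only the particle "not", might look as follows. The function names and the NOT_ prefix are conventions invented for the example.

```python
import re

def split_sentences(phrase):
    """Return (text, is_question) pairs, last sentence first."""
    parts = re.findall(r"[^.!?]+[.!?]*", phrase)
    sentences = []
    for part in parts:
        part = part.strip()
        if part:
            sentences.append((part, part.endswith("?")))
    return list(reversed(sentences))

def attach_negations(sentence):
    """Glue 'not' to the word that follows it: 'not like' -> 'NOT_LIKE'."""
    words = re.findall(r"\w+", sentence.upper())
    result = []
    negate = False
    for word in words:
        if word == "NOT":
            negate = True
        elif negate:
            result.append("NOT_" + word)
            negate = False
        else:
            result.append(word)
    if negate:
        # A trailing 'not' has no word to attach to; keep it as is.
        result.append("NOT")
    return result
```

After this step a keyword base can hold separate entries for LIKE and NOT_LIKE, so the program no longer reacts identically to "I like it" and "I do not like it".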

They also differ in how the response phrase is produced:

the same reaction phrase is always given for each stimulus word;
in response to a stimulus word, a random reaction phrase is chosen, always from the same set;
the reaction phrase is not simply chosen at random from a particular set, but also depends in some way on external conditions (for example, on the time of day, on the value of a "mood" quantity set when the program starts, on the "character" of the program, etc.);
sometimes, in response to a stimulus word, not only is a reaction phrase displayed, but a certain procedure is also performed (for example, in my Talker it was a psychological support block, a test, composing proverbs, etc.).
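
As an illustration of the second and third variants, here is a minimal sketch in which the reaction is chosen at random from a set that depends on a "mood" value. The REACTIONS table, the mood names, and the reply "Oh. You again." are hypothetical; a real program would define its own.

```python
import random

# Stimulus word -> mood -> possible reactions (hypothetical data).
REACTIONS = {
    "HELLO": {
        "good": ["It's great to see you.", "Hello."],
        "bad": ["Oh. You again."],
    },
}

def react(stimulus, mood, rng=random):
    """Pick a random reaction for the stimulus from the set that
    matches the current mood; return None for unknown stimuli."""
    by_mood = REACTIONS.get(stimulus)
    if by_mood is None:
        return None
    return rng.choice(by_mood[mood])
```

The mood could be set once at startup or derived from the time of day; the selection code stays the same either way.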

Separately, it should be noted that responses can be selected with varying degrees of strictness: the program may give a reply from its base only if a word in the entered string exactly matches a keyword in the base, or if only some of the keywords match (here the degree of match may be taken into account), or even if no known keywords are found in the phrase at all. In the last case some programs ask to be taught how to respond to such a phrase; some get off with generalities suitable as a reply to any remark; and some pick the first reply that comes to hand from the base, sometimes excluding the obviously inappropriate ones but generally not worrying much about whether the answer makes sense.

The conversation takes into account not only the user's last phrase but also the previous ones (in the simplest case two or three phrases; in a more complex one the whole preceding conversation). In the latter case the dialogue moves along a graph: the reply is chosen according to the vertex we are currently at, and the vertex to move to after the reply is chosen according to the phrase the user says. This may mean tracking the context of the conversation and/or simply taking into account the topic of the previous replies.
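
The movement along a graph can be sketched like this. The vertices, replies, and transition keywords are invented for the example, and matching by substring is just the simplest possible choice.

```python
# Each vertex holds the reply to say and the transitions leading
# out of it: keyword in the user's phrase -> next vertex.
GRAPH = {
    "start": {"reply": "Hello. How are you?",
              "next": {"FINE": "glad", "BAD": "sorry"}},
    "glad": {"reply": "Glad to hear it.", "next": {}},
    "sorry": {"reply": "What happened?", "next": {}},
}

def step(vertex, phrase):
    """Return the vertex to move to after the user's phrase.
    Stay at the same vertex if no transition keyword is found."""
    text = phrase.upper()
    for keyword, target in GRAPH[vertex]["next"].items():
        if keyword in text:
            return target
    return vertex
```

The program says GRAPH[vertex]["reply"], reads the user's phrase, and calls step to find out where it is next; the same user phrase can therefore lead to different replies at different points of the dialogue.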

The phrase given by the program is not selected from a set of ready-made phrases, but is formed by filling a template (or one of several available templates) with words from the base, depending on the context of the conversation.
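
A minimal sketch of template filling, assuming the "context" is simply the most recent known topic word found in the conversation history. The templates and the function names are hypothetical.

```python
import random

# Hypothetical templates with a slot for a topic word.
TEMPLATES = ["What do you think about {topic}?",
             "Let's talk about {topic}."]

def pick_topic(history, known_topics):
    """Return the most recent known topic word mentioned in the
    conversation history, or None if there is none."""
    for phrase in reversed(history):
        text = phrase.lower()
        for topic in known_topics:
            if topic in text:
                return topic
    return None

def fill_template(topic, rng=random):
    """Choose one template and substitute the topic word into it."""
    return rng.choice(TEMPLATES).format(topic=topic)
```

For the history ["I bought a computer", "It was expensive"] and the known topics "money" and "computer", pick_topic returns "computer", which fill_template then places into one of the templates.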

Training

Not present in any form. This does not mean, however, that such a program cannot be taught anything: in most cases the bases of such programs can quite successfully be edited directly.
There is a special training mode, separate from the conversation. During normal conversation the program does not learn (although a training mode is often also present in programs that can learn directly during conversation!).
Every user phrase is entered into the base during the dialogue without analysis (such programs very often "grow stupid" as they learn: they begin to give replies unrelated to the user's previous remark).
Every user phrase is entered into the base after preliminary analysis (in this case, if the analysis algorithm is good, the program grows wiser before our eyes, but the base grows catastrophically fast and the program soon starts to slow down).
After preliminary analysis, not all phrases are entered into the base, but only some (for example, those relating to significant topics or answering the most frequently encountered remarks). In this case, even if the analysis algorithm is primitive, the program gets smarter fairly quickly and meaningless remarks do not get into the base; but good criteria for selecting phrases have to be developed.

The addition of the user's phrases to the base deserves special consideration. The structure of the base can vary: it may simply be a sequence of lines in a file, or records with several fields, or a tree of replies. With lines and records everything is more or less clear (you just add the new phrase to the end of the base, or to the beginning, depending on the implementation), but with a tree things are a bit more complicated. The difficulty lies in deciding exactly where in the tree a new phrase should go. Naturally, if you add it to the first branch that comes along, the program will over time grow not smarter but, on the contrary, more stupid.

There are algorithms for searching a tree and adding new elements when we are dealing with numbers or with strings whose meaning is irrelevant, that is, when only the length or alphabetical order of a phrase matters. Here, however, meaning comes first, so phrases somehow have to be compared with each other by meaning. Since entered phrases are unlikely to coincide exactly with any reference phrases (with the exception of very short replies such as "Yes", "No", "Of course", "I don't know"), you need criteria for the "similarity" of phrases (taking the context into account, that is, the meaning of the previous replies). How do you find out which of the known phrases is most similar to a given one? Honestly, I do not know. That is, I can come up with several algorithms, but they will clearly be imperfect and not effective enough.

For example, if you look for the largest number of the longest possible character sequences shared by the user's phrase and the phrases already in the tree, accidental matches may occur when phrases contain very long and very similar words with completely different meanings. Take the words "intensification" and "electrification": their last nine letters coincide, which may lead the program to consider them quite close (after all, such long words are not often used in colloquial speech!). If the two words came up in conversation often enough, the program would have a chance to work out the difference in meaning between them. But most likely such words, having flashed by in conversation, will be forgotten for a long time, and the error will surface some time later. One such error would be nothing terrible, but errors accumulate... In addition, such an algorithm is guaranteed to stumble on homonyms (words with the same spelling and different meanings) unless the meaning of the other words in the phrase is taken into account. Programs with such an algorithm will also work poorly if a person decides to play with words, using them "in the wrong sense".

I do not know what algorithm Dmitry Zhuravlev's Chat Master program uses, but I suspect that it, too, analyzes phrases for maximum similarity with the words and phrases it already has. In a conversation with this program on computer topics, I only had to enter a phrase like "Windows 95 - a new level of interactive erotica" (for the sake of experiment I entered "normal" phrases as well as jokes, aphorisms, and quotations from books, but always in context, so that a person would have reacted adequately to such a phrase), and the program "got stuck": for some reason it steered every conversation about computers exclusively to sex, and I had to spend a lot of effort retraining it.
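
The pitfall with long, similar words is easy to reproduce. The sketch below is not the algorithm of any particular program; it is a deliberately naive similarity measure, invented for this example, that looks only at the ending two words share.

```python
def common_suffix(a, b):
    """Return the longest ending shared by both strings."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]

def naive_similarity(a, b):
    """Share of the longer word covered by the common ending."""
    return len(common_suffix(a, b)) / max(len(a), len(b))

print(common_suffix("intensification", "electrification"))
print(naive_similarity("intensification", "electrification"))
```

By this measure the two words look more than half identical although their meanings have nothing in common, which is why character-level matching alone cannot decide where a new phrase belongs.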

Naturally, I have touched on far from all the questions concerning artificial intelligence programs and their emulators. There are many more interesting points, many ideas and ways of implementing them, many algorithms of varying efficiency...

How to write, in half an hour, a simple program that imitates the ability to hold a conversation

Is it difficult to write artificial intelligence emulators? The question is hard to answer. Writing a real AI is difficult. Writing a good emulator is not very difficult, but it takes quite a long time (you need either to develop a good algorithm or to build a large base, and both take a lot of time). A simple emulator with a small base, however, is quite easy to write: it can be done in half an hour with only the most basic knowledge of any programming language.

So, how is it done, and what do you need to know and be able to do?

We set up a sequence of keywords (in a string, a text file, an array, a huge database, anywhere at all, as long as it is ordered). Strictly speaking, it may include whole phrases as well as parts of words.

We set up a sequence of answer phrases corresponding to these keywords (in principle there may be many different answers for each keyword, but for now we are considering the simplest case). Where the answers are stored does not matter: in the same line of the file as the corresponding keyword (after a delimiter), in another file, in an array... The main thing is to be able to find the appropriate answer given the keyword.

We set up a sequence of stock phrases that the program will give when it does not know what to answer. Here the order of the phrases does not matter, since we will choose one of them at random.

While the program is running, we read a string from the keyboard. We loop over all the keywords, and if one of them is present in the entered phrase, the program answers with the corresponding reaction phrase. What to do if the phrase contains several keywords can be decided in different ways: you can take the first keyword that matched, you can choose one of the matching answer phrases at random, or you can calculate a "weight" for each keyword, though the last is somewhat harder. If no keyword is found in the entered phrase, we give a stock "empty" phrase.

What do you need to know and be able to do to write such a program? Keyboard input and screen output. Naturally, the assignment operator. Branching. Loops (to go through the sequence of key phrases and, in some languages, to keep the dialogue going as long as you like rather than stopping after the first reply). Arrays or file handling. A little string handling (in particular, checking whether one string occurs in another, and converting lowercase letters to uppercase to simplify further processing). And a couple of standard functions for getting a random integer in a given range. In practice, part-time students, many of whom are sitting down at a computer for the first time in their lives, let alone programming, learn all this in five classes (not counting file handling). Such knowledge is of course very superficial, but it is quite enough for writing a simple emulator. The simplest emulator does not require writing your own procedures and functions, although more complex programs do.

A concrete example. I will not tie myself to any particular programming language and will only write down the idea. We assume that the AI emulator is being written by someone who knows nothing at all about programming and has only just become acquainted with the language constructs listed above, so we deliberately leave out file handling. For simplicity, our program will not learn.

There is a two-dimensional array slova of size two by, say, ten (the number of columns can be increased, and the program's vocabulary grows accordingly); its elements are strings. The first row contains the keywords and key phrases, and the second row contains the responses. The elements might have values such as these:

slova [1,1] = "YOUR NAME?"
slova [2,1] = "My name is Program."
slova [1,2] = "HOW ARE YOU?"
slova [2,2] = "I'm fine, and you?"
slova [1,3] = "CAN YOU"
slova [2,3] = "My options are very limited."
slova [1,4] = "KNOW"
slova [2,4] = "Unfortunately, I still know very little ..."
slova [1,5] = "CAN"
slova [2,5] = "My options are very limited."
slova [1,6] = "LOVE"
slova [2,6] = "I have no pronounced preferences."
slova [1,7] = "UNDERSTAND"
slova [2,7] = "I still don't understand everything, because I have a small base."
slova [1,8] = "?"
slova [2,8] = "I do not have enough knowledge to answer your question."
slova [1,9] = "HELLO"
slova [2,9] = "It's great to see you."
slova [1,10] = "HI"
slova [2,10] = "Hello."

There is also a one-dimensional array of random phrases; these phrases could have been put in a file, but we agreed not to work with files...

Here are some of these phrases:

I don't really understand this subject yet.
You surprise me with your ability to think.
May I ask where you got that information?
It seems to me that you are hiding something from me.
I think many share this view.
I cannot talk to people who are trying to catch me out.

And so on. The more such phrases there are, the better (so that they do not repeat during a long conversation). You can, of course, use phrases like "I do not understand you," but since they would come up quite often, it would become obvious that the program knows nothing and can do nothing, whereas our task is to "pass for smart".

So:

we read the phrase fraza from the keyboard and replace all lowercase letters with uppercase ones;
we assign the empty string "" to otwet and set i to 1;
in a loop, while i does not exceed n (the number of keywords) and otwet is empty, we do the following: if the array element slova [1, i] occurs in the string fraza, we assign the value slova [2, i] to otwet; then we increase i by 1 (so we leave the loop either when the keywords run out or when a response phrase has been found);
if otwet is still equal to the empty string, we do the following:
we choose a random integer from 1 to m, where m is the number of "empty" phrases (some programming languages have special functions for this, while others offer only a random number from 0 to 1, in which case you will have to write out a formula);
from the array of random phrases we take the phrase with that number and assign it to otwet;
we display the value of otwet on the screen.
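
The steps above can be sketched in Python roughly as follows. This is one possible rendering, not a definitive implementation: the base is shortened, a list of pairs stands in for the two-row array, and Python counts from 0, so the random index runs from 0 to m-1 instead of from 1 to m. The helper random_index is an assumption of this sketch; it shows the formula needed when only a random number from 0 to 1 is available.

```python
import random

# Keyword -> response pairs, checked in order (first match wins).
SLOVA = [
    ("YOUR NAME", "My name is Program."),
    ("KNOW", "Unfortunately, I still know very little..."),
    ("LOVE", "I have no pronounced preferences."),
    ("?", "I do not have enough knowledge to answer your question."),
    ("HELLO", "It's great to see you."),
]

# Stock "empty" phrases used when no keyword is found.
EMPTY = [
    "I think many share this view.",
    "It seems to me that you are hiding something from me.",
]

def random_index(m, r=None):
    """Turn a random number r from [0, 1) into an index 0..m-1.
    For numbering from 1 to m, add 1 to the result."""
    if r is None:
        r = random.random()
    return int(r * m)

def answer(fraza):
    """Return the program's reply to the user's phrase."""
    fraza = fraza.upper()
    otwet = ""
    i = 0
    while i < len(SLOVA) and otwet == "":
        keyword, response = SLOVA[i]
        if keyword in fraza:
            otwet = response
        i += 1
    if otwet == "":
        otwet = EMPTY[random_index(len(EMPTY))]
    return otwet

def main():
    """Keep the dialogue going until the user enters an empty line."""
    while True:
        fraza = input("> ")
        if not fraza:
            break
        print(answer(fraza))
```

Calling main() starts the dialogue. Note that "Hello?" gets the reply for "?" rather than for "HELLO", because "?" comes earlier in the base: the order of records already matters even in a program this small.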

It is not much harder, but considerably more convenient, to put the keywords and the program's replies in a separate file. The program's base can then be extended (either by the program itself or by editing the text file by hand). The simplest variant: each rule of the knowledge base is stored on a separate line of the file, first the keyword and then, after a delimiter character, the response. A space could serve as the delimiter, but then word combinations cannot be used as keywords (since a word combination contains spaces). It is more convenient to take a character that practically never occurs in the text (in conversational speech), for example "|". If there are several keywords or several answer variants, two delimiters have to be thought out: one separating the keywords from the answers, and a second separating one answer variant from another.
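
A sketch of reading such a file might look like this. The text only fixes "|" as an example of the first delimiter; using ";" as the second one is an assumption of this sketch (it does occur in ordinary text, so a rarer character may be a better choice), and the function names are made up.

```python
def parse_base(lines, key_sep="|", answer_sep=";"):
    """Parse 'KEYWORD|answer one;answer two' lines into a list of
    (keyword, [answers]) pairs, keeping the order of the lines."""
    base = []
    for line in lines:
        line = line.strip()
        if not line or key_sep not in line:
            continue  # skip blank or malformed lines
        keyword, answers = line.split(key_sep, 1)
        variants = [a.strip() for a in answers.split(answer_sep) if a.strip()]
        base.append((keyword.strip().upper(), variants))
    return base

def load_base(path):
    """Read the base from a UTF-8 text file at the given path."""
    with open(path, encoding="utf-8") as f:
        return parse_base(f)
```

Keywords are converted to uppercase on loading, so the file can be edited by hand without worrying about case.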

Processing the entered phrases requires a procedure that searches the phrase for keywords.

Keywords are written in the base only in uppercase or only in lowercase. The entered phrase must accordingly be converted to uppercase (or lowercase) before processing.

In the simplest case the records of the base are gone through in order, and the user is shown the answer for the first keyword of the base found in the phrase. Two ways of building the base are then possible: in one, all keywords are considered equally significant; in the other, there are priorities (more significant keywords then come earlier in the base). A definite order of records in the base also matters when one keyword is part of another.
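
The effect of record order when one keyword is part of another can be seen in a small sketch. Sorting by keyword length is only one possible way to set priorities, chosen here for brevity; the record for "NAME" is hypothetical.

```python
def first_match(base, phrase):
    """Return the answer for the first keyword found in the phrase."""
    text = phrase.upper()
    for keyword, answer in base:
        if keyword in text:
            return answer
    return None

def longest_first(base):
    """Order records so that a keyword containing another keyword
    is checked before the shorter one."""
    return sorted(base, key=lambda record: len(record[0]), reverse=True)

base = [
    ("NAME", "Names are not that important."),  # hypothetical record
    ("YOUR NAME", "My name is Program."),
]
print(first_match(base, "What is your name?"))
print(first_match(longest_first(base), "What is your name?"))
```

In the original order the shorter keyword "NAME" shadows "YOUR NAME"; after reordering, the more specific record wins.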

Another case is taking into account the degree to which the keywords match the entered phrase. Here the phrase is checked for keywords from different records, and the answer given is the phrase from the record with the largest number of matching keywords. The matching check becomes more complicated. You have to decide what "the largest number of matching words" means in this case: whether the overall length of the user's phrase is taken into account; whether words that occur in the phrase but carry no meaning are counted; and whether words are treated as words proper (that is, a sequence of characters bounded on both sides by spaces, punctuation marks, or the beginning or end of the line) or simply as sequences of characters (in the first case, the length of the user's phrase can be measured as the total number of words in it; in the second, as its length in characters).
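
One simple reading of "the largest number of matching keywords" is sketched below: keywords are treated as whole words, and the length of the user's phrase is ignored. Both choices are assumptions of this example, as are the function names and the sample records.

```python
import re

def score(record_keywords, phrase):
    """Count how many of the record's keywords are whole words
    of the phrase."""
    words = set(re.findall(r"\w+", phrase.upper()))
    return sum(1 for k in record_keywords if k in words)

def best_answer(base, phrase):
    """Return the answer of the record with the most matching
    keywords, or None if nothing matches."""
    best, best_score = None, 0
    for keywords, answer in base:
        s = score(keywords, phrase)
        if s > best_score:
            best, best_score = answer, s
    return best

base = [
    (["YOU", "LOVE"], "I have no pronounced preferences."),
    (["YOU", "KNOW", "MUCH"], "Unfortunately, I still know very little..."),
]
```

When two records tie, this sketch keeps the earlier one, so the order of the base still serves as a priority.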

The structure of the base can be made more complex: context keywords can be added (they must occur not in the last phrase entered but in previous phrases of the user and/or the program), indicators of emotional state, and so on.

How to build a phrase base for an artificial intelligence emulator?

Of course, if we are talking about a real artificial intelligence, everything is simple. A real AI is hard to write, but teaching it is a pleasure. Talk to it often, and sooner or later your artificial intelligence will be able to hold a conversation at a decent level, even if at first it knew almost nothing. With emulators things are worse. They, on the contrary, are fairly easy to write but hard to teach (if we mean meaningful teaching; mindlessly dumping every phrase into one and the same file cannot count as real teaching, because the program will soon start producing meaningless replies). AI emulators are usually created with some ready-made base of replies. This base can later be extended (by the program or by hand), but some foundation has to be there. The base holds phrases, words, or parts of words to which the program must be able to react correctly from the start, together with the response phrases. What those responses will be, let your imagination suggest. In my experience, though, the same response phrases often wander from one program to another. Borrowing a dozen or two such successful phrases (without claiming authorship, of course) is not, in my view, plagiarism: such phrases simply become aphorisms of a sort, like Fomenko's sayings (which I mention for a reason: they turn up in the bases of many AI emulators). Studying someone else's base to gain useful experience is also of considerable benefit. But borrowing someone else's base wholesale is not good, and not only because it is plagiarism. Who needs two dozen conversation partners with one and the same vocabulary? So let us act honestly, in keeping with the commandment "thou shalt not steal", and not borrow someone else's base.

How do you build your own base? As I have already said, whatever the method of implementation, you need to select keywords and come up with response phrases. Let us start with the keywords: words, phrases, whole sentences or, conversely, parts of words. All of these can be divided into categories.

Another important point is spelling and punctuation. The user may enter phrases written without errors as well as phrases with errors. If there are several spellings, one correct and several wrong, the program must be able to react to the correct one and, where possible, to the most common wrong ones. For example, if you want the program to react to apologies, it must react to the correct spelling "извини" ("sorry") and preferably also to the most popular misspelling "извени" (if the program is expected to be used mainly by people who write it that way), while rare typos such as "изввини" can naturally be ignored. The same applies to punctuation: it is desirable to allow for the correct variant and the most popular incorrect ones.
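
One way to handle popular misspellings, sketched here under the assumption that each keyword has a short hand-made list of accepted spellings, is to replace them with the correct form before the keyword search. The VARIANTS table and the function name are invented for the example.

```python
# Canonical keyword -> spellings the program should recognize.
VARIANTS = {
    "ИЗВИНИ": ["ИЗВИНИ", "ИЗВЕНИ"],
}

def normalize(phrase):
    """Replace known misspellings with the canonical keyword."""
    text = phrase.upper()
    for canonical, spellings in VARIANTS.items():
        for spelling in spellings:
            text = text.replace(spelling, canonical)
    return text
```

After normalization the base needs only the correct spelling as a keyword, and rare typos that are not in the table simply pass through unchanged.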

Where do the responses come from? Naturally, it is best to invent them yourself, and in most cases that is what is done. But there are also quite popular sources of replies for AI emulators: jokes, aphorisms, sayings, Fomenko's phrases, quotations from songs and from favorite films and books, and so on. Most often such a phrase comes to mind instantly. What if it does not? You can search painstakingly for a well-known phrase that fits the occasion. Or you can make it easier. Suppose we need to answer the question "What is money?". You can answer "How should I know" or come up with your own definition, which sometimes turns out quite original. But if the invented answer is not very original, go to Yandex, for example, and look for pages of aphorisms or anecdotes that contain the word "money". Among the results there are sure to be a couple of phrases that satisfy your exacting taste. Of course, this is just one of many ways...
