Stemming and the Porter stemmer

Lecture



Stemming is the process of finding the stem of a word for a given source word. The stem found this way does not necessarily coincide with the morphological root of the word.

The task of finding the stem of a word is a long-standing problem in computer science. The first publication on the subject dates back to 1968. Stemming is used in search engines to expand the user's search query, and it is part of the text normalization process.

A particular method of finding the stem of a word is called a stemming algorithm, and a particular implementation is called a stemmer.

History

The first published stemmer was written by Julie Beth Lovins in 1968 [1]. This paper is notable for its early publication date and had a great influence on later work in the area.

A later stemmer was written by Martin Porter and published in 1980. It was very widely used and became the de facto standard algorithm for English texts. Dr. Porter received the Strix award in 2000 for his work on stemming and information retrieval.

Many implementations of Porter's stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws and therefore did not work at full strength. To eliminate this type of error, Martin Porter released an official free implementation of the algorithm around 2000. He continued this work over the following years by developing Snowball, a framework for building stemming algorithms, and produced improved English stemmers as well as stemmers for several other languages.

Algorithms

There are several types of stemming algorithms that differ in terms of performance, accuracy, and how certain problems of stemming are overcome.

Search algorithms

A simple stemmer looks up the inflected form in a lookup table. The advantages of this approach are simplicity, speed and easy handling of exceptions. The disadvantages are that all inflected forms must be listed explicitly in the table: new or unfamiliar words will not be processed even if they are perfectly regular (for example, iPads ~ iPad), and the lookup table can be very large. For languages with simple morphology, such as English, the tables are small, but for highly inflected languages (for example, Turkish) a table may contain hundreds of possible inflected forms for each root.

The lookup tables used in stemmers are usually generated semi-automatically. For example, for the English word "run", the forms "running", "runs", "runned" and "runly" would be generated automatically. The last two forms are valid constructions, but they are unlikely to appear in ordinary English text.
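As an illustration only (the table entries and the function name below are made up for this sketch, not taken from any particular stemmer), the lookup approach reduces to a dictionary lookup with a fallback for unknown words:

```python
# A minimal sketch of a lookup-table stemmer; the table is illustrative and
# would normally be generated semi-automatically from a corpus or dictionary.
lookup_table = {
    "running": "run",
    "runs": "run",
    "ran": "run",        # irregular forms are trivial to handle explicitly
    "ipads": "ipad",
}

def lookup_stem(word: str) -> str:
    """Return the stem from the table, or the word unchanged if it is unknown."""
    return lookup_table.get(word.lower(), word)

print(lookup_stem("Running"))  # -> run
print(lookup_stem("iPads"))    # -> ipad
print(lookup_stem("walked"))   # -> walked (new or unfamiliar forms are not processed)
```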

The lookup approach can use prior part-of-speech tagging to avoid the kind of error in which different words are mapped to the same lemma (overstemming) [2].

Truncation Algorithms

Truncation algorithms do not use a lookup table of inflected forms and root-form relations. Instead, a smaller list of "rules" is stored, which the algorithm applies, given the form of a word, to find its stem [3]. Some examples of such rules are:

  • if the word ends in 'ed', delete 'ed'
  • if the word ends in 'ing', delete 'ing'
  • if the word ends in 'ly', delete 'ly'

Truncation algorithms are much more efficient than exhaustive lookup. Developing them requires a programmer reasonably well versed in linguistics, particularly morphology, who can encode the truncation rules. Truncation algorithms perform poorly on exceptional cases (for example, 'ran' and 'run'). The solutions produced by suffix-truncation algorithms are limited to those parts of speech that have well-known endings and suffixes, with a few exceptions. This is a serious limitation, since not all parts of speech have a well-formulated set of rules. Lemmatization attempts to remove this restriction.
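A minimal sketch of the three rules listed above might look as follows (the rule list and the length condition are illustrative assumptions, not part of any published algorithm):

```python
# Suffix-truncation sketch implementing only the three rules above; real
# stemmers add conditions on what remains after the suffix is removed.
RULES = ["ed", "ing", "ly"]   # suffixes to strip, tried in order

def truncate(word: str) -> str:
    for suffix in RULES:
        # keep at least two characters of the stem (illustrative condition)
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print(truncate("jumped"))   # -> jump
print(truncate("quickly"))  # -> quick
print(truncate("ran"))      # -> ran (irregular forms are not handled)
```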

Prefix truncation algorithms can also be implemented. However, not all languages ​​have prefixes and suffixes.

Additional Algorithm Criteria

Truncation algorithms can differ in their results for a variety of reasons. One is whether the algorithm requires that its output be a real word of the given language. Some approaches do not require the output to belong to the language's vocabulary. Other algorithms maintain a database of all known morphological roots that exist as real words and check for the presence of a term in the database when making decisions. Typically, if a term is not found, alternative actions are taken; these alternative actions may use slightly different decision criteria. A non-existent term may trigger the application of alternative truncation rules.

It may happen that two or more truncation rules apply to the same input term, which creates ambiguity about which rule to apply. The algorithm may assign priorities to such rules (manually or stochastically), or it may reject a rule that produces a non-existent term while another rule does not. For example, for the English term "friendlies", the algorithm may detect the suffix "ies", apply the corresponding rule and obtain "friendl". The term "friendl" will most likely not be found in the lexicon, and the rule will therefore be rejected.

One improvement over suffix truncation is the use of suffix and ending substitution. Like a truncation rule, a substitution rule replaces a suffix or ending with an alternative one. For example, there may be a rule that replaces "ies" with "y". Where truncation rules lead to a term that does not exist in the lexicon, substitution rules solve the problem. In this example, "friendlies" is converted to "friendly" instead of "friendl".

Usually the application of these rules is cyclic or recursive. After the first application of the substitution rule in this example, the algorithm selects the next rule for the term "friendly", which detects and applies the truncation rule for the suffix "ly". Thus, the term "friendlies" becomes "friendly" by the substitution rule, which in turn becomes "friend" after the truncation rule is applied.
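The friendlies → friendly → friend chain can be sketched as follows; the rule lists, the toy lexicon and the function name are assumptions made for this example only:

```python
# Substitution and truncation rules applied repeatedly until no rule fires.
SUBSTITUTION_RULES = [("ies", "y")]        # e.g. friendlies -> friendly
TRUNCATION_RULES = ["ly", "ed", "ing"]     # e.g. friendly -> friend
LEXICON = {"friend", "friendly", "run"}    # known real words (toy lexicon)

def stem(word: str) -> str:
    changed = True
    while changed:
        changed = False
        for suffix, replacement in SUBSTITUTION_RULES:
            if word.endswith(suffix):
                candidate = word[: -len(suffix)] + replacement
                if candidate in LEXICON:   # reject rules that give non-words
                    word, changed = candidate, True
                    break
        for suffix in TRUNCATION_RULES:
            if word.endswith(suffix):
                candidate = word[: -len(suffix)]
                if candidate in LEXICON:
                    word, changed = candidate, True
                    break
    return word

print(stem("friendlies"))  # friendlies -> friendly -> friend
```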

This example also demonstrates the difference between the rule-based method and brute-force lookup. With exhaustive lookup, the algorithm would search for the term "friendlies" among hundreds of thousands of inflected word forms and, ideally, find the corresponding stem "friend". With the rule-based method, the rules are applied in sequence and arrive at the same solution, most likely faster.

Affix stemmers

In linguistics, the most common terms for affixes are suffix and prefix. In addition to approaches that handle suffixes or endings, some also handle prefixes. For example, for the English word indefinitely, such a method determines that the construction "in" at the beginning of the word is a prefix that can be removed to obtain the stem. Many of the methods mentioned above can also take this approach; for example, a suffix-truncation algorithm may handle prefixes as well as suffixes and endings, in which case it goes by a different name and follows this approach. Studies of affix stemmers for several European languages can be found in (Jongejan et al. 2009).
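A minimal sketch of prefix plus suffix stripping, with illustrative affix lists and an arbitrary minimum-length condition (real affix stemmers use much more careful constraints):

```python
PREFIXES = ["in", "un"]
SUFFIXES = ["ly", "ed", "ing"]

def affix_stem(word: str) -> str:
    for prefix in PREFIXES:
        # only strip if a reasonably long remainder is left (illustrative)
        if word.startswith(prefix) and len(word) - len(prefix) >= 4:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[: -len(suffix)]
            break
    return word

print(affix_stem("indefinitely"))  # -> definite ('in' and 'ly' are stripped)
print(affix_stem("interest"))      # -> terest (over-eager prefix removal)
```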

Lemmatization Algorithms

A more complex approach to determining the stem of a word is lemmatization. To understand how it works, one needs to know how the various forms of a word are created. Most words change when used in different grammatical forms: the ending of the word is replaced by a grammatical ending, producing a new form of the original word. Lemmatization performs the inverse transformation: it replaces the grammatical ending with the suffix or ending of the initial (dictionary) form [4].

Lemmatization also includes determining the part of speech of a word and applying different normalization rules for each part of speech. Part-of-speech determination takes place before the attempt to find the stem, since for some languages the stemming rules depend on the word's part of speech.

This approach depends heavily on a correct determination of the lexical category (part of speech). Although there is some overlap between the normalization rules of certain lexical categories, choosing the wrong category or failing to determine the correct one negates the advantage of this approach over suffix-truncation algorithms. The basic idea is that if the stemmer can obtain more information about the word being processed, it can apply more precise normalization rules.
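As an illustration of this dependence on part of speech, NLTK's WordNet-based lemmatizer gives different results for the same surface form depending on the lexical category passed to it (this assumes NLTK and its WordNet data are installed):

```python
# pip install nltk;  then:  import nltk; nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("meeting", pos="v"))  # treated as a verb  -> meet
print(lemmatizer.lemmatize("meeting", pos="n"))  # treated as a noun  -> meeting
print(lemmatizer.lemmatize("better", pos="a"))   # adjective          -> good
```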

Ripple-down rules

Ripple-down rules were originally developed for knowledge acquisition and for maintaining rule-based systems. In this approach, knowledge is acquired on the basis of the current context and added incrementally. Rules are created to classify cases in a specific context.

Unlike standard classification rules, ripple-down rules use exceptions to existing rules, so changes affect only the context of a given rule and do not affect other rules. Knowledge acquisition tools help find and modify conflicting rules. Here is a simple example of a ripple-down rule:

if a ^ b then c

except if d then e

else if f ^ g then h

This rule can be read as follows: "if a and b are true, then we conclude c, except when d holds. If d is true (the exception), then we conclude e. If a and b are not true, we move to the next rule and conclude h if f and g are true." This form of rules solves the lemmatization problem very well [5].

To create an exception to a rule, the algorithm must first find the word that induced the exception. After that, the differences between the two words are determined, and the rule's exception condition is derived from these differences.
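The rule above can be sketched in code as a small tree of rules, each with an optional exception branch and an optional fallback branch; the class and its fields are invented for this illustration:

```python
# A minimal sketch of a ripple-down rule node encoding:
#   if a and b then c, except if d then e, else if f and g then h
class Rule:
    def __init__(self, condition, conclusion, exception=None, otherwise=None):
        self.condition = condition      # predicate over the case
        self.conclusion = conclusion
        self.exception = exception      # refines the rule when it fires
        self.otherwise = otherwise      # tried when the rule does not fire

    def evaluate(self, case):
        if self.condition(case):
            if self.exception and self.exception.condition(case):
                return self.exception.evaluate(case)
            return self.conclusion
        if self.otherwise:
            return self.otherwise.evaluate(case)
        return None

rule = Rule(
    condition=lambda case: case["a"] and case["b"],
    conclusion="c",
    exception=Rule(lambda case: case["d"], "e"),
    otherwise=Rule(lambda case: case["f"] and case["g"], "h"),
)

print(rule.evaluate({"a": True, "b": True, "d": False, "f": False, "g": False}))  # c
print(rule.evaluate({"a": True, "b": True, "d": True,  "f": False, "g": False}))  # e
print(rule.evaluate({"a": False, "b": True, "d": False, "f": True, "g": True}))   # h
```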

Stochastic Algorithms

Stochastic algorithms involve the probabilistic determination of the root form of a word. They build a probabilistic model and are trained on a correspondence table of root and inflected forms. The model is usually expressed as complex linguistic rules, similar in nature to those used in suffix truncation and lemmatization. Stemming is performed by feeding inflected forms to the trained model, which generates the root form according to its internal rule set; decisions about which rule or sequence of rules to apply, and about the choice of stem, are made so that the resulting correct word has the highest probability (incorrect words have the lowest).

Some lemmatization algorithms are stochastic in the sense that a word may belong to several parts of speech, each with a certain probability. These algorithms may also take the surrounding words, called the context, into account; context-free grammars do not use any such additional information. In either case, after a probability is assigned to each possible part of speech, the most probable part of speech is selected, along with the corresponding rules for producing the normalized form.

Statistical algorithms

N-gram analysis

Some stemming algorithms use N-gram analysis in order to select a suitable stem for a word [6].
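One simple use of N-grams is to measure how strongly the character n-gram profiles of two word forms overlap, so that forms with high overlap can be grouped under one stem candidate; the functions below are an illustrative sketch, not a published algorithm:

```python
# Character bigram overlap (Dice coefficient) as a crude similarity measure.
def ngrams(word: str, n: int = 2) -> set:
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a: str, b: str) -> float:
    x, y = ngrams(a), ngrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

print(round(dice("stemming", "stemmer"), 2))  # high overlap: likely same stem
print(round(dice("stemming", "night"), 2))    # low overlap: unrelated forms
```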

Text corpus stemming

One of the main drawbacks of classical stemmers (for example, the Porter stemmer) is that they often fail to distinguish words with similar spelling but completely different meanings. For example, "news" and "new" are both reduced to the stem "new", although these words belong to different lexical categories. Another problem is that a stemming algorithm may be suitable for one corpus but cause too many errors in another: for example, the words "stock", "stocks", "stocking", etc. have a special meaning in texts from The Wall Street Journal. The basic idea of corpus-based stemming is to create equivalence classes of words with a classical stemmer and then "break apart" some of the words that were grouped together, based on their occurrence in the corpus. This also helps prevent well-known conflations of the Porter algorithm, such as "policy / police", since the chance of these words occurring together is quite low [7].
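A toy sketch of the idea follows; the grouping function, the documents and the co-occurrence criterion are all invented for this illustration, and the real method in [7] is considerably more elaborate:

```python
from collections import defaultdict

def equivalence_key(word: str) -> str:
    # crude stand-in for a classical stemmer: group words by a short prefix
    return word[:5]

# toy corpus: each document is the set of words it contains
documents = [
    {"police", "policeman"},
    {"policy", "policies"},
    {"policy", "government"},
]

def cooccurrences(w1: str, w2: str) -> int:
    return sum(1 for doc in documents if w1 in doc and w2 in doc)

classes = defaultdict(set)
for word in {"police", "policeman", "policy", "policies"}:
    classes[equivalence_key(word)].add(word)

# break apart pairs that share a stem class but never co-occur in a document
for key, words in classes.items():
    for w1 in sorted(words):
        for w2 in sorted(words):
            if w1 < w2 and cooccurrences(w1, w2) == 0:
                print(f"split '{w1}' / '{w2}' despite shared class '{key}'")
```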

Matching algorithms

Such algorithms use a database of stems (for example, a set of documents containing word stems). These stems are not necessarily ordinary words; in most cases a stem is simply a substring (for example, for English, "brows" is a substring of the words "browse" and "browsing"). To determine the stem of a word, the algorithm tries to match it against stems from the database, applying various restrictions, for example on the length of the candidate stem relative to the length of the word itself (for example, the short prefix "be", which is the stem of words such as "be", "been" and "being", would not be taken as the stem of the word "beside").
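A minimal sketch of matching against a stem database with a length constraint; the database, the coverage threshold and the function name are assumptions of this example:

```python
STEM_DB = {"brows", "be", "run"}

def match_stem(word: str, min_cover: float = 0.4) -> str:
    # candidate stems must be a prefix of the word and cover enough of it
    candidates = [s for s in STEM_DB
                  if word.startswith(s) and len(s) / len(word) >= min_cover]
    return max(candidates, key=len) if candidates else word

print(match_stem("browsing"))  # -> brows
print(match_stem("being"))     # -> be     ('be' covers 2/5 of the word)
print(match_stem("beside"))    # -> beside ('be' covers only 2/6, too little)
```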

Hybrid approaches

Hybrid approaches use two or more of the methods described above. A simple example is an algorithm that uses a suffix tree and, at the start of its work, consults lookup tables obtained by exhaustive enumeration. However, instead of storing the whole set of relations between the words of a particular language, the lookup table keeps only a small number of "frequent exceptions" (for example, for English, "ran => run"). If the word is not in the exception list, suffix-truncation or lemmatization algorithms are applied to obtain the result.
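A minimal sketch of such a hybrid: an exception table consulted first, with suffix stripping as a fallback (the table, rule list and length check are illustrative):

```python
EXCEPTIONS = {"ran": "run", "geese": "goose"}
SUFFIXES = ["ing", "ed", "s"]

def hybrid_stem(word: str) -> str:
    if word in EXCEPTIONS:          # frequent exceptions handled by lookup
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:         # otherwise fall back to suffix rules
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(hybrid_stem("ran"))     # -> run  (exception table)
print(hybrid_stem("walked"))  # -> walk (suffix rule)
```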

Languages

Language features

While most of the early scientific work in this area focused on English (mainly using Porter's stemming algorithm), later work has addressed many other languages [8] [9] [10] [11] [12].

Hebrew and Arabic are still considered difficult research targets for stemming. English stemming algorithms are fairly trivial (with only occasional problems, for example, the word "dries" being the third-person singular present-tense form of the verb "dry", or the word "axes" being the plural of both "axe" and "axis"); but stemmers become harder to design as a morphologically and orthographically more complex target language is chosen. For example, stemmers for Italian are more complex than stemmers for English (because of the large number of inflected verb forms), implementations for Russian are harder still (many noun declensions), for Hebrew even harder (due to non-concatenative morphology, writing without vowels and the need for prefix truncation: Hebrew word stems can be two, three or four characters long, but no more), and so on.

Multilingual stemming algorithms apply the morphological rules of two or more languages ​​simultaneously.

Stemming for the Russian language

Russian belongs to the group of inflected synthetic languages, that is, languages in which word formation predominantly uses affixes that combine several grammatical meanings at once (for example, the ending of the Russian adjective хороший 'good' simultaneously indicates singular number, masculine gender and nominative case); the language therefore admits stemming algorithms. Russian has complex morphological variability of words, which is a source of errors in stemming. As a solution to this problem, alongside classical stemming algorithms, lemmatization algorithms can be used, which reduce words to their initial base form.

Consider the most popular stemmer implementations for Russian, which are based on different principles and can handle words that are not in any dictionary.

Porter stemmer

The basic idea of Porter's stemmer is that there is a limited number of word-forming suffixes, and stemming is performed without any database of stems: only the set of existing suffixes and manually defined rules are used.

The algorithm consists of five steps. At each step a word-forming suffix is removed and the remainder is checked against the rules (for example, for Russian words the stem must contain at least one vowel). If the resulting word satisfies the rules, the algorithm proceeds to the next step; if not, it chooses a different suffix to remove. In the first step the maximal inflectional suffix is removed, in the second the letter «и», in the third the derivational suffix, and in the fourth the suffixes of superlative forms, «ь» and one of a doubled «н» [13].

This algorithm often trims the word more than necessary, which makes it harder to obtain the correct stem: for example, кровать → крова (here the truly invariant part is кроват, but the stemmer chooses the longest morpheme to remove). Porter's stemmer also fails to handle various changes in the word root (for example, fleeting vowels that drop out or alternate).
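For reference, a Snowball (Porter-family) stemmer for Russian is available, for example, through NLTK; the snippet below only assumes that NLTK is installed and merely illustrates the call:

```python
# pip install nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")
for word in ["кровати", "бегущий", "хороший"]:
    print(word, "->", stemmer.stem(word))
```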

Stemka

This stemming algorithm (analyzer) was developed by Andrei Kovalenko in 2002. It is based on a probabilistic model: words from a training text are parsed by the analyzer into pairs "last two letters of the stem" + "suffix", and if such a pair is already present in the model its weight is increased, otherwise it is added to the model. The resulting data array is then ranked in descending order of weight, and models whose probability of realization is less than 1/10000 are cut off. The result, a set of potential endings with conditions on the preceding characters, is inverted for convenient right-to-left scanning of word forms and represented as a state-machine transition table. During parsing, the word is scanned according to the constructed transition tables. A special rule was also added stating that the immutable stem must contain at least one vowel [14].

The analyzer is available in source code and can be used freely, provided the source is referenced [15] [16].

MyStem

The MyStem stemmer was developed by Ilya Segalovich in 1998. It is now owned by Yandex [17]. In the first step, a suffix tree is used to determine the possible boundaries between the stem and the suffix in the input word; then, for each potential stem (starting with the longest), a binary search in the tree of stems checks for its presence in the dictionary or finds the stems closest to it (the measure of proximity is the length of the common "tail"). If the word is a dictionary word, the algorithm finishes; otherwise it proceeds to the next segmentation.

If the stem variant does not match any of the "closest" dictionary stems, this means that the analyzed word with this stem variant is absent from the dictionary. Then, based on the existing stem, the suffix and the model of the "closest" dictionary stem, a hypothetical inflection model of the given word is generated. The hypothesis is remembered, and if it has been built before, its weight is increased. If the word is never found in the dictionary, the required length of the common stem ending is reduced by one and the tree is searched for new hypotheses. When the length of the common "tail" reaches 2, the search stops and the hypotheses are ranked by productivity: if the weight of a hypothesis is five or more times smaller than the largest weight, the hypothesis is discarded. The result of the stemmer's work is the resulting set of hypotheses for a non-dictionary word, or a single hypothesis for a dictionary word [18].

The stemmer may be used for commercial purposes, with the following exceptions: creating and distributing spam, search-engine optimization of sites, and developing products and services similar to those of Yandex [17]. Source code is not distributed [19]. To install, it is enough to download and unpack the archive [20].
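MyStem is typically called from Python through the third-party pymystem3 wrapper, which downloads the MyStem binary on first use; the snippet below assumes that wrapper is installed and is only an illustration of its basic call:

```python
# pip install pymystem3
from pymystem3 import Mystem

mystem = Mystem()
# lemmatize() returns a list of lemmas interleaved with the original separators
print(mystem.lemmatize("стемминг русских слов"))
```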

Types of errors

There are two types of errors in stemming algorithms: overstemming and understemming. Overstemming is an error of the first kind, when inflected forms of different words are wrongly assigned to one lemma. Understemming is an error of the second kind, when morphological forms of one word are assigned to different lemmas. Stemming algorithms try to minimize both errors, although reducing one type of error may increase the other [21].

Consider these error types in the Porter stemming algorithm. Overstemming: the algorithm maps the words "universal", "university" and "universe" to the stem "univers"; although these words are etymologically related, their modern meanings belong to different domains, so treating them as synonyms is incorrect. Understemming: Porter's algorithm maps words derived from one lemma to different stems and therefore to different lemmas - "alumnus" → "alumnu", "alumni" → "alumni", "alumna" / "alumnae" → "alumna" (these words retain Latin features in their morphology, and such near-synonyms are not conflated by the stemmer).
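Both error types are easy to reproduce with the reference Porter implementation shipped in NLTK (assuming NLTK is installed):

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# overstemming: etymologically related but semantically distant words
# collapse to a single stem
print([stemmer.stem(w) for w in ["universal", "university", "universe"]])
# understemming: forms of one lexeme keep different stems
print([stemmer.stem(w) for w in ["alumnus", "alumni"]])
```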

Application

Stemming is used as an approximate method for grouping words with similar basic meanings. For example, a text that mentions "daffodils" is probably closely related to a text that mentions "daffodil" (without the "s"). But in some cases words with the same stem have idiomatic meanings that are almost unrelated: a user searching for documents containing "marketing" will also receive documents that mention "markets" but not "marketing", which quite likely does not correspond to the user's information need.

Information retrieval

Stemming is quite common in search engines. However, fairly soon the effectiveness of stemming in English-language search engines was found to be rather limited, and this led early information-retrieval researchers to consider stemming inapplicable in the general case [22] [23]. In search engines, an approach based on searching for N-grams rather than stems can be used instead. In addition, recent studies have shown great advantages of N-gram search for languages other than English [24] [25].

Domain Analysis

When subject domains are analyzed with the help of stemming, dictionaries of these domains are built [26].

Use in commercial products

Many commercial companies have used stemming since at least the 1980s and have produced algorithmic and lexical stemmers for many languages [27] [28].

Snowball stemmers have been compared with commercial ones, with mixed results [29] [30].

The Google search engine began using stemming in 2003 [31]. Previously, a search for "fish" would not have returned results containing "fishing".

Porter stemmer

Porter's stemmer is a stemming algorithm published by Martin Porter in 1980. The original version of the stemmer was intended for English and was written in BCPL. Subsequently Martin Porter created the "Snowball" project and, using the basic idea of the algorithm, wrote stemmers for common Indo-European languages, including Russian [1].

The algorithm does not use a database of word stems; it merely applies a sequence of rules that cut off endings and suffixes based on the features of the language, so it works quickly but not always without errors.

The algorithm was very popular and widely replicated; changes were often made to it by different developers, and not always successfully. Around 2000, Porter decided to "freeze" the project and to distribute a single implementation of the algorithm (in several popular programming languages) from his website.



