
Character Recognition

Lecture



When it comes to recognizing printed characters, note that an almost endless variety of printed matter is produced from a limited set of original characters. These characters are grouped by style (a set of artistic design decisions) that distinguishes one group from another. One such group, comprising all alphabetic characters, digits and a standard set of auxiliary characters, is called a typeface. However, among the wide circle of people who produce documentation of various kinds, another name for the typeface has taken hold: the font. This is the term we will use from here on.

So any printed text has a primary property: the fonts in which it is printed. From this point of view, there are two classes of printed character recognition algorithms: font and omnifont. Font, or font-dependent, algorithms use a priori information about the font in which the letters are printed. This means that the OCR program must be presented with a full sample of text printed in this font. The program measures and analyzes various characteristics of the font and stores them in its base of reference characteristics. At the end of this process, the font-based optical character recognition (OCR) program is ready to recognize this particular font. This process can be called training the program. (Recently, tasks that require training have come to be associated with neural networks, but techniques that do not use neural networks are also being developed here.) The training is then repeated for a certain set of fonts, which depends on the intended scope of the program. The disadvantages of this approach include the following factors:

  • The algorithm must know in advance the font presented to it for recognition, i.e. it must store the various characteristics of this font in its database. The quality of recognition of text printed in an arbitrary font depends directly on how closely the characteristics of that font correlate with those of the fonts available in the program's database. Given the existing wealth of printed materials, it is impossible to cover all fonts and their modifications during training. For example, Polygraphbummash USSR at one time standardized about 15-20 different fonts, whereas modern computer-based document layout systems use more than 100 fonts. In other words, this factor limits the versatility of such algorithms.
  • To operate, the recognition program needs a module that tunes it to a specific font. Obviously, this module contributes its own share of errors to the overall recognition quality; otherwise, the task of selecting the font has to be assigned to the user.
  • A program based on a font-dependent recognition algorithm requires the user to have special knowledge about fonts in general, their groups and how they differ from each other, and the fonts in which the user's document is printed. Note that if the paper document was not created by the user but came from outside, there is no standard way to find out which fonts it was printed in. The need for specialist knowledge narrows the circle of potential users and shifts it towards organizations that employ the appropriate specialists.

On the other hand, the font approach has an advantage thanks to which it is actively used and, apparently, will continue to be used in the future. Namely, with detailed a priori information about the characters, one can construct highly accurate and reliable recognition algorithms. In general, when constructing a font-dependent recognition algorithm (as opposed to a fontless one, discussed below), the reliability of character recognition is an intuitively clear quantity that can be expressed with mathematical precision. It is defined as the distance, in some metric space, from the reference symbol presented to the program during training to the symbol that the program is trying to recognize.
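The distance-based notion of reliability can be illustrated with a minimal sketch. The lecture does not specify the feature set or the metric, so everything here is an assumption made for illustration: the reference base `REFERENCE_BASE`, the flattened 3x3 bitmaps used as feature vectors, and the choice of Euclidean distance.

```python
import math

# Hypothetical reference base for a single font: each character maps to a
# feature vector measured during training (here, a flattened 3x3 bitmap).
REFERENCE_BASE = {
    "I": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "L": [1, 0, 0, 1, 0, 0, 1, 1, 1],
    "T": [1, 1, 1, 0, 1, 0, 0, 1, 0],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, base=REFERENCE_BASE):
    """Return (character, distance) for the closest reference symbol.

    The distance serves as the reliability estimate: the smaller it is,
    the closer the input is to a symbol seen during training.
    """
    best_char = min(base, key=lambda ch: euclidean(features, base[ch]))
    return best_char, euclidean(features, base[best_char])


# A "T" with one pixel of the stem missing (assumed test input).
noisy_t = [1, 1, 1, 0, 1, 0, 0, 0, 0]
char, dist = recognize(noisy_t)
print(char, dist)
```

A real program would compare the distance against a rejection threshold and would use far richer features than raw pixels; the sketch only shows why the reliability estimate arises naturally from the matching procedure itself.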

The second class of algorithms is fontless, or font-independent, i.e. algorithms that have no a priori knowledge of the characters arriving at their input. These algorithms measure and analyze various characteristics (features) inherent in letters as such, regardless of the font and the absolute size (point size) in which they are printed. In the limiting case, a font-independent algorithm may have no training process at all. The characteristics of the characters are then measured, encoded and placed in the program's database by a person. In practice, however, cases where this path solves the task exhaustively are rare. A more general way to create a database of characteristics is to train the program on a sample of real characters. The disadvantages of this approach include the following factors:

  • The recognition quality actually attainable is lower than that of font-dependent algorithms. This is because the level of generalization in measuring the characteristics of characters is much higher than in font-dependent algorithms. In practice, this means that the various tolerances and coarsenings applied when measuring character characteristics can be 2–20 times greater for fontless algorithms than for font-dependent ones.
  • It should be considered a great success if a fontless algorithm has an adequate and physically sound reliability coefficient for recognition, i.e. one that arises naturally from the basic procedure of the algorithm. Often one has to accept that the reliability estimate is either missing or artificial. An artificial estimate is one that differs substantially from the probability of correct recognition that the algorithm actually provides.
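To make the idea of features that do not depend on absolute size concrete, here is a minimal sketch based on zoning: the glyph's bounding box is divided into a fixed grid and the ink density of each zone is measured. The lecture does not name a feature set, so zoning, the function `zone_features` and the sample bitmaps are assumptions chosen for illustration.

```python
def zone_features(bitmap, grid=2):
    """Ink density in each cell of a grid x grid partition of the bounding box.

    `bitmap` is a list of rows of 0/1 pixels containing at least one ink pixel.
    Cropping to the bounding box removes the dependence on position, and
    using densities instead of pixel counts removes the dependence on size.
    """
    rows = [r for r in range(len(bitmap)) if any(bitmap[r])]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    height, width = bottom - top, right - left

    features = []
    for gy in range(grid):
        for gx in range(grid):
            y0 = top + gy * height // grid
            y1 = top + (gy + 1) * height // grid
            x0 = left + gx * width // grid
            x1 = left + (gx + 1) * width // grid
            cells = [bitmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # A zone can be empty when the glyph is smaller than the grid.
            features.append(sum(cells) / len(cells) if cells else 0.0)
    return features


# The same "L" shape at two sizes; the large one also has an empty margin.
small_l = [
    [1, 0],
    [1, 1],
]
large_l = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(zone_features(small_l))
print(zone_features(large_l))
```

Both bitmaps yield the same feature vector, which is the property a font-independent algorithm relies on. The coarse grid also shows the cost mentioned above: the coarser the measurement, the more distinct shapes collapse onto the same features.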

The merits of this approach are closely related to its shortcomings. The main advantages are as follows:

  • Versatility. On the one hand, this means that the approach is applicable in cases where the potential variety of symbols that can arrive at the system's input is large. On the other hand, thanks to their built-in ability to generalize, such algorithms can extrapolate the accumulated knowledge beyond the training sample, i.e. reliably recognize characters that differ substantially from those present in the training set.
  • Manufacturability. The training process of font-independent algorithms is usually simpler and more integrated, in the sense that the training set is not split into different classes (by font, point size, etc.). Nor is there any need to maintain the various conditions for the coexistence of these classes in the database of characteristics (no correlation, no mixing, a unique naming system, etc.). Another manifestation of manufacturability is that it is often possible to create almost fully automated training procedures.
  • Convenience in using the program. If the program is built on font-independent algorithms, the user is not required to know anything about the page being entered into the computer or to pass such knowledge to the program. The user interface is also simpler, since it lacks the options and dialogs that serve training and management of the database of characteristics. The recognition process can then be presented to the user as a “black box” (although the user is then completely unable to control or modify the recognition process in any way). As a result, the circle of potential users expands to include people with minimal computer literacy.

Synthesis of two approaches

The features, advantages and disadvantages of the two approaches to building OCR algorithms have been described above. It follows from this review that the advantages and disadvantages of both approaches are determined by the same properties of the algorithms: a greater or lesser degree of universality, the attainable recognition accuracy, etc. The comparative advantages and disadvantages of both approaches are summarized in the table below.

| Properties | Font algorithms | Non-font algorithms |
|---|---|---|
| Versatility | Low degree of universality, due to the need for prior training in everything that is presented for recognition | A large degree of universality due to the independence of the training set from any system of a priori classification of symbols |
| Recognition accuracy | High, due to the detailed classification of characters in the learning process, and also the fact that the recognition material is strictly within the classes created in the learning process | Low (compared to font algorithms), due to the high degree of generalization and coarse measurements of character characteristics |
| Manufacturability | Low (compared to fontless algorithms) due to various overheads associated with supporting character classification | High due to the absence of any prior character classification system |
| Support of the recognition process by the user | Required: at the training stage, to set the classification system; at the recognition stage, to indicate specific classes of characters | Not required |

Comparing the two approaches shows that it makes sense to combine them. The purpose of the merger is obvious: to obtain a method that combines the versatility and manufacturability of the fontless approach with the high accuracy of font-dependent recognition.

The following ideas and facts served as prerequisites for research in this direction. Any character recognition algorithm becomes applicable in practice at a recognition quality of 94-99%. Squeezing out the last percent, i.e. the final refinement of the algorithm, is always time-consuming and expensive work. Within character recognition, every algorithm has a specific area of application for which it was designed and in which it performs best. In general, the way to increase recognition quality lies not in inventing a super-intelligent algorithm that replaces all the others, but in combining several algorithms, each of which is simple in itself and has an efficient computational procedure. When combining different algorithms, it is important that they rely on independent sources of information about the symbols. If two algorithms work on strongly correlated data, combining them increases the total error instead of improving recognition quality.

On the other hand, knowledge of the characters already recognized should be accumulated and used in the subsequent steps of the recognition process. Moreover, as the final criterion, one can use an exact font-dependent algorithm whose base of characteristics is built on the fly from the results of the previous recognition steps. A method with this property will be called adaptive recognition, since it uses dynamic tuning (adaptation) to the specific input characters.
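The adaptive scheme can be sketched as a two-pass procedure. This is only an illustration under stated assumptions, not the method of any particular OCR system: the function `adaptive_recognize`, the confidence threshold, the toy classifier `toy_generic` and the Hamming distance are all hypothetical names and choices introduced here.

```python
def adaptive_recognize(glyphs, generic_classify, distance, confident=0.9):
    """Two-pass recognition with a font base built on the fly.

    Pass 1: a font-independent classifier returns (label, confidence)
            for every glyph.
    Pass 2: confidently recognized glyphs become reference templates for
            this document's font; the remaining glyphs are relabeled by
            the nearest template.
    """
    first_pass = [generic_classify(g) for g in glyphs]

    # Build the template base from confident first-pass results.
    templates = {}
    for glyph, (label, conf) in zip(glyphs, first_pass):
        if conf >= confident:
            templates.setdefault(label, []).append(glyph)

    result = []
    for glyph, (label, conf) in zip(glyphs, first_pass):
        if conf < confident and templates:
            # Font-dependent step: pick the label of the closest template.
            label = min(
                templates,
                key=lambda lab: min(distance(glyph, t) for t in templates[lab]),
            )
        result.append(label)
    return result


def toy_generic(glyph):
    """Assumed font-independent rule: decide by the total amount of ink."""
    ink = sum(glyph)
    label = "b" if ink >= 3 else "a"
    conf = min(1.0, 0.5 + abs(ink - 3) / 2)  # low confidence near the boundary
    return label, conf


def hamming(a, b):
    """Number of positions in which two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


glyphs = [
    (1, 1, 0, 0, 0, 0),  # clean "a"
    (0, 0, 1, 1, 1, 1),  # clean "b"
    (1, 1, 0, 0, 0, 1),  # "a" with one pixel of noise
]
print([toy_generic(g)[0] for g in glyphs])
print(adaptive_recognize(glyphs, toy_generic, hamming))
```

In this toy run the coarse first pass mislabels the noisy glyph, while the template built from the clean glyphs of the same document corrects it. The two steps also use different information (total ink versus pixel positions), in line with the requirement that the combined algorithms rely on independent sources.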
created: 2014-09-22
updated: 2021-03-13