Breaking up an expression into lexemes in the C language

Lecture



To evaluate expressions, you must be able to break them into their separate components. For example, the expression A * B - (W + 10) consists of the following elements: A, *, B, -, (, W, +, 10, and ). Each of them is a single indivisible part of the expression. In general, you need a function that returns the elements of the expression one at a time. This function should also skip spaces and tabs and detect the end of the expression.

Each element of the expression is called a token. The function that returns the next token is therefore often called get_token(). It uses a global pointer to the string that holds the expression being parsed. In the version of get_token() shown here, this global pointer is called prog. The variable prog is declared globally because it must keep its value between calls to get_token() and be accessible to other functions. Besides the text of the returned token, you also need to know its type. The analyzer developed in this chapter needs only three types: variable, number, and delimiter. They correspond to the constants VARIABLE, NUMBER, and DELIMITER. (DELIMITER is used both for operators and for parentheses.) The following is the text of get_token() along with the required global declarations, constants, and an auxiliary function:

  #include <ctype.h>
  #include <string.h>

  #define DELIMITER 1
  #define VARIABLE  2
  #define NUMBER    3

  extern char *prog;  /* pointer to the expression being analyzed */
  char token[80];
  char tok_type;

  int isdelim(char c);

  /* Stores the next token in token and its type in tok_type. */
  void get_token(void)
  {
    register char *temp;

    tok_type = 0;
    temp = token;
    *temp = '\0';

    if (!*prog) return;  /* end of expression */
    while (isspace((unsigned char) *prog)) ++prog;  /* skip spaces, tabs,
                                                       and newlines */
    /* Check again: the expression may end in whitespace, and strchr()
       below would otherwise match the terminating '\0'. */
    if (!*prog) return;

    if (strchr("+-*/%^=()", *prog)) {
      tok_type = DELIMITER;
      /* copy the character and advance to the next one */
      *temp++ = *prog++;
    }
    else if (isalpha((unsigned char) *prog)) {
      while (!isdelim(*prog)) *temp++ = *prog++;
      tok_type = VARIABLE;
    }
    else if (isdigit((unsigned char) *prog)) {
      while (!isdelim(*prog)) *temp++ = *prog++;
      tok_type = NUMBER;
    }

    *temp = '\0';
  }

  /* Returns 1 (true) if c is a delimiter, 0 otherwise. */
  int isdelim(char c)
  {
    if (strchr(" +-/*%^=()", c) || c == 9 || c == '\r' || c == 0)
      return 1;
    return 0;
  }

Let's look at these functions in more detail. After a few initializations, get_token() checks whether the null terminator ('\0') that ends the expression has been reached. If part of the expression is still unparsed, get_token() first skips any leading whitespace. After that, prog points to a number, a variable, an operator, or, if the expression ends in whitespace, to the null terminator ('\0'). If the next character is an operator, it is returned as a string in the global variable token, and tok_type, which holds the type of the token just read, is set to DELIMITER. If the next character is a letter, it is taken as the start of a variable name, which is returned in token with tok_type set to VARIABLE. If the next character is a digit, the whole number is read into token and its type is NUMBER. Finally, if the next character is none of the above, the end of the expression is considered to have been reached. In this case token contains an empty string, which signals the end of the expression to the caller.

As mentioned earlier, to keep the code of this function simple, some error checking was omitted and some assumptions were made. For example, any unrecognized character ends the expression. In addition, in this version of the program variable names can be of any length, but only the first letter is significant. Depending on the requirements of a specific task, you can add stricter error checking and other details. You can also modify or extend get_token() to extract character strings, other kinds of numbers, or other token types from the input expression.

To better understand how get_token() works, here are the tokens it returns, together with their types, for the following input expression:

  A + 100 - (B * C) / 2

  Token             Token type
  A                 VARIABLE
  +                 DELIMITER
  100               NUMBER
  -                 DELIMITER
  (                 DELIMITER
  B                 VARIABLE
  *                 DELIMITER
  C                 VARIABLE
  )                 DELIMITER
  /                 DELIMITER
  2                 NUMBER
  (empty string)    0 (end of expression)

Remember that token always contains a string terminated by the null character ('\0'), even if that string consists of only one character.

created: 2014-12-22
updated: 2021-03-13