11: Automating the transformation of structured texts

Lecture



Attention!

This lecture is intended for those who already have basic skills in the XML 1 markup language.

Requirements for automatic translation of tables

Serious disadvantage of the proposed   11: Automating the transformation of structured texts 10.3 solving the problem of automatic conversion of transition tables into programs is associated with the idea of ​​preprocessing, which is convenient for processing the representation of transition tables. Ignoring the feedback between the original representation of the automaton and its interpreted representation raises problems. If you do not consider the development of the program, then you do not need to track this connection. But as soon as the question arises at least about the minimum alterations, problems arise.

  • Making changes to the table does not entail an automatic change to the internal representation, and possible violations of agreements related to the original decision may make reuse meaningless. This is especially noticeable if you pay attention to the fact that there are no state names in the resulting table.
  • Search and diagnostics of errors (it is not only about the syntax) are possible only at the level of the interpreted representation, which contradicts the understanding of the program in previous terms.
  • The presence of a program that does not depend on the original representation provokes its modifications that will not be transferred to the original representation. As a result, the conceptual and implementation models diverge among themselves.
  • Thus, the proposed solution is not technologically advanced. It can be extended to other situations only for a limited class of tasks.

The development of technology includes the creation and use of tools to support all the stages that make up the technological chain of operating with tables. The key point is the choice of the presentation format of the tables, adapted to both manual and automatic processing. It makes sense to try to find a universal solution. If, in addition, it is standardized, it will be possible to create technology. Today, the prerequisites for such a solution exist, in particular, the use of so-called markup languages ​​and, first of all, XML.

Formulation of the problem

When discussing the program 10.3.1, it was shown that the proposed text is marked with the symbols _i, i = 1, ..., 5 as well as how to move from the marked text to a program in C ++ or to the interpretation of the table. This markup has only one drawback: it is not standardized, and therefore there is no reason to talk about extending beyond the environment of the developers of the system designed to support the use of tables.

The following markup languages ​​are currently used, which can be considered standard:

  • T E X and, above all, the superstructure above it L A T E X - for typographical preparation of texts.
  • HTML is a standard markup language on the Internet.
  • XML is an online markup language focused on active work with them.

Their main purpose is a structured presentation of information visualization. From the point of view of automatic translation of tables, it is advisable to discuss the algorithmic aspect of these languages, i.e. to answer the extent to which the text markup allows us to describe the required transformations. But first we will ask questions about whether markup languages ​​are legitimate to consider programming languages, and if so, what programming styles they can in principle support.

The construction of various kinds of visualization of texts and related information is the defining function of markup languages, which is implemented by placing markup commands together with the presented information. These commands are control signals for the calculator, which is responsible for visualization. It is natural to consider the internal joint presentation of commands and data for their execution as an operating and at the same time information environment of the calculator-visualizer. A markup language from a programming language is primarily distinguished by the combination of data and tasks for their processing .

Example 11.2.1 . The text of this example in the markup language L A T E X (more precisely, in the package of definitions above it, which is designed for the layout of this text and other works of the author) for the automatic placement of the title, the creation of an identifier, on which you can further refer to Example 11.2. 1 , formatting the body of the sample and practicing its end, surrounded by commands

  \ begin {example} \ label {markupeample0}
 and \ end {example}. 

In order to print the first of these commands, I had to apply additional markup (so that the command was interpreted as part of the text)

  \ verb | \ begin {example} \ label {markupeample0} | \\. 

And, finally, the previous line also required additional markup, but in order to avoid infinite regress, we’ll end there.

For markup languages ​​used on the Internet, the browser serves as a visualizer. But visualization is far from the only calculation that can be linked to a browser or other program that processes tagged text. Above the information you can set different generalized calculations . These calculations can either be combined in one process, or separated in time, which is not fundamental. In relation to the numeric markup used in   11: Automating the transformation of structured texts 10.3 to set the state machine table, it is legitimate to interpret the markup as commands of three types of calculations:

  • table visualization - interpretation of markup symbols as commands prescribing the relative position of text fragments for display and printing;
  • translation of the table - automatic construction of a program that performs the actions of the machine;
  • table interpretation - interpretation of markup symbols as commands that perform actions with operands, which use text fragments.

Similar calculations can be defined for any markup language, considering suitable abstract calculators. In other words, there is markup language programming. This concept is most fully traced in the XML / XSL technology.

If we turn to the programming style, which in principle can support a markup language, then this is a purely programming. A typical type of computation of this style is not just the generation of new data with the preservation of operands, but global data transformations that depend essentially on the context. Note that the opposite is true at the moment: any sentence language can be considered as a markup language of the processed data. Sometimes such markup is explicitly placed on the data structure (for example, Refal language), sometimes it is placed in a special structure (Prolog), but in both cases the context defining command execution and data replacement by generating command results remain. In the development of markup languages, there is a tendency to separate the recognition of the structure of the text - the sentence part of the processing - from using this structure to specify actions that define generalized calculations 2 .

The solution of the problem of automatic transformation of the state machine tables using the XML language most suitable for this purpose is described below (see the book [19]).

If we describe texts in modern markup languages, such as L A T E X [15] or XML, the problem arises of describing and programming the transformations of such texts. The solution to this problem can be specialized text transformation languages ​​or corresponding programming techniques supported by automated conversion of specifications to a program. Here we look at the methodology based on automata.

Let us apply the capabilities of the XML / XSL system to our task: the description of the state machine (the transition table 9.1 is taken as the basis).

  <? xml version = "1.0" encoding = "windows-1251"?>
 <automat name = "Test">
   <action> <! [CDATA [char symbol;  int cnt;]]>
   </ action>
   <ref> St1 </ ref>
   <state name = "St1">
   <if>
     <condition> <! [CDATA ['a' <= symbol & symbol <= 'z']]>
     </ condition>
     <action> <! [CDATA [printf ("% c", symbol);  cnt = 1;]]>
     </ action>
     <ref> St2 </ ref>
   </ if>
   <eif>
     <condition> <! [CDATA [symbol! = '\ n']]>
     </ condition>
     <action> <! [CDATA [/ * Since you only need to type words,
                                       actions are not filled * /]]>
     </ action>
     <ref> St1 </ ref>
   </ eif>
   <eif>
     <condition> <! [CDATA [symbol == '\ n']]>
     </ condition>
     <action> <! [CDATA [/ * Since you only need to type words,
                                       actions are not filled * /]]>
     </ action>
     <ref> St3 </ ref>
   </ eif>
   </ state>
     <state name = "St2">
     <if>
       <condition> <! [CDATA ['a' <= symbol & symbol <= 'z']]>
       </ condition>
       <action> <! [CDATA [printf ("% c", symbol);  cnt ++;]]>
       </ action>
       <ref> St2 </ ref>
     </ if>
     <eif>
       <condition> <! [CDATA [symbol! = '\ n']]>
       </ condition>
       <action> <! [CDATA [printf ("-% i \ n", cnt);]]>
       </ action>
       <ref> St1 </ ref>
     </ eif>
     <eif>
       <condition> <! [CDATA [symbol == '\ n']]>
       </ condition>
       <action> <! [CDATA [printf ("-% i \ n", cnt);]]>
       </ action>
       <ref> St3 </ ref>
     </ eif>
   </ state>
   <state name = "St3">
     <if>
       <condition> <! [CDATA ['a' <= symbol & symbol <= 'z']]>
       </ condition>
       <action> <! [CDATA [printf ("% c", symbol);  cnt = 1;]]>
       </ action>
       <ref> St2 </ ref>
     </ if>
     <eif>
       <condition> <! [CDATA [symbol! = '\ n']]> </ condition>
       <action> <! [CDATA [/ * Since you only need to type words,
                                         actions are not filled * /]]>
       </ action>
       <ref> St1 </ ref>
     </ eif>
     <eif>
       <condition> <! [CDATA [symbol == '\ n']]>
       </ condition>
       <action> <! [CDATA [/ * The transition does not require reading,
                                         symbol == '\ n' do not need to read * /]]>
       </ action>
       <ref> Exit </ ref>
     </ eif>
   </ state>
 </ automat> 
Listing 11.2.1. Solution of the problem of automatic transformation of the state machine tables

This description reflects the following properties of our state machine (the basis for the transition table 9.1 is taken):

  • The state machine (<automat> tag) includes <state> tags — a list of all states in any order, as well as <action> tags - action and <ref> - link to the initial state 3 .
  • Each <state> contains a name attribute so that it can be referenced, and a set of conditions: one <if> tag and any number of <eif> tags (meaning " else if "). These two tags could also be replaced by one universal one, especially since the structure of their descendants does not differ, but again this was not done for reasons of ease of formatting.
  • Each <if> or <eif> includes three tags: <condition> is the condition itself and the tags already described <action> and <ref> - the action when the condition is met and a link to the next state.
  • The <action> and <condition> tags contain a special <! [CDATA [...]]> tag to include strings in C ++.

The description does not include the Exit state. This part is the same for different automata, and therefore it is impractical to describe it explicitly.

This description serves as the basis for constructing various visualizations of an automaton using XSL. For example, it’s quite simple to build a table visualization, which is almost the same as the previously compiled table:

  <? xml version = "1.0"?>
 <xsl: stylesheet xmlns: xsl = "http://www.w3.org/TR/WD-xsl">
   <xsl: template match = "/">
     <DIV STYLE = "font-family: Courier; font-size: 12pt">
       <xsl: for-each select = "automat">
         <TABLE border = "1" width = "75%">
           <Tr>
             <TD colspan = "3"> <xsl: value-of select = "action" /> </ TD>
             <TD width = "10%"> <xsl: value-of select = "ref" /> </ TD>
           </ Tr>
           <xsl: for-each select = "state">
             <Tr>
               <td rowspan = "3" width = "10%" valign = "top">
               <xsl: value-of select = "@ name" /> </ td>
               <td> <xsl: value-of select = "if / condition" /> </ td>
               <td> <xsl: value-of select = "if / action" /> </ td>
               <td> <xsl: value-of select = "if / ref" /> </ td>
             </ Tr>
             <xsl: for-each select = "eif">
               <Tr>
                 <Td> <xsl: value-of select = "condition" /> </ td>
                 <Td> <xsl: value-of select = "action" /> </ td>
                 <Td> <xsl: value-of select = "ref" /> </ td>
               </ Tr>
             </ xsl: for-each>
           </ xsl: for-each>
         </ TABLE>
       </ xsl: for-each>
     </ Div>
   </ xsl: template>
 </ xsl: stylesheet> 
Listing 11.2.2. Solution of the problem of automatic transformation of the state machine tables

The next visualization is the automatic transformation of the XML framework into a C / C ++ program. It is worth paying attention to the fact that the resulting text is almost the same as in program 10.2.5. The reasons are deeper than ease of conversion. The existence of a local (without the use of contextual "= dependent information) description follows from the structure of the finite state machine that does not require anything outside the state to define the transition.

  <? xml version = "1.0"?>
 <xsl: stylesheet xmlns: xsl = "http://www.w3.org/TR/WD-xsl">
   <xsl: template match = "/">
     <blockquote>
       <pre STYLE = "font-family: Courier; font-size: 12pt;"  >
         <xsl: for-each select = "automat"> // C code for automat
           "<xsl: value-of select =" @ name "/>"
           <! [CDATA [
           #include <stdio.h>
           #define failure true
           void main (void)
           {]]>
           <xsl: value-of select = "action" />
             goto <xsl: value-of select = "ref" />; <br/>
           <xsl: for-each select = "state">
             <xsl: value-of select = "@ name" />: symbol = getchar ();
             <blockquote> if (<xsl: value-of select = "if / condition" />)
               <blockquote> {<xsl: value-of select = "if / action" />
                 goto <xsl: value-of select = "if / ref" />;  }
               </ blockquote>
               <xsl: for-each select = "eif">
                 else if (<xsl: value-of select = "condition" />)
               <blockquote> {<xsl: value-of select = "action" />
                 goto <xsl: value-of select = "ref" />;  }
               </ blockquote>
             </ xsl: for-each>
           </ blockquote>
           </ xsl: for-each>
           Exit: return; <br/>}
         </ xsl: for-each>
       </ pre>
     </ blockquote>
   </ xsl: template>
 </ xsl: stylesheet> 
Listing 11.2.3. Solution of the problem of automatic transformation of the state machine tables

As it is easy to see, the difference between this and previous style files is only that where graphic elements are inserted into the table for visualization, frames are displayed, parts of the C ++ / C # syntax and indents are placed for translation in order to make the program more readable.

So, using the XML data presentation format, we were able to automatically create programs for this task by simply replacing the style sheets!

We now note some more interesting features that we could use.

Although this technology makes it easy to create C ++ / C # programs, XML is the basis for them. To create a new machine, the programmer must at least know the syntax of this language. The next natural step is to exclude this link: it is required to hide the internal presentation of data from the end user, leaving only an intuitive, table-like representation. But for this, you first need to choose:

  • transformation table => XML representation and
  • tool for easy editing of tables.

It is natural to edit the tables in the same place where we have already learned how to generate them - in the browser window. Access to editing these tables can be provided by DOM (a standard implemented in Internet Explorer 5.0 and Netscape Navigator 6.0 browsers). Changes, additions, and other editing actions are pretty straightforward. For example, in Java script, adding a new cell to a table can be described as follows:

  var oRow;
 var oCell;

 oRow = oTable.insertRow ();
 oCell = oRow.insertCell ();
 oCell.innerHTML = "This cell is <b> new </ b>." 

Similarly, you can create tables, rows and cells in them. The implementation of the interface itself (buttons, selection tools, input fields) depends only on your imagination.

The same DOM, in the same way, can work with XML, replicating all the actions of the end user with the table in the XML representation for writing the latter as a finished file with the structure of the new finite state machine.

Another step in the development of the project using the XML language may be the formalization of the presentation: after all all the presentation tags, the rules of their attachment and the methods of setting are defined, a new language is obtained (by analogy with modern languages ​​built in the same way, we can call it automatML). As long as tags and XML elements are used solely for the sake of convenience for your own project (as if you used CSS on your homepage), it does not matter what you give these elements and tags names, the meaning of which differs from the standard one and is known only to you. . If you want to provide data to the outside world and receive information from other people, then this circumstance takes on immense importance. Elements and attributes should be used by you just like all other people, or at least you should document what you are doing.

To do this, you will need to use the document type definition (DTD). Stored at the beginning of the XML file or externally as a * .DTD file, these definitions describe the information structure of the document. DTDs list possible element names, determine the available attributes for each element type and the compatibility of some elements with others.

Each line in the document type definition can contain an element type declaration, name the element, and determine the type of data that the element can contain. It has the following form:

  <! ELEMENT element_name (data_type)> 

For example, the declaration

  <! ELEMENT action (#PCDATA)> 

defines an element with the name action containing character data (i.e., text). Declaration

  <! ELEMENT automat (state_1, state_2, state_3)> 

defines an element named automat that contains the sub-elements state_1, state _2, and state_3 in the specified order, for example:

  <automat>
 <state_1> XML: the time has come </ state_1>
 <state_2> XML exceeds itself </ state_2>
 <state_3> Network and System Management
          using XML </ state_3>
 </ automat> 

After defining DTD elements, attributes can also be defined using the! ATTLIST command. It indicates an element, names the attribute associated with it, and then describes its valid values. The! ATTLIST command allows you to manage attributes in many other ways: set defaults, suppress spaces, and so on. DTDs can contain! ENTITY declarations, which contain references to objects, and! NOTATION declarations, indicating what to do with binary files presented not in XML format.

A serious and somewhat surprising limitation of the DTD is that they do not allow data typing, that is, they limit the data to a specific format (such as a date, an integer or a floating-point number). As you probably noticed, DTDs use a different syntax than XML, and is not very intuitive. For the reasons mentioned, the DTDs will apparently be replaced by more powerful and easy-to-use circuits that are currently being worked on.


Comments


To leave a comment
If you have any suggestion, idea, thanks or comment, feel free to write. We really value feedback and are glad to hear your opinion.
To reply

Programming Styles and Techniques

Terms: Programming Styles and Techniques