
20.2. System R* Distributed Database Management System

Lecture



The main objective of the project can be formulated as follows: to provide means for integrating local System R databases located at the nodes of a computer network, so that a user working at any node of the network has access to all these databases as if they were centralized. The system was to provide:

  • ease of use of the system;
  • the possibility of autonomous node operation when network connectivity is violated or when administrative needs dictate it;
  • a high degree of efficiency.

To solve these problems, a number of design decisions were required concerning decomposition of the original query, optimal choice of the query execution method, coordinated execution of transactions, synchronization, detection and resolution of distributed deadlocks, and recovery of database state after various kinds of node failures.

Ease of use is achieved by keeping System R* users (application developers and end users) in the SQL language environment, i.e., they can continue to work under the same external conditions as in System R (both SQL/DS and DB2). The ability to use SQL rests on the location transparency provided by System R*. The system automatically determines the current location of the data objects referenced in a query; the same application program containing SQL statements can be executed at different nodes of the network. At each node, at query compilation time, the most efficient execution plan is selected in accordance with the distribution of data in the system.

A great deal of attention is paid in System R* to the autonomy of network nodes. Each local database is administered independently: new users can be connected autonomously, the version of the local part of the system can be changed independently, and so on. The system is designed so that it requires neither a centralized object naming service nor a centralized deadlock detection service. Individual nodes need no global knowledge of operations performed at other nodes; work with the available databases can continue when individual nodes or communication lines fail.

A high degree of efficiency is one of the key requirements for distributed database management systems in general and for System R* in particular. To achieve it, two basic techniques are used.

First, as in System R, query execution in System R* is preceded by compilation. During this process, the database object names used in the query are looked up in the distributed catalog and replaced with internal identifiers; the access rights of the user on whose behalf compilation is performed are checked; and the most efficient global query execution plan is selected, which is then decomposed and sent in parts to the corresponding network nodes, where optimal local execution plans for the query components are selected and access modules are generated in machine code. As a result, many actions are performed at compilation time, before the query is actually executed. An application program containing SQL statements that has been processed by the System R* precompiler can subsequently be executed many times without additional overhead. The use of the distributed catalog, distributed compilation, and query optimization are the most interesting and original aspects of the System R* project.

The second means of improving efficiency is the ability to move remote relations into the local database. The SQL dialect used in System R* includes a MIGRATE TABLE statement, which transfers the specified relation to the local database. This tool, placed at the users' disposal, can in some cases help transactions execute more efficiently. Naturally, as with all operations, the MIGRATE operation on a given relation is available not to every user, but only to those who hold the corresponding privilege.
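
For illustration, a migration might be requested with a statement of the following form, issued at the node that is to become the new home of the relation. This is only a sketch: the text above fixes only the MIGRATE TABLE keywords, and the table name is hypothetical.

  MIGRATE TABLE SMITH.EMPLOYEES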

Before proceeding to a more detailed presentation of the most interesting aspects of the implementation of System R*, we mention some facilities that the developers intended to implement at the initial stage of the project but that were never implemented (and some of them, apparently, never will be). The system was supposed to include means of horizontal and vertical partitioning of distributed database relations, means of replicating relations at several nodes with support for copy consistency, and means of maintaining snapshots of database state in accordance with a given query.

For specifying horizontal partitioning of relations, a construct of the following form was introduced into SQL:

  DISTRIBUTE TABLE <table-name> HORIZONTALLY INTO
    <name> WHERE <predicate> IN SEGMENT <segment-name> <site>
    ...
    <name> WHERE <predicate> IN SEGMENT <segment-name> <site>

When a statement of this kind was executed, the specified relation was divided into several sub-relations containing the tuples satisfying the corresponding predicates of the WHERE clauses, and each sub-relation obtained in this way was sent to the specified node for storage in the segment with the specified name. The system was to guarantee a consistent state of the partitions when the relation was updated.
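
As a purely illustrative instance of this construct (which, as noted below, was never actually implemented; all table, segment, and site names here are invented), an EMPLOYEES relation might be split by region as follows:

  DISTRIBUTE TABLE EMPLOYEES HORIZONTALLY INTO
    EMP_EAST WHERE REGION = 'EAST' IN SEGMENT SEGA SITE1
    EMP_WEST WHERE REGION = 'WEST' IN SEGMENT SEGB SITE2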

Vertical partitioning was to be specified with the statement

  DISTRIBUTE TABLE <table-name> VERTICALLY INTO
    <name> WHERE <column-name-list> IN SEGMENT <segment-name> <site>
    ...
    <name> WHERE <column-name-list> IN SEGMENT <segment-name> <site>

When such a statement was executed, a set of sub-relations was likewise formed, by projecting the given relation onto the attributes of the given lists. Each resulting sub-relation was then sent to the corresponding node for storage in the segment with the specified name. After that, the system was to be responsible for maintaining a consistent state of the partitions thus formed.
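
A similarly hypothetical vertical example (invented names again; presumably each column list would include the relation key so that the partitions could be rejoined):

  DISTRIBUTE TABLE EMPLOYEES VERTICALLY INTO
    EMP_DEPT WHERE EMPNO, NAME, DEPT IN SEGMENT SEGA SITE1
    EMP_PAY WHERE EMPNO, SALARY IN SEGMENT SEGB SITE2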

Horizontal and vertical partitioning of relations is not actually used in System R*, although implementing the DISTRIBUTE statement itself poses no technical difficulty. The difficulties lie in maintaining consistency of the partitions (see below). In addition, partitioned relations are very hard to use effectively. In accordance with the ideology of the system, the optimizer would have to take into account the presence of partitions at different nodes of the network, which increases still further the number of potential query plans to be evaluated. Given that the number of possible plans in a distributed system is already very large and the optimizer works at the limit of manageable complexity, the developers of the System R* optimizer were unable to take partitioning into account; without optimizer support, partitioned relations cannot be used in any reasonable way, so introducing them into the system remains pointless.

To specify the requirement to maintain copies of a relation at several nodes of the network, a new SQL construct was proposed:

  DISTRIBUTE TABLE <table-name> REPLICATED INTO
    <name> IN SEGMENT <segment-name> <site>
    ...
    <name> IN SEGMENT <segment-name> <site>

When such a statement was executed, copies of the specified relation were to be sent for storage to the named segments of the specified network nodes. The system was to maintain consistency of the copies automatically.
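
Again purely for illustration (invented names; the construct was not implemented), full copies of a PARTS relation might be requested at two nodes:

  DISTRIBUTE TABLE PARTS REPLICATED INTO
    PARTS1 IN SEGMENT SEGA SITE1
    PARTS2 IN SEGMENT SEGB SITE2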

As in the case of partitioned relations, besides the considerable problems of maintaining copy consistency, there is the problem of rational use of the copies, whose presence would have to be taken into account by the optimizer.

Creation of a snapshot of the database state in accordance with a given retrieval query was to be performed with a new SQL construct:

  DEFINE SNAPSHOT <snapshot-name> (<attribute-list>)
    AS <query>
    REFRESHED EVERY <period> 

When this statement is executed, the retrieval query specified in it is actually executed, and the resulting relation is stored under the specified name in the local database of the node where the statement is executed. After that, the snapshot is periodically refreshed according to the remembered query.

A snapshot can also be refreshed without waiting for the interval specified in its definition to expire, by executing the statement REFRESH SNAPSHOT <snapshot-name>.

Sensible use of snapshots is more realistic than use of partitioned and replicated relations, since a snapshot can in a sense be regarded as a materialized database view: the name of a snapshot can be used directly in retrieval queries wherever the names of base relations or views may appear. Big problems are associated with updating relations through their snapshots, because at the moment of the update the contents of the snapshot may differ from the current contents of the base relation.
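
Putting the snapshot constructs together in one hedged sketch (relation, attribute, and snapshot names are invented, and the exact form of the <period> argument is not given in the text):

  DEFINE SNAPSHOT HIGHPAID (NAME, SALARY)
    AS SELECT NAME, SALARY
       FROM EMPLOYEES
       WHERE SALARY > 50000
    REFRESHED EVERY DAY

  SELECT NAME FROM HIGHPAID

  REFRESH SNAPSHOT HIGHPAID

The first statement materializes the query result locally and schedules its periodic refresh; the second queries the snapshot exactly as if it were a base relation; the third forces an immediate refresh.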

For snapshots, the problem of keeping the snapshot and the base relations consistent does not require an automatic solution. For partitioned and replicated relations, this problem is intrinsic and rather difficult. First, reconciling partitions and copies entails significant overhead when modification operations are performed on stored relations; special modification protocols have to be developed and observed.

Secondly, replicated relations are usually introduced not so much to increase system efficiency as to increase data availability when network connectivity is broken. In systems taking this approach, work with the distributed database after a network failure usually continues in only one of the resulting subnets. Voting algorithms are used to select that subnet; the decision is made on the basis of the number of connected nodes. Other approaches are also used, but all of them are very expensive and, most importantly, agree poorly with the basic System R* approach of choosing the query execution method at compilation time. Therefore, it seems to us that System R* will never include facilities for maintaining copies of relations at several nodes of the network.

Next, we consider those aspects of the System R* project that are reflected in its implementation and are, in our opinion, the most interesting: the means of naming objects and organizing the distributed database catalog; the approach to distributed compilation and execution of queries; the peculiarities of using views; the query optimization tools; the transaction management features; and the synchronization facilities together with the distributed deadlock detection algorithm.

20.2.1. Object naming and distributed directory organization

First of all, recall that the full name of a relation (base or view) in a System R database has the form user-name.relation-name, where user-name identifies the user who created the relation, and relation-name is the name specified in the CREATE TABLE or CREATE VIEW statement. A query may use either this full name of a relation or its local name alone. In the latter case, the standard rule for completing the local name to a full name is applied at compilation time, taking as the user-name component the identifier of the user on whose behalf compilation is performed.
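
For example (names invented), if user SMITH created a relation EMPLOYEES, the following two queries are equivalent when compiled on behalf of SMITH:

  SELECT * FROM SMITH.EMPLOYEES

  SELECT * FROM EMPLOYEES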

System R* develops this approach further. The system name of a relation consists of four components: the identifier of the user who created the relation; the identifier of the network node at which the creation operation was performed; the local name assigned to the relation at creation; and the identifier of the node at which the relation was stored immediately after its creation (recall that a relation can move from one node to another via the MIGRATE operation).

An SQL query may use system object names, but short local names (or a local name qualified by a user name) are also allowed. In this case two interpretations of a local name are possible. It can be interpreted as part of a system name; by default it is then completed to a full system name using the identifier of the node at which compilation is performed and the name of the user on whose behalf it is performed (unless the user name is specified explicitly). The second possible interpretation is to treat the local name as the name of a previously defined synonym for a system name.

To define synonyms, SQL is extended with a statement of the form

  DEFINE SYNONYM <relation-name> AS <system-wide-name>

When such a statement is executed, the corresponding information is entered into the local catalog.
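
The text does not give a concrete notation for writing a four-component system name, so the following sketch assumes a user@node.name@node rendering: user SMITH, working at node SITE1, created the relation EMPLOYEES, which was initially stored at node SITE2, and a short synonym is defined for it:

  DEFINE SYNONYM EMPLOYEES AS SMITH@SITE1.EMPLOYEES@SITE2

After this statement executes, local queries can refer simply to EMPLOYEES, and the local catalog resolves the synonym to the full system name.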

Thus, when compiling a query, it is always possible to determine the system names of all relations used in it: either they are specified explicitly, or they can be obtained from information in the local catalog relations.

The concept of the System R* distributed catalog is based on each distributed database object having a unique system name. The following convention is adopted: information about the location of any database object (the identifier of the current node at which the object is stored) is kept in the local catalog of the node at which the object resided immediately after creation (its birth node).

Therefore, to obtain complete information about a relation, one must in general first consult the local catalog of the node at which compilation takes place, then the remote catalog of the relation's birth node, and finally the catalog of the node where the relation currently resides. Thus, obtaining exact system information about any relation of the distributed database requires at most two remote accesses to catalog relations.

Some optimization of this procedure is applied. Copies of catalog entries of other nodes (a kind of catalog cache) may be stored in the local catalog. Consistency of these copies is not maintained. The cached information is used in the first stage of query compilation (we consider distributed compilation in the next subsection); then, at the second stage, if the information about some object proves to be inaccurate, it is corrected from the local catalog of the node where the object is currently stored. Staleness of a cached catalog entry is detected by means of the version number carried by every catalog entry. Since system information changes rather slowly, this optimization can be significant.

20.2.2. Distributed query compilation

As we have already noted, SQL queries are compiled before their actual execution. As in System R, a query can be compiled either at the stage of precompiling an application program written in a conventional programming language (PL/1, Cobol, assembler) with embedded SQL statements, or dynamically during transaction execution by means of the PREPARE statement. From the users' point of view, the compilation process in System R* yields the same results as in System R: for each SQL statement a machine-code program (a section of the access module) is produced, and calls to it are placed in the text of the original application program.

In reality, however, query compilation in System R* is considerably more complicated than in System R, which is natural given the much more complex network interactions required during actual execution. Distributed query compilation in System R* involves many technical details and subtleties. We will not cover them all here, for reasons of incomplete information and limited space; we consider only the general scheme of distributed compilation.

We will call the network node at which compilation of an SQL statement was initiated the master node, and the nodes drawn into the process during its execution the additional nodes. At the coarsest level, the compilation process can be divided into the following phases:

  1. The master node parses the SQL statement and builds an internal representation of the query in the form of a tree. Using information from the local catalog of the master node and the remote catalogs of the additional nodes, the names of the objects appearing in the query are replaced by their system identifiers.
  2. A global query execution plan is generated at the master node; it determines only the order of inter-node interactions during the actual execution of the query. The global plan is produced with the optimization techniques used in System R and is reflected in a correspondingly transformed query tree.
  3. If additional nodes participate in the global execution plan, the query is decomposed into parts, each of which can be executed at a single node (for example, local filtering of a relation by the restriction predicate of the selection condition). The corresponding parts of the query (in internal representation) are sent to the additional nodes.
  4. At each node participating in the global execution plan (the master node and the additional nodes), the final stage of compilation is performed. This stage essentially comprises the last two phases of query compilation in System R: optimization and machine-code generation. The access rights of the user on whose behalf compilation is performed are checked; database views are processed (there are subtleties here connected with the fact that views may involve remote relations; we return to this below, and for now assume that the query uses only names of base relations); the processed part of the query is locally optimized with respect to the available indexes; finally, machine code is generated.

20.2.3. Transaction Management and Synchronization

Transaction execution in the System R* distributed database management system is naturally distributed. A transaction begins at the master node when some section of an access module prepared earlier (at compilation time) is invoked. As in System R, the access module is loaded into the task's virtual memory, and invoking a section of the access module is a subroutine call. Unlike in System R, however, this subroutine contains, besides its local code and RSS calls, calls to remote subsections of the access module. These calls are interpreted in the spirit of remote procedure calls. Thus the execution of a single transaction initiated at some network node A entails, generally speaking, the initiation of transactions at additional nodes. The main problem new relative to System R is that of consistent termination of a distributed transaction, so that the results of its execution at all affected nodes are either all reflected in the states of the local databases or entirely absent.

To achieve this goal, System R* uses a two-phase protocol for terminating distributed transactions. This protocol is now commonly used in distributed database systems and is described in many publications, so we describe it here only briefly and informally.

The protocol is described in terms of the following model. There is a set of independent participant transactions of a distributed transaction, executing under the control of a coordinator transaction. The decision to terminate the distributed transaction is made by the coordinator. After that, the first phase of termination is performed: the coordinator sends each participant a "prepare to commit" message. On receiving this message, each participant brings itself into a state of readiness both for immediate commit and for rollback. In System R* terms, this means that the log buffer containing the records of the participant's database changes is flushed to external storage, but the synchronization locks are not released. Each participant that successfully completes these preparatory actions then sends the coordinator a "ready to commit" message. If the coordinator receives such messages from all participants, it begins the second phase, sending all participants a "commit transaction" message; this counts as the commit of the distributed transaction. If not all participants successfully complete the first phase, the coordinator sends all participants a "roll back transaction" message, and the distributed transaction then has no effect on the state of the databases.

Regarding the implementation of the two-phase commit protocol in System R*, we note the following. The coordinator is the transaction running at the master node, i.e., the one that initiated the additional transactions; thus no central coordinating node is required, which agrees with the requirement of node autonomy. For transaction rollback, the System R savepoint mechanism is used. Finally, the classical two-phase protocol is optimized in System R* to reduce the number of required messages.

As in System R, consistency of the database state under parallel execution of several transactions in System R* is ensured by the mechanism of synchronization locks on database objects, observed under the two-phase locking protocol. Recall that this means dividing each transaction, from the synchronization point of view, into two phases: a working phase, during which locks are only acquired, and a termination phase, during which all locks on database objects acquired by the transaction are released. Synchronization is performed exactly as in System R: each participant transaction accesses its local database through the RSS of its node. The main new problem is that of possible distributed deadlocks, which can arise between several distributed transactions executing in parallel. (Deadlocks between participants of one distributed transaction are impossible, since all participants receive one common transaction identifier and do not conflict with each other in synchronization.) To detect distributed synchronization deadlocks, System R* uses an original distributed algorithm that does not violate the autonomy of network nodes and minimizes both the number of messages sent over the network and the required processing.

The basic idea of the algorithm is that each node periodically analyzes, for the existence of a deadlock, the information about wait-for relationships between transactions that is local to the node together with information received from other nodes. This analysis reveals either wait cycles, which indicate a deadlock, or potential cycles that must be verified at other nodes. These potential cycles are represented as strings of a special form. A string is essentially a list of transactions, ordered by the values of their identifiers ("transaction numbers"). A string is forwarded for further analysis to the next node (the node at which the rightmost transaction of the string is executing) only if the number of the first transaction in the string is less than the number of the last one. (This is an optimization that reduces the number of messages sent over the network.) The process continues until a deadlock is detected.

If a synchronization deadlock is detected, it is broken by destroying (rolling back) one of the transactions in the cycle. The victim chosen is the transaction that has performed the least amount of work up to that moment; this information is also transmitted over the network together with the strings describing the wait-for relationships between transactions.

