7. KEY TRANSFORMATION (INSTALLATION)

Lecture



The method of arrangements ( hashing ) is aimed at quickly solving the problem of the location of an element in the data structure. In the arrangement method, the data is organized as a regular array. Therefore, H is a mapping that converts keys into array indices, hence the name "key conversion", usually given to this method. It should be noted that there is no need to refer to any dynamic allocation procedures, because an array is one of the basic, static objects. The key conversion method is often used for such tasks, where trees can be successfully used.

The main difficulty associated with the conversion of keys is that the set of possible values ​​is much wider than the set of valid memory addresses (array indices). Take as an example the names of up to 16 letters and representing keys that identify individuals from a set of one thousand people. Therefore, we are dealing with 26 16 possible keys that need to be mapped to 10 3 possible indexes. Therefore, the function H will be a function of the class “many values ​​in one value”. If some key k is given, then the first step of the search operation is the calculation of the associated index h = H (k), and the second (absolutely necessary) check is whether h identifies in the array (table) T an element with the key k.

7.1. The choice of conversion function



It goes without saying that any good conversion function should distribute keys as evenly as possible across the entire range of index values. If this requirement is satisfied, then there are no other restrictions, and it’s even good if the transformation looks like completely random. This feature explains the somewhat unscientific name of this method - “grinding” (hashing), that is, splitting the argument, turning it into some kind of “mess”. The function H will be called the “placement function”. It is clear that H must be calculated quite effectively, that is, it must consist of a very small number of basic arithmetic operations.

Imagine that there is a transformation function ORD (k), denoting the ordinal number of the key k in the set of all possible keys. In addition, we will assume that the index of the array i lies in the range 0 ... N - 1, where N is the size of the array. In this case, it is clear that you need to select the following function:


H (k) = ORD (k) MOD N


It is characterized by a uniform mapping of key values ​​over the entire range of variation of the indices, therefore, it forms the basis of most key transformations. In addition, with N equal to the degree of two, this function is effectively calculated. However, if the key is a sequence of letters, then it is from such a function that should be discarded. The fact is that in this case the assumption of equal probability of all keys is erroneous. As a result, words differing only in a few characters will most likely be displayed in the same index, which will result in a very uneven distribution. Therefore, in practice it is recommended to take a prime number as N. The consequence of this choice will be the need to use the full division operation, which can no longer be replaced by the allocation of several binary digits.

7.2. Algorithm



If it is found that the table row corresponding to the given key does not contain the desired element, then a conflict has occurred, that is, two elements have such keys that are mapped to the same index. In this situation, we need a second attempt with the index, quite definitely obtained from the same given key. There are several methods for the formation of a secondary index. An obvious technique is to link together all the rows with the identical primary index H (k). Turning them into a linked list. This technique is called direct chaining. Items in the resulting list can either be placed in the main table or not; in this case, the memory where they are located is usually called the overflow area. The disadvantage of this technique is that you need to keep track of such secondary lists and in each row, allocate space for the link (or index) to the corresponding list of conflicting elements.

Another method for resolving conflicts is that we completely refuse links and instead look at other rows in the same table - until we find the desired element or the empty row. In the latter case, we assume that the specified key is not in the table. This technique is called open addressing . Naturally, in the second attempt the sequence of indices must always be the same for any given key. In this case, the viewing algorithm is based on the following scheme:


h = h (k)

i = 0

repeat

if T (h) = k

then element   found

else if T (h) = free

then there is no element in the table

else { conflict }

i: = i + 1

h: = H (k) + G (i)

endif

endif

until either found or not in the table (or it is full)


Offered a variety of functions G (i) for conflict resolution. The review given in Morris (1968) stimulated vigorous activity in this direction. The simplest method is to look at the next row of the table (we will consider it circular), and so on until either an element with the specified key is found or an empty line is encountered. Therefore, in this case, G (i) == i, and the indices h i , used *** in subsequent attempts, are as follows:


h 0 : = H (k)

h i : = (h i + i) MOD N, i = 1 ... N - 1


This technique is called linear probes, its disadvantage is that the rows tend to be grouped around primary keys (i.e., keys for which there was no conflict when switching on). Of course, I would like to choose such a function G, which would again uniformly scatter the keys along the remaining rows. However, in practice this leads to too high costs, therefore some compromise methods are preferable; being simple enough in terms of computation, they are still better than a linear function. One of them is the use of a quadratic function, in this case the sequence of the indices being tried is as follows:


h 0 : = H (k)

h i : = (h i + i 2 ) MOD N, i> 0


If we use the recurrence relations for h i = i 2 and d i = 2i + l for h 0 = 0 and d 0 = l, then when calculating the next index we can do without the squaring operation.


h i + 1 = h i + d i

d i + 1 = d i + 2 (i> 0)


Such a technique is called quadratic samples, it is essential that it allows you to avoid grouping inherent in linear samples, without leading to almost additional calculations. Its small drawback is that when searching, not all rows of the table are tried, i.e., when you turn on an element, there may not be free space, although in fact it is. If the size N is a prime number, then at quadratic samples, at least half of the table is visible. Such a statement can be derived from the following reasoning. If the i-th and j-th tests lead to the same row of the table, then the equality


i 2 MOD N = j 2 MOD N


or


(i 2 - j 2 ) = 0 (MOD N)


By decomposing the difference into two factors, we get


(i + j) (i - j) = 0 (MOD N)


Since i  j, then either i or j must be greater than N / 2 in order for the equality i + j = cN to be valid, where c is an integer. In practice, this drawback is not so significant, since N / 2 secondary attempts are very rare in conflict resolution, mainly in cases where the table is almost full.


test questions


  1. What is the layout method for?

  2. What determines the position of an element in an array in the arrangement method?

  3. What function can be used as a hash function:

  4. What is called a conflict?

  5. Which method is the conflict resolution method.

  6. Which of the methods of conflict resolution allows you to more evenly distribute the elements of the array.

  7. Why is it impossible to use the function Trunc (sqrt (k 5 - i * tan (k))) MOD N as H (k)?

Comments


To leave a comment
If you have any suggestion, idea, thanks or comment, feel free to write. We really value feedback and are glad to hear your opinion.
To reply

Structures and data processing algorithms.

Terms: Structures and data processing algorithms.