Hashing in programming using the example of the C language

Lecture



Hashing (hashing) is the process of obtaining the index of an element of an array directly as a result of operations performed on a key that is stored with the element or even coincides with it. The generated index is called a hash address (hash). Traditionally, hashing is applied to disk files as one of the means to reduce access time. However, this general method can also be applied to access sparse arrays. In the previous example with an array of pointers, a special form of hashing was used, which is called direct addressing . In it, each key corresponds to one and only one cell of the array [1] . In other words, each index calculated by hashing is unique. (When representing a sparse array as an array of pointers, the hash function does not have to implement direct addressing — it was just an obvious approach to implementing a spreadsheet.) In real life, direct hashing schemes are rare; usually a more flexible method is required. This section shows how you can generalize the hashing method, giving it greater power and flexibility.

In the example with a spreadsheet, it is clear that even in the most difficult cases, not all table cells are used. Suppose that in almost all cases the actually occupied cells make up no more than 10 percent of the potentially available places. This means that if the table has a size of 260x100 (2`600 cells), only about 260 cells will be used at any given time. This implies that the largest array that will be needed to store all occupied cells will normally consist of only 260 elements. But how do cells of a logical array match this smaller physical array? And what happens when this array is full? Below is one of the possible solutions.

When a user enters data in a cell of a spreadsheet (i.e., fills an element of a logical array), the position of the cell, determined by its name, is used to obtain an index (hash address) in a smaller physical array. When hashing is performed, the physical array is also called the primary array . The index in the primary array is obtained from the cell name, which is converted to a number, just as in the example with the array of pointers. But then this number is divided by 10, resulting in the initial entry point to the primary array. (Remember that in this case the size of the physical array is only 10% of the size of the logical array.) If the cell of the physical array is free at this index, the logical index and data are entered into it. But since 10 logical positions correspond to one physical position, collisions may occur when calculating hash addresses [2] . When this happens, the entries are stored in a linked list, sometimes called a collision list (collision list). Each cell of the primary array is associated with a separate list of collisions. Of course, before the occurrence of a collision, these lists have zero length, as shown in Fig. 23.3.

Fig. 23.3. Hash Example

  Pointer
                on
              list
     Conflict index
    + ------ + ---------- + + - + + - +
  0 |  A1 |  --------> | B1 | ---> | C1 |
    + ------ + ---------- + + - + + - +
  1 |  L1 |  --------> | P1 |
    + ------ + ---------- + + - +
  2 |  A2 |  NULL |
    + ------ + ---------- +
  3 |  M2 |  NULL |
    + ------ + ---------- +
  4 |  |  NULL |
    + ------ + ---------- + Order
    |  .  |  receipts:
    |  .  |  A1
    |  .  |  L1
    + ------ + ---------- + B1
 258 |  A1 |  NULL |  C1
    + ------ + ---------- + P1
 259 |  A1 |  NULL |  A2
    + ------ + ---------- + M2
     Primary array


Suppose you want to find an element in a physical array by its logical index. First, it is necessary to convert the logical index to the corresponding value of the hash address and check whether the logical index stored in the resulting position of the physical array matches the desired one. If yes, information can be retrieved. Otherwise, it is necessary to view the list of collisions for this position until the required logical index is found or until the end of the chain is reached.

The hash example uses an array of structures called primary :

  #define MAX 260

 struct htype {
   int index;  / * logical index * /
   int val;  / * actual value of the data element * /
   struct htype * next;  / * pointer to the next item with the same
                          hash address * /
 } primary [MAX];

Before using this array, you must initialize it. The following function sets the index field to -1 (a value that by definition will never be generated as an index); this value indicates an empty element. A NULL value in the next field corresponds to an empty hash chain [3] .

  / * Initialize the hash array.  * /
 void init (void)
 {
   register int i;

   for (i = 0; i 

The store () procedure converts the cell name to a hash address in the primary primary array. If the position pointed to by the hash address is occupied, the procedure automatically adds an entry to the list of collisions using the modified version of the slstore () function from the previous chapter. The logical index is also preserved, since it will be needed when retrieving the item. These features are shown below:

  / * Calculate hash address and store value.  * /
 void store (char * cell_name, int v)
 {
   int h, loc;
   struct htype * p;

   / * Getting the hash address * /
   loc = * cell_name - 'A';  / * column * /
   loc + = (atoi (& cell_name [1]) - 1) * 26;  / * string * width + column * /
   h = loc / 10;  / * hash * /

   / * Save to received position if it is not occupied.
      or if the logical indices are the same - that is, when updating.
   * /
   if (primary [h] .index == - 1 || primary [h] .index == loc) {
     primary [h] .index = loc;
     primary [h] .val = v;
     return;
   }

   / * otherwise, create a collision list
      or add to engo element * /
   p = (struct htype *) malloc (sizeof (struct htype));
   if (! p) {
     printf ("Out of memory \ n");
     return;
   }
   p-> index = loc;
   p-> val = v;
   slstore (p, & primary [h]);
 }

 / * Adding items to the collision list.  * /
 void slstore (struct htype * i,
              struct htype * start)
 {
   struct htype * old, * p;

   old = start;
   / * find the end of the list * /
   while (start) {
     old = start;
     start = start-> next;
   }
   / * associate with new entry * /
   old-> next = i;
   i-> next = NULL;
 }

In order to get the value of an element, the program must first calculate the hash address and check whether the logical index stored in the received position of the physical array coincides with the desired logical index. If it matches, it returns the value of the cell; otherwise, the search is performed in the list of collisions. The find () function that performs these tasks is shown below:

  / * Calculating the hash address and getting the value.  * /
 int find (char * cell_name)
 {
   int h, loc;
   struct htype * p;

   / * get hash value * /
   loc = * cell_name - 'A';  / * column * /
   loc + = (atoi (& cell_name [1]) - 1) * 26;  / * string * width + column * /
   h = loc / 10;

   / * return value if cell is found * /
   if (primary [h] .index == loc) return (primary [h] .val);
   else {/ * otherwise see a list of collisions * /
     p = primary [h] .next;
     while (p) {
       if (p-> index == loc) return p-> val;
       p = p-> next;
     }
     printf ("No cells in the array \ n");
     return -1;
   }
 }

Creating a delete function is left to the reader as an exercise. ( Hint : Just pay your insertion process.)

The hashing algorithm shown above is very simple. As a rule, in practice, more complex methods are used to ensure a more uniform distribution of indices in the primary array, which eliminates long hash chains. However, the basic principle remains the same.

Hash Method Analysis

At best (quite rare), each physical index calculated by the hash function is unique, and the access time is approximately equal to the access time for direct addressing. This means that collision lists are not created, and all sampling operations are essentially direct access operations. However, this is rarely the case, since for this it is required that the logical indices are evenly distributed in the space of physical indices. In the worst case (also rare), the hashing scheme degenerates into a linked list. This happens when the hash addresses of all logical indexes match. On average (and most likely), the access time for hashing is equal to the access time for direct addressing plus some constant proportional to the average length of the hashing chains. The most important factor in the implementation of sparse arrays by the method of hashing is the choice of such a hashing algorithm, in which physical indices are evenly distributed, thus avoiding the formation of long lists of collisions.

  Hashing in programming using the example of the C language

[1] In other words, the hash function is a bijection.

[2] That is, situations where the same hash address corresponds to different keys k 1 ≠ k 2 : h (k 1 ) = h (k 2 ) (here h is a hash function).

[3] The hash chain is a chain connecting the elements of a hash table with the same hash code. Earlier, the author called her a collision list (collision list). Sometimes it is also called an overflow package .


Comments


To leave a comment
If you have any suggestion, idea, thanks or comment, feel free to write. We really value feedback and are glad to hear your opinion.
To reply

Algorithms

Terms: Algorithms