Sorting disk files with random selection

Lecture



Disk files are of two types: with sequential sampling (access) and random sampling . If a disk file of any type is small enough, it can be read into memory and sorted by one of the array sorting procedures presented above. However, many disk files are too large to sort in memory, and therefore require special sorting techniques. Many database applications work with files with a random selection. This section shows one way to sort such files.

Disk files with random access have two big advantages over files with sequential selection. First, it's easy to work with them. To update the information you do not need to copy the entire list. Secondly, they can be considered as a very large array on disk, which greatly simplifies sorting.

The fact that a file with a random selection can be viewed as an array means that you can apply a quick sort to it with only a few changes. Instead of accessing the array elements by index, the disk quick sort should use the fseek () function to find the corresponding disk entry.

In each real situation, the sorting is determined by the specific structure of the data being sorted and the sorting key. Nevertheless, the general idea of ​​sorting disk files with random selection can be understood by the example of a short program sorting structures of the address type — the mail structures described earlier. This program, below, first creates a disk file containing unordered addresses. Then it sorts this file. The number of addresses to be sorted is indicated by the constant NUM_ELEMENTS (which is equal to 4 in this program). However, in real life, the number of entries will have to be tracked dynamically. Experiment with this program yourself, trying different types of structures containing various data:

  / * Disk sorting of address type structures.  * /
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #define NUM_ELEMENTS 4 / * Number of elements - arbitrary
                            the number to be determined
                            dynamically for each list.  * /

 struct address {
   char name [30];
   char street [40];
   char city [20];
   char state [3];
   char zip [11];
 } ainfo;

 struct address addrs [NUM_ELEMENTS] = {
   "A. Alexander", "101 1st St", "Olney", "Ga", "55555",
   "B. Bertrand", "22 2nd Ave", "Oakland", "Pa", "34232",
   "C. Carlisle", "33 3rd Blvd", "Ava", "Or", "92000",
   "D. Dodger", "4 Fourth Dr", "Fresno", "Mi", "45678"
 };

 void quick_disk (FILE * fp, int count);
 void qs_disk (FILE * fp, int left, int right);
 void swap_all_fields (FILE * fp, long i, long j);
 char * get_zip (FILE * fp, long rec);

 int main (void)
 {
   FILE * fp;

   / * first create a data file to be sorted * /
   if ((fp = fopen ("mlist", "wb")) == NULL) {
     printf ("Cannot open file for writing. \ n");
     exit (1);
   }
   printf ("Write unordered data to disk file. \ n");
   fwrite (addrs, sizeof (addrs), 1, fp);
   fclose (fp);

   / * now sort the file * /
   if ((fp = fopen ("mlist", "rb +")) == NULL) {
     printf ("Cannot open file for read / write. \ n");
     exit (1);
   }

   printf ("Sort disk file. \ n");
   quick_disk (fp, NUM_ELEMENTS);
   fclose (fp);
   printf ("File sorted. \ n");

   return 0;
 }

 / * Quick file sorting.  * /
 void quick_disk (FILE * fp, int count)
 {
   qs_disk (fp, 0, count-1);
 }

 void qs_disk (FILE * fp, int left, int right)
 {
   long int i, j;
   char x [100];

   i = left;  j = right;

   strcpy (x, get_zip (fp, (long) (i + j) / 2));  / * get average postal
                                             index * /

   do {
     while ((strcmp (get_zip (fp, i), x) <0) && (i <right)) i ++;
     while ((strcmp (get_zip (fp, j), x)> 0) && (j> left)) j--;

     if (i <= j) {
       swap_all_fields (fp, i, j);
       i ++;  j--;
     }
   } while (i <= j);

   if (left <j) qs_disk (fp, left, (int) j);
   if (i <right) qs_disk (fp, (int) i, right);
 }

 void swap_all_fields (FILE * fp, long i, long j)
 {
   char a [sizeof (ainfo)], b [sizeof (ainfo)];

   / * read i and j in memory * /
   fseek (fp, sizeof (ainfo) * i, SEEK_SET);
   fread (a, sizeof (ainfo), 1, fp);

   fseek (fp, sizeof (ainfo) * j, SEEK_SET);
   fread (b, sizeof (ainfo), 1, fp);

   / * then burn them to disk, swapping * /
   fseek (fp, sizeof (ainfo) * j, SEEK_SET);
   fwrite (a, sizeof (ainfo), 1, fp);
   fseek (fp, sizeof (ainfo) * i, SEEK_SET);
   fwrite (b, sizeof (ainfo), 1, fp);
 }

 / * Getting a pointer to the postal code entry * /
 char * get_zip (FILE * fp, long rec)
 {
   struct address * p;

   p = & ainfo;

   fseek (fp, rec * sizeof (ainfo), SEEK_SET);
   fread (p, sizeof (ainfo), 1, fp);

   return ainfo.zip;
 }

As you probably already noticed, I had to write two auxiliary functions to sort the address records. In the comparison fragment, the get_zip () function is used, which returns a pointer to the zip field of the compared records. The swap_all_fields () function exchanges data from two records. The order of read and write operations has a great influence on the speed of sorting. When exchanging two records, the program moves the pointer of the current record in the file first to the record i , and then to the record j . As long as the disk head is above the j record, j record data is written. This means that at this moment the head does not have to move a long distance. If the code were composed in such a way that the record i was recorded first, one more movement of the head would be needed.


Comments


To leave a comment
If you have any suggestion, idea, thanks or comment, feel free to write. We really value feedback and are glad to hear your opinion.
To reply

Algorithms

Terms: Algorithms