Search for common elements in an array

Lecture



Task: in an array of length N, find an element that repeats more than N / 2 times.

It would seem, what is there to think? Take a Dictionary <element value, number of occurrences>, for one pass through the array, count the occurrences of each element, then select the desired element from the dictionary. Solution for O ( N ), where could it be even faster?
  Search for common elements in an array
There is one caveat: for the dictionary, we need O ( N ) additional memory - several times larger than the size of the original array, and this is when implementing the dictionary even with a hash table, even a tree. What will we do if our goal is to process the signal with a device with a small memory? Array - measurements of the signal level, of which one is the “real” transmitted level, and the rest are noise and interference. Do you really have to mess around with hash tables and trees to determine the “real” level?

Fortunately, no: there is enough O (1) additional memory, and there is still one pass through the array. The Boyer-Moore algorithm, the same Boyer and Moore, that came up with a much more well-known search algorithm for substrings, is easiest to imagine as follows: at a party N people gathered , and on each one element from the array. When there are two people whose elements are different, they sit down to discuss it. In the end, only people with the same elements will remain standing; Obviously, this is the very element that has been met more than N / 2 times.

The implementation of the Boyer-Moore algorithm takes only a few lines:

 int * majority ( int [] array, int N) {
   int confidence = 0 ;  // the number of people who did not find a pair and left to stand
   int * candidate = NULL ;  // last person not finding a pair -
                          // maybe its element is most common

   // go through the array and sit down couples with different elements
   for ( int i = 0 ; i <N; i ++) {

     // if everyone is still sitting, then the next one will have to stand
     if (confidence == 0) {
       candidate = array + i ;
       confidence ++ ;
     }

     // otherwise, the next one either sits down with one of those standing
     // will either stand with them if he has the same element
     else if (* candidate == array [i])) 
       confidence ++ ;
     else 
       confidence-- ;
   }

   return confidence> 0?  candidate: NULL ;
 }


In the end, we get the “most likely candidate” for presence in N / 2 copies: if such an element really exists in the array, then it will be found; if there is no such element, then perhaps it will remain just some poor fellow who lacked a pair. For a more stringent implementation of the majority, it is required to go through the array a second time and check whether the found element actually occurs the required number of times.
  Search for common elements in an array

Let's complicate the task. Now, in an array of length N, we need to find elements that repeat more than N / 3 times - if there are two, then both, if there is one, then one.

For example, our device with a small memory needs to accept a binary signal with unknown levels of zero and one, but it is known that about half of the time is zero, about half of the time is one, and any signal levels that are different from them are interferences and bounce from switching.

The idea of ​​the past algorithm is also easy to generalize for triples: let people with different elements sit in three. So, at the end of the party there will be a maximum of people with two different elements. If an element has been met more than N / 3 times, then the person with him is guaranteed to remain standing, after all, the sitting threes will get no more than N / 3. As in the past case, if the required elements exist, then they will be found, but if they are not there, then someone can be found.

The code is a little different from the previous one: there is still one pass through the array and O (1) additional memory.

 void majority ( int [] array, int N, int ** cand1, int ** cand2) {
   int conf1 = 0, conf2 = 0 ;  // number of standing people with two elements
   * cand1 = NULL ;  * cand2 = NULL;  // two standing people with different elements

   // go through the array and sit down triples with different elements
   for ( int i = 0 ; i <N; i ++) {

     // does the next have the same element as the ones standing?  then get up with them
     if (conf1> 0 && ** cand1 == array [i])
       conf1 ++ ;
     else if (conf2> 0 && ** cand2 == array [i])
       conf2 ++ ;
     else // can, stand with only one element, or not at all?
       if (conf1 == 0) {
         * cand1 = array + i ;
         conf1 ++ ;
       } else if (conf2 == 0) {
         * cand2 = array + i ;
         conf2 ++ ;
       }
     else { // stand with two other elements, so we set down the whole three
       conf1-- ;
       conf2-- ;
     }
   }

   if (conf1 == 0) * cand1 = NULL ;
   if (conf2 == 0) * cand2 = NULL ;
 }


This algorithm was published in 1982 by American scientists Jayadev Misra and David Gries (Jayadev Misra & David Gries), and their work uses a more boring metaphor - a bag with N numbers from which they extract 3 different numbers for each operation. In addition, their pseudo-code is not similar to any language understandable to the modern programmer. I liked more the explanation of their algorithm in the last year’s lecture notes by American professor Amit Chakrabarti.

  Search for common elements in an array

In the most general form, when in an array of length N you need to find elements that repeat more than N / k times, you will have to use a dictionary. Store in the dictionary, we will only those elements with which people are - that is, no more than k -1 entries.

 int [] majority ( int [] array , int N, int k) {
   // vocabulary of standing people is initially empty
   Dictionary < int , uint > candidates = new Dictionary < int , uint > {};

   // go through the array and shrink the groups by k
   for ( int i = 0; i <N; i ++) {

     // does the next have the same element as the ones standing?  then get up with them
     if (candidates.ContainsKey ( array [i]))
       candidates [ array [i]] ++;
     else // can cost less than k-1 elements?
       if (candidates.Count <k - 1)
         candidates [ array [i]] = 1;
     else // stand with k-1 other elements, so we shrink the whole group
       foreach ( int l in candidates.Keys)
         if (candidates [l] - == 0) // (**)
           candidates.Remove (l);  // (*)
   }

   return candidates.Keys.ToArray ();
 }


In this most general form of the algorithm, there is still one pass through the array and O ( k ) additional memory. If we use a hash table to implement the dictionary, and link all entries in the dictionary to the list, then the overall complexity of the algorithm will remain linear: the line (*) will be executed at most N times after deleting the entry from the dictionary, because at each iteration of the main loop the dictionary is added no more than one entry. As an exercise, readers are asked to understand why the line (**) does not violate the linearity of the algorithm.

Thus, our device with a small memory would be able to communicate with one fluffy beast, recently prepared by habraumelets. The signals of this animal have five logical levels: we assume k = 6, and we get all five levels right on the move, even without storing the signal in memory. It is only necessary to provide a protocol so that all five levels occur in the signal equally often.

Other applications are mentioned for the Misra-Gris algorithm. For example, you can monitor real-time traffic on the network, and if some one host consumes a disproportionately large part of the traffic, start an investigation. You can also monitor clicks on banners, financial transactions, the flow of instructions in the simulated processor ... In general, everywhere where a large number of repetitions is a suspicious anomaly.


Comments


To leave a comment
If you have any suggestion, idea, thanks or comment, feel free to write. We really value feedback and are glad to hear your opinion.
To reply

Algorithms

Terms: Algorithms