What are the pitfalls in implementing binary search? [closed]

Binary search is harder to implement than it looks. "Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky…" — Donald Knuth.

Which bugs are most likely to be introduced into a new binary search implementation?


Solution 1:

This question was just asked again recently. Apart from Knuth's quote that "Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky", there is the staggering historical fact (see TAOCP, Volume 3, section 6.2.1) that binary search was first published in 1946 but the first published binary search without bugs was in 1962. And there is Bentley's experience that when he assigned binary search in courses for professional programmers at places like Bell Labs and IBM and gave them two hours, everyone reported they had got it right, and on examining their code 90% of them had bugs — year after year.

Perhaps the basic reason why so many programmers make mistakes with binary search, besides Sturgeon's Law, is that they are not careful enough: Programming Pearls quotes this as the "write your code, throw it over the wall, and have Quality Assurance or Testing deal with the bugs" approach. And there is a lot of scope for error. Not just the overflow error that several of the other answers here mention, but logical errors.

Below are some examples of binary-search errors. This is by no means exhaustive. (As Tolstoy writes in Anna Karenina — "All happy families are alike; every unhappy family is unhappy in its own way" — every incorrect binary search program is incorrect in its own way.)

Pattis

The following Pascal code is taken from the paper Textbook errors in binary searching (1988) by Richard E Pattis. He looked at twenty textbooks and came up with this binary search (BTW, Pascal uses array indexes starting from 1):

PROCEDURE BinarySearch (A         : anArray,
                        Size      : anArraySize,
                        Key       : INTEGER,
                        VAR Found : BOOLEAN;
                        VAR Index : anArrayIndex);
Var Low, High : anArrayIndex;
BEGIN         
   LOW := 1;
   High := Size;
   
   REPEAT
      Index := (Low + High) DIV 2;
      If Key < A[Index]
         THEN High := Index - 1
         ELSE Low  := Index + 1
   UNTIL (Low > High) OR (Key = A[Index]);

   FOUND := (Low <= High)
END;

Looks ok? This has more than one error. Before reading further, see if you can find them all. You should be able to guess what the code does even if you're seeing Pascal for the first time.


He describes five errors that many programs have, and in particular the above has:

Error 1: It doesn't run in O(log n) time, where n = Size. In their enthusiasm for Proper Programming Practice, some programmers write binary search as a function/procedure, and pass it an array. (This is not specific to Pascal; imagine in C++ passing a vector by value rather than by reference.) This is Θ(n) time just to pass the array to the procedure, which defeats the whole purpose. Even worse, some authors apparently give a recursive binary search, which passes an array each time, giving a running time that is Θ(n log n). (This is not far-fetched; I've actually seen code like this.)

Error 2: It fails when size = 0. This may be ok. But depending on the intended application, the list/table being searched may shrink to 0, and it must be handled somewhere.

Error 3: It gives a wrong answer. Whenever the final iteration of the loop starts with Low=High (e.g. when Size=1), it sets Found:=False, even if Key is in the array.

Error 4: It causes an error whenever Key is less than the smallest element of the array. (After Index becomes 1, it sets High to 0, etc.; causes an out-of-bounds error.)

Error 5: It causes an error whenever Key is greater than the largest element of the array. (After Index becomes Size, it sets Low to Size+1 etc.; causes an out-of-bounds error.)

He also points out that some obvious ways to "fix" these errors turn out to be wrong as well. Real-life code also often has this property, when a programmer wrote something incorrect, found an error, and then "fixed" it until it seemed correct without thinking carefully enough.

Of the 20 textbooks he tried, only 5 had a correct binary search. In the remaining 15 (he says 16, ironically), he found 11 instances of Error 1, six instances of Error 2, two each of Errors 3 and 4, and one of Error 5. These numbers add up to much more than 15, because several of them had multiple errors.


More examples

Binary search is used for more than just searching an array to see if it contains a value, so here is one more example for now. I may update this list as I think of more.

Suppose you have an increasing (non-decreasing) function f: R->R, and (because you want a root of f, say), you want to find the largest t such that f(t) < 0. See how many bugs you can find in the following:

float high = INF, low = 0;
while(high != low) {
   float mid = (low + high)/2;
   if(f(mid)>0) high=mid;
   else low=mid;
}
printf("%f", high);

(Some: There may be no such t in [0,INF], if f is 0 on an interval then this is wrong, never compare floating point numbers for equality, etc.)

How to write binary search

I used to make several of those errors — the first few dozen times I wrote binary search (which was during programming contests with time pressure), about 30% of the time there was a bug somewhere — until I found the simple way to write it correctly. I haven't got a binary search wrong since (as I recall). The trick is very simple:

Maintain an invariant.

Find/decide and make explicit some invariant property that your "low" and "high" variables satisfy throughout the loop: before, during and after. Make sure it is never violated. Of course you also need to think about the termination condition. This is explained in good detail in Chapter 4 of Programming Pearls which derives a binary search program from semi-formal methods.

For instance, to slightly abstract out the condition being examined, suppose you want to find the largest integer value x for which some condition poss(x) is true. Even this explicitness of problem definition is more than many programmers start with. (For instance, poss(x) may be a[x] ≤ v for some value v; this is to find out how many elements in the sorted array a are grater than v, say.) Then, one way to write binary search is:

int lo=0, hi=n;
//INVARIANT: poss(lo) is true, poss(hi) is false
//Check and ensure invariant before starting binary search
assert(poss(lo)==true);
assert(poss(hi)==false);
while(hi-lo>1) {
    int mid = lo + (hi-lo)/2;
    if(poss(mid)) lo = mid;
    else hi = mid;
}
printf("%d \n",lo);

You can add more assert statements and other checks, but the basic idea is that because you update lo to mid only when you know that poss(mid) is true, you maintain the invariant that poss(lo) is always true. Similarly, you set hi to mid only when poss(mid) is false, so you maintain the invariant that poss(hi) is always false. Think about the termination condition separately. (Note that when hi-lo is 1, mid is the same as lo. So don't write the loop as while(hi>lo), or you'll have a infinite loop.) At the end of the loop, you're guaranteed that hi-lo is at most 1, and because your always maintained your invariant (poss(lo) is true and poss(hi) is false), it cannot be 0. Also, again because of your invariant, you know that lo is the value to return/print/use. There are other ways to write binary search of course, but maintaining an invariant is a trick/discipline that always helps.

Solution 2:

Here are some I can think of:

  • Off-by-one errors, when determining the boundary of the next interval
  • Handling of duplicate items, if you are suppose to return the first equal item in the array but instead returned a subsequent equal item
  • Numerical underflows/overflows when computing indices, with huge arrays
  • Recursive vs non-recursive implementation, a design choice you should consider

Are these what you have in mind?