Why prefer start + (end - start) / 2 over (start + end) / 2 when calculating the middle of an array?
I've seen programmers use the formula
mid = start + (end - start) / 2
instead of using the simpler formula
mid = (start + end) / 2
for finding the middle element in the array or list.
Why do they use the former one?
Solution 1:
There are three reasons.
First of all, start + (end - start) / 2 works even if you are using pointers, as long as end - start doesn't overflow [1].
int *start = ..., *end = ...;
int *mid = start + (end - start) / 2; // works as expected
int *mid = (start + end) / 2; // type error, won't compile
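The pointer case above can be sketched as a small helper. The name mid_ptr is hypothetical and not from the original answer, and the sketch assumes both pointers refer to the same array (or one past its end):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: midpoint of the half-open pointer range [start, end).
 * Assumes both pointers point into (or one past the end of) the same array. */
const int *mid_ptr(const int *start, const int *end)
{
    assert(start <= end);
    ptrdiff_t len = end - start;  /* pointer - pointer yields an element count */
    return start + len / 2;       /* pointer + integer is well defined */
}
```

For a five-element array a, mid_ptr(a, a + 5) points at a[2]. Adding two pointers has no meaning in C, which is why the other formula is rejected by the compiler.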
Second of all, start + (end - start) / 2 won't overflow if start and end are large positive numbers. With signed operands, overflow is undefined behavior:
int start = 0x7ffffffe, end = 0x7fffffff;
int mid = start + (end - start) / 2; // works as expected
int mid = (start + end) / 2; // overflow... undefined
(Note that end - start may overflow, but only if start < 0 or end < 0.)
With unsigned arithmetic, overflow is well defined (the result wraps around), but it gives you the wrong answer. However, for unsigned operands, start + (end - start) / 2 will never overflow as long as end >= start.
unsigned start = 0xfffffffeu, end = 0xffffffffu;
unsigned mid = start + (end - start) / 2; // works as expected
unsigned mid = (start + end) / 2; // mid = 0x7ffffffe
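A minimal sketch of the unsigned case, using uint32_t so the result does not depend on the width of unsigned on a given platform. The function names are illustrative assumptions, not part of the original answer:

```c
#include <assert.h>
#include <stdint.h>

/* Naive midpoint: the sum is reduced modulo 2^32 before the division,
 * so large inputs give a wrong (but well-defined) result. */
uint32_t mid_naive_u32(uint32_t start, uint32_t end)
{
    return (uint32_t)(start + end) / 2;
}

/* Difference-based midpoint: end - start cannot wrap when start <= end,
 * and the result never exceeds end. */
uint32_t mid_safe_u32(uint32_t start, uint32_t end)
{
    assert(start <= end);
    return start + (end - start) / 2;
}
```

With start = 0xfffffffe and end = 0xffffffff, the naive version returns 0x7ffffffe, which lies far outside the range, while the safe version returns 0xfffffffe.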
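Finally, you often want to round towards the start element. Integer division in C truncates toward zero, so (start + end) / 2 rounds toward end whenever the sum is negative, whereas end - start is never negative when end >= start: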
int start = -3, end = 0;
int mid = start + (end - start) / 2; // -2, closer to start
int mid = (start + end) / 2; // -1, surprise!
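The rounding difference can be checked with two small functions. This is only a sketch: the names are hypothetical, and it assumes that neither start + end nor end - start overflows for the inputs used.

```c
#include <assert.h>

/* Naive midpoint: division truncates toward zero, so a negative odd sum
 * is rounded up, toward end. Assumes start + end does not overflow. */
int mid_naive_int(int start, int end)
{
    return (start + end) / 2;
}

/* Difference-based midpoint: end - start is non-negative, so truncation
 * always rounds down, toward start. Assumes end - start is representable. */
int mid_safe_int(int start, int end)
{
    assert(start <= end);
    return start + (end - start) / 2;
}
```

For the range [-3, 0] the two functions return -1 and -2 respectively, while for a non-negative range such as [2, 5] both return 3.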
Footnotes
[1] According to the C standard, if the result of pointer subtraction is not representable as a ptrdiff_t, then the behavior is undefined. However, in practice, this requires allocating a char array using at least half the entire address space.
Solution 2:
We can take a simple example to demonstrate this. Suppose that, in a certain large array, we are trying to find the midpoint of the range [1000, INT_MAX]. INT_MAX is the largest value the int data type can store. Adding even 1 to it overflows: in C this is undefined behavior, and on typical two's-complement machines the result wraps around to a negative value.
Here, start = 1000 and end = INT_MAX.
Using the formula (start + end)/2, the midpoint will be
(1000 + INT_MAX)/2
= (INT_MIN + 999)/2 (assuming the overflowing sum wraps around)
which is negative and may cause a segmentation fault if we try to index using this value.
But using the formula start + (end - start)/2, we get:
(1000 + (INT_MAX - 1000)/2)
= (1000 + INT_MAX/2 - 500)
= (INT_MAX/2 + 500)
which will not overflow.
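The arithmetic above can be verified without ever performing the overflowing addition. In this sketch the names sum_would_overflow and midpoint_of_range are hypothetical, and both functions assume non-negative inputs:

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if start + end would exceed INT_MAX. The comparison is
 * rearranged so that the overflowing addition is never evaluated. */
int sum_would_overflow(int start, int end)
{
    assert(start >= 0 && end >= 0);
    return start > INT_MAX - end;
}

/* Midpoint of [start, end] for non-negative indices; no intermediate
 * value exceeds end, so nothing can overflow. */
int midpoint_of_range(int start, int end)
{
    assert(0 <= start && start <= end);
    return start + (end - start) / 2;
}
```

For the range [1000, INT_MAX], sum_would_overflow reports 1, and midpoint_of_range returns INT_MAX/2 + 500, matching the derivation.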
Solution 3:
To add to what others have already said, the first formula conveys its meaning more clearly to those who are less mathematically minded:
mid = start + (end - start) / 2
reads as:
mid equals start plus half of the length.
whereas:
mid = (start + end) / 2
reads as:
mid equals half of start plus end.
This does not seem as clear as the first, at least when expressed like that.
As Kos pointed out, it can also be read as:
mid equals the average of start and end
This is clearer, but still not, at least in my opinion, as clear as the first.