Is "==" in sorted array not faster than unsorted array? [duplicate]
Note: the alleged duplicate question is, I think, mostly about "<" and ">" comparisons, not "==", and hence does not answer my question about the performance of the "==" operator.
For a long time, I have believed that "processing" a sorted array should be faster than processing an unsorted array. At first, I thought using "==" in a sorted array should be faster than in an unsorted array because of - I guess - how branch prediction works:
UNSORTEDARRAY:
5 == 100 F
43 == 100 F
100 == 100 T
250 == 100 F
6 == 100 F
(other elements to check)
SORTEDARRAY:
5 == 100 F
6 == 100 F
43 == 100 F
100 == 100 T
(no need to check other elements, so all are F)
so I guessed SORTEDARRAY should be faster than UNSORTEDARRAY. But today I wrote code to generate two arrays in a header and test this, and branch prediction did not seem to work the way I thought.
I generated an unsorted array and a sorted array to test:
// generates number.h containing both arrays
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <string>
using namespace std;

// file scope, so the ~2 MB of array data does not live on the stack
int UNSORTEDARRAY[524288];
int SORTEDARRAY[sizeof(UNSORTEDARRAY)/sizeof(int)];

int main(){
    srand(time(NULL));
    for(size_t i=0;i<sizeof(SORTEDARRAY)/sizeof(int);i++){
        SORTEDARRAY[i]=UNSORTEDARRAY[i]=rand();
    }
    sort(SORTEDARRAY,SORTEDARRAY+sizeof(SORTEDARRAY)/sizeof(int));
    string u="const int UNSORTEDARRAY[]={";
    string s="const int SORTEDARRAY[]={";
    for(size_t i=0;i<sizeof(UNSORTEDARRAY)/sizeof(int);i++){
        u+=to_string(UNSORTEDARRAY[i])+",";
        s+=to_string(SORTEDARRAY[i])+",";
    }
    u.erase(u.end()-1);   // drop the trailing comma
    s.erase(s.end()-1);
    u+="};\n";
    s+="};\n";
    ofstream out("number.h");
    out << u+s;
    out.close();
}
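The generated number.h then has this shape (the actual values are omitted here; they are whatever rand() produced):

const int UNSORTEDARRAY[]={/* 524288 comma-separated values from rand() */};
const int SORTEDARRAY[]={/* the same values in ascending order */};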
so to test, I just count how many values are == RAND_MAX/2, like this:
#include "number.h"
int main(){
int count;
clock_t start = clock();
for(int i=0;i<sizeof(SORTEDARRAY)/sizeof(int);i++){
if(SORTEDARRAY[i]==RAND_MAX/2){
count++;
}
}
printf("%f\n",(float)(clock()-start)/CLOCKS_PER_SEC);
}
run 3 times:
UNSORTEDARRAY
0.005376
0.005239
0.005220
SORTEDARRAY
0.005334
0.005120
0.005223
The performance difference seems negligible, so I didn't quite believe it and changed "SORTEDARRAY[i]==RAND_MAX/2" to "SORTEDARRAY[i]>RAND_MAX/2" to see whether that made a difference:
UNSORTEDARRAY
0.008407
0.008363
0.008606
SORTEDARRAY
0.005306
0.005227
0.005146
This time there is a large difference.
Is "==" in the sorted array not faster than an unsorted array? If yes, why is ">" in sorted array faster than an unsorted array but "==" is not?
One thing that immediately comes to mind is the CPU's branch prediction algorithm.
In the case of the > comparison, the branching behavior in the sorted array is very consistent: first the if condition is consistently false, then it is consistently true. This aligns very well with even the simplest branch prediction. In the unsorted array the result of the > condition is essentially random, thwarting any branch prediction. This is what makes the sorted version faster.
In the case of the == comparison, the condition is false most of the time and only very rarely true. This works well with branch prediction regardless of whether the array is sorted or not, so the timings are essentially the same.
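To see both effects side by side, here is a minimal self-contained sketch of such an experiment (the array size, target value, and chrono-based timing are my own illustrative choices, not the code from the question; also note that at higher optimization levels the compiler may emit branchless code for such a simple loop, in which case the difference largely disappears):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

volatile int sink;   // keeps the compiler from optimizing the counting loops away

// Time one pass over v, counting the elements for which pred(x) is true.
template <class Pred>
static double time_pass(const std::vector<int>& v, Pred pred){
    auto start = std::chrono::steady_clock::now();
    int count = 0;
    for (int x : v)
        if (pred(x))
            ++count;
    sink = count;
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main(){
    std::vector<int> unsorted(1 << 24);
    for (int& x : unsorted) x = rand();
    std::vector<int> sorted = unsorted;
    std::sort(sorted.begin(), sorted.end());

    const int target = RAND_MAX / 2;
    // "==" is almost always false, so the branch is well predicted either way.
    printf("== unsorted: %f s\n", time_pass(unsorted, [=](int x){ return x == target; }));
    printf("== sorted:   %f s\n", time_pass(sorted,   [=](int x){ return x == target; }));
    // ">" is true for roughly half of the random values: predictable only when sorted.
    printf(">  unsorted: %f s\n", time_pass(unsorted, [=](int x){ return x >  target; }));
    printf(">  sorted:   %f s\n", time_pass(sorted,   [=](int x){ return x >  target; }));
}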
N.B. I'm making this an answer since it's too long for a comment.
The effect here is exactly what is already explained in great detail in the plentiful answers to that other question (the alleged duplicate). Processing a sorted array was faster in that case because of branch prediction.
Here, the culprit is again branch prediction. The == test is very seldom true, so branch prediction works about the same for both. When you changed it to >, you got the behavior explained in that question, and for the same reason.
Moral:
I believe "processing" a sorted array should be faster than [an ]unsorted array.
You need to know why. This isn't some magical rule, and it isn't always true.
The comparison == has less of a relation to ordering than > does. Correctly or incorrectly predicting == would only depend on the number of duplicate values and their distribution.
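For example, if a sizable fraction of the elements were equal to the value being searched for, sorting would group all of the matches together and the == branch would become predictable again. A rough sketch of that experiment (the 25% duplicate ratio and the timing approach are purely illustrative assumptions):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

volatile int sink;   // keeps the counting loop from being optimized away

// One "== target" pass over the data, returning elapsed seconds.
static double count_equal(const std::vector<int>& v, int target){
    auto start = std::chrono::steady_clock::now();
    int count = 0;
    for (int x : v)
        if (x == target)        // taken for roughly 25% of the elements below
            ++count;
    sink = count;
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main(){
    const int target = RAND_MAX / 2;
    std::vector<int> data(1 << 24);
    // Make about a quarter of the elements equal to the target, the rest random.
    for (int& x : data) x = (rand() % 4 == 0) ? target : rand();

    printf("unsorted: %f s\n", count_equal(data, target));
    std::sort(data.begin(), data.end());   // groups all of the matches together
    printf("sorted:   %f s\n", count_equal(data, target));
}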
You can use perf stat to view the performance counters...
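(The proc-eq and proc-gt sources are not included in this answer; a plausible reconstruction, assuming they read whitespace-separated integers from stdin and make several counting passes, would be something like this, with proc-gt differing only in using > instead of ==:)

#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical proc-eq: read integers from stdin, then repeatedly count
// how many equal RAND_MAX/2. proc-gt would replace "==" with ">".
int main(){
    std::vector<int> data;
    int x;
    while (scanf("%d", &x) == 1)
        data.push_back(x);

    const int target = RAND_MAX / 2;
    long long count = 0;
    for (int pass = 0; pass < 100; ++pass)   // several passes so the comparison loop dominates
        for (int v : data)
            if (v == target)
                ++count;
    printf("%lld\n", count);                 // output is redirected to /dev/null above
}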
jason@io /tmp $ lz4 -d ints | perf stat ./proc-eq >/dev/null
Successfully decoded 104824717 bytes
Performance counter stats for './proc-eq':
5226.932577 task-clock (msec) # 0.953 CPUs utilized
31 context-switches # 0.006 K/sec
24 cpu-migrations # 0.005 K/sec
3,479 page-faults # 0.666 K/sec
15,763,486,767 cycles # 3.016 GHz
4,238,973,549 stalled-cycles-frontend # 26.89% frontend cycles idle
<not supported> stalled-cycles-backend
31,522,072,416 instructions # 2.00 insns per cycle
# 0.13 stalled cycles per insn
8,515,545,178 branches # 1629.167 M/sec
10,261,743 branch-misses # 0.12% of all branches
5.483071045 seconds time elapsed
jason@io /tmp $ lz4 -d ints | sort -n | perf stat ./proc-eq >/dev/null
Successfully decoded 104824717 bytes
Performance counter stats for './proc-eq':
5536.031410 task-clock (msec) # 0.348 CPUs utilized
198 context-switches # 0.036 K/sec
21 cpu-migrations # 0.004 K/sec
3,604 page-faults # 0.651 K/sec
16,870,541,124 cycles # 3.047 GHz
5,300,218,855 stalled-cycles-frontend # 31.42% frontend cycles idle
<not supported> stalled-cycles-backend
31,526,006,118 instructions # 1.87 insns per cycle
# 0.17 stalled cycles per insn
8,516,336,829 branches # 1538.347 M/sec
10,980,571 branch-misses # 0.13% of all branches
15.930144154 seconds time elapsed
jason@io /tmp $ lz4 -d ints | perf stat ./proc-gt >/dev/null
Successfully decoded 104824717 bytes
Performance counter stats for './proc-gt':
5293.065703 task-clock (msec) # 0.957 CPUs utilized
38 context-switches # 0.007 K/sec
50 cpu-migrations # 0.009 K/sec
3,466 page-faults # 0.655 K/sec
15,972,451,322 cycles # 3.018 GHz
4,350,726,606 stalled-cycles-frontend # 27.24% frontend cycles idle
<not supported> stalled-cycles-backend
31,537,365,299 instructions # 1.97 insns per cycle
# 0.14 stalled cycles per insn
8,515,606,640 branches # 1608.823 M/sec
15,241,198 branch-misses # 0.18% of all branches
5.532285374 seconds time elapsed
jason@io /tmp $ lz4 -d ints | sort -n | perf stat ./proc-gt >/dev/null
Performance counter stats for './proc-gt':
5203.873321 task-clock (msec) # 0.339 CPUs utilized
7 context-switches # 0.001 K/sec
22 cpu-migrations # 0.004 K/sec
3,459 page-faults # 0.665 K/sec
15,830,273,846 cycles # 3.042 GHz
4,456,369,958 stalled-cycles-frontend # 28.15% frontend cycles idle
<not supported> stalled-cycles-backend
31,540,409,224 instructions # 1.99 insns per cycle
# 0.14 stalled cycles per insn
8,516,186,042 branches # 1636.509 M/sec
10,205,058 branch-misses # 0.12% of all branches
15.365528326 seconds time elapsed