Why is processing a sorted array faster than processing an unsorted array?
Here is a piece of C++ code that shows some very peculiar behavior.
For some reason, sorting the data (before the timed region) miraculously makes the primary loop almost six times faster:
#include <algorithm>
#include <cstdlib>   // for std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
- Without `std::sort(data, data + arraySize);`, the code runs in 11.54 seconds.
- With the sorted data, the code runs in 1.93 seconds.
(Sorting itself takes more time than this one pass over the array, so it’s not actually worth doing if we needed to calculate this for an unknown array.)
Initially, I thought this might be just a language or compiler anomaly, so I tried Java:
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster.
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100000; ++i)
        {
            for (int c = 0; c < arraySize; ++c)
            {   // Primary loop.
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
It produced a similar, though less extreme, result.
My first thought was that sorting brings the data into the cache, but that’s silly because the array was just generated.
- What is going on?
- Why is processing a sorted array faster than processing an unsorted array?
The code is summing up some independent terms, so the order should not matter.
Related / follow-up Q&As about the same effect with different/later compilers and options:
- Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?
- gcc optimization flag -O3 makes code slower than -O2
- @screwnut: here’s an experiment which would show that partitioning is sufficient: create an unsorted but partitioned array with otherwise random contents. Measure time. Sort it. Measure time again. The two measurements should be basically indistinguishable. (Experiment 2: create a random array. Measure time. Partition it. Measure time again. You should see the same speed-up as sorting. You could roll the two experiments into one.) – Jonas Kölker Oct 5, 2020 at 8:26 [A sketch of this experiment follows these comments.]
- Btw., on Apple M1 the code runs in 17 sec unsorted and in 7 sec sorted, so the branch prediction penalty isn’t that bad on a RISC architecture. – Piotr Czapla Mar 31, 2021 at 9:07
- @RomanYavorskyi: It depends on the compiler. If they make branchless asm for this specific test (e.g. as part of vectorizing with SIMD, like in Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?, or just with scalar `cmov`, as in gcc optimization flag -O3 makes code slower than -O2), then sorted or not doesn’t matter. But unpredictable branches are still a very real thing when it’s not as simple as counting, so it would be insane to delete this question. – Peter Cordes Apr 15, 2021 at 6:31
- @screwnut: To be fair, though, it still wouldn’t be worth partitioning first, because partitioning requires conditional copying or swapping based on the same `array[i] > 128` compare. (Unless you’re going to be counting multiple times, and you want to partition the bulk of an array so it’s still fast, only mispredicting in an unpartitioned part after some appends or modifications.) If you can get the compiler to do it, ideally just vectorize with SIMD, or at least use branchless scalar if the data is unpredictable. (See previous comment for links.) – Peter Cordes Apr 15, 2021 at 6:46 [A branchless-scalar sketch follows these comments.]
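The experiment Jonas Kölker proposes maps directly onto the question’s loop. Here is a minimal sketch of it, reusing the question’s array size and 128 threshold; the timing harness, the helper name timeSum, and the use of std::partition are my choices, not code from the thread:

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

// Run the question's primary loop and report how long it takes.
static void timeSum(const std::vector<int>& data, const char* label)
{
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
        for (int v : data)
            if (v >= 128)
                sum += v;
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::cout << label << ": " << elapsed.count()
              << " s (sum = " << sum << ")\n";
}

int main()
{
    std::vector<int> data(32768);
    for (int& v : data)
        v = std::rand() % 256;

    timeSum(data, "random");       // unpredictable branch: slow

    // Partitioned but NOT sorted: every value < 128 comes first,
    // each half still in random order internally.
    std::partition(data.begin(), data.end(),
                   [](int v) { return v < 128; });
    timeSum(data, "partitioned");  // branch is now predictable: fast

    std::sort(data.begin(), data.end());
    timeSum(data, "sorted");       // should match the partitioned time
}

If branch misprediction is the cause, the partitioned and sorted timings should be essentially indistinguishable, and both far faster than the random one (assuming the compiler hasn’t already made the loop branchless, as the comments above note).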
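And for the branchless-scalar idea Peter Cordes mentions, here is one possible sketch (mine, not his): rewriting the if as a select always adds either data[c] or 0, which gives the compiler an easy opportunity to emit a cmov or SIMD blend instead of a conditional branch, so the running time no longer depends on the data’s order:

// Branchy: the CPU must predict the outcome of data[c] >= 128
// on every iteration.
long long sumBranchy(const int* data, unsigned n)
{
    long long sum = 0;
    for (unsigned c = 0; c < n; ++c)
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}

// Branchless: unconditionally add either data[c] or 0. Compilers
// typically lower this select to a cmov (or vectorize it), so there
// is nothing to mispredict.
long long sumBranchless(const int* data, unsigned n)
{
    long long sum = 0;
    for (unsigned c = 0; c < n; ++c)
        sum += (data[c] >= 128) ? data[c] : 0;
    return sum;
}

As the comments say, modern compilers often perform this transformation on their own at higher optimization levels, which is why the related question above measures no difference between sorted and unsorted input under modern x86-64 clang.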