Save time with parallel FOR loop

I have a question concerning parallel for loops. I have the following code:

    public static void MultiplicateArray(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = array[i] * factor;
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        }
    }

Now I try to add parallel function:

    public static void MultiplicateArray(double[] array, double factor)
    {
        Parallel.For(0, array.Length, i =>
            {
                array[i] = array[i] * factor;
            });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        });
    }

The issue is, that I want to save time, not to waste it. With the standard for loop it computes about 2 minutes, but with the parallel for loop it takes 3 min. Why?


Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code, each iteration is very short (basically, just a few CPU instructions), this overhead can become prominent.

Because of this, I thought using Parallel.For() is not the right solution for you. Instead, if you parallelize your code manually (which is very simple in this case), you may see the performance improve.

To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200 000 000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took something around 0.45 s, but from time to time, it spiked to 8–9 s!

First, I'll try to improve the common case and I'll come to those spikes later. We want to process the array by N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But for loop over each item in an array is one of the most optimized constructs. Can't we do something to help the compiler? Extracting computing the bound of the loop could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here, memory bandwidth is.

Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does, why? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But, if one thread finishes early, it will try to help processing the range of another thread that hasn't finished yet. If that happens, you will get a lot of false sharing, which could slow down the code a lot. And my own test with forcing false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of the Parallel.For() seems to help with the spikes a little.

Of course, all those measurements are specific to the hardware on my computer and will be different for you, so you should make your own measurements.

The code I used:

static void Main()
{
    double[] array = new double[200 * 1000 * 1000];

    for (int i = 0; i < array.Length; i++)
        array[i] = 1;

    for (int i = 0; i < 5; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Serial(array, 2);
        Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelFor(array, 2);
        Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelForDegreeOfParallelism(array, 2);
        Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallel(array, 2);
        Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMax(array, 2);
        Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMaxHalfParallelism(array, 2);
        Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelFalseSharing(array, 2);
        Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
    }
}

static void Serial(double[] array, double factor)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i] * factor;
    }
}

static void ParallelFor(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, i => { array[i] = array[i] * factor; });
}

static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i => { array[i] = array[i] * factor; });
}

static void CustomParallel(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMax(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount / 2;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelFalseSharing(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    int i = -1;

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                int j = Interlocked.Increment(ref i);
                while (j < array.Length)
                {
                    array[j] = array[j] * factor;
                    j = Interlocked.Increment(ref i);
                }
            });
    }

    Task.WaitAll(tasks);
}

Example output:

Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s

Svick already provided a great answer but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For() but that you have to process larger chunks of data.

This can still be done using Parallel.For() like this:

static void My(double[] array, double factor)
{
    int degreeOfParallelism = Environment.ProcessorCount;

    Parallel.For(0, degreeOfParallelism, workerId =>
    {
        var max = array.Length * (workerId + 1) / degreeOfParallelism;
        for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
            array[i] = array[i] * factor;
    });
}

which does the same thing as svicks CustomParallelExtractedMax() but is shorter, simpler and (on my machine) performs even slightly faster:

Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s

Btw, the keyword for this which is missing from all the other answers is granularity.


See Custom Partitioners for PLINQ and TPL:

In a For loop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop might be small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Create overloads to create an IEnumerable<T> of range partitions over the source elements. Then, you can pass this collection of ranges to a ForEach method whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.

In your loop body, you are performing a single multiplication, and the overhead of the delegate call will be very noticeable.

Try this:

public static void MultiplicateArray(double[] array, double factor)
{
    var rangePartitioner = Partitioner.Create(0, array.Length);

    Parallel.ForEach(rangePartitioner, range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            array[i] = array[i] * factor;
        }
    });
}

See also: Parallel.ForEach documentation and Partitioner.Create documentation.