Save time with parallel FOR loop


Problem description

I have a question concerning parallel for loops. I have the following code:

    public static void MultiplicateArray(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = array[i] * factor;
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        }
    }

Now I try to add parallel function:

    public static void MultiplicateArray(double[] array, double factor)
    {
        Parallel.For(0, array.Length, i =>
            {
                array[i] = array[i] * factor;
            });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        });
    }

The issue is that I want to save time, not waste it. With the standard for loop the computation takes about 2 minutes, but with the parallel for loop it takes 3 minutes. Why?

Recommended answer

Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since each iteration in your code is very short (basically just a few CPU instructions), this overhead can become prominent.

Because of this, I think using Parallel.For() is not the right solution for you. Instead, if you parallelize your code manually (which is very simple in this case), you may see the performance improve.

To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200,000,000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took around 0.45 s, but from time to time it spiked to 8–9 s!

First, I'll try to improve the common case; I'll come to those spikes later. We want to process the array with N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over each item of an array is one of the most optimized constructs, so can't we do something to help the compiler? Extracting the computation of the loop's bound could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here; memory bandwidth is.

Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does. Why? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But if one thread finishes early, it will try to help process the range of another thread that hasn't finished yet. If that happens, you will get a lot of false sharing, which can slow down the code considerably. And my own test with forced false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of Parallel.For() seems to help with the spikes a little.
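As a side note (this variant is my sketch, not part of the original measurements): if you want to keep the convenience of the Parallel API while still giving each worker a contiguous chunk, `Partitioner.Create(int, int)` from `System.Collections.Concurrent` produces range tuples, so the delegate is invoked once per chunk instead of once per element:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class RangePartitionedMultiply
{
    public static void MultiplicateArray(double[] array, double factor)
    {
        // Partitioner.Create(0, length) yields contiguous (fromInclusive,
        // toExclusive) ranges, so each thread writes only to its own region
        // of the array and the per-iteration delegate overhead disappears.
        Parallel.ForEach(Partitioner.Create(0, array.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                array[i] = array[i] * factor;
            }
        });
    }
}
```

Whether this matches the hand-rolled Task version above depends on your hardware, so measure it yourself.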

Of course, all those measurements are specific to the hardware of my computer and will be different for you, so you should make your own measurements.

The code I used:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static void Main()
{
    double[] array = new double[200 * 1000 * 1000];

    for (int i = 0; i < array.Length; i++)
        array[i] = 1;

    for (int i = 0; i < 5; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Serial(array, 2);
        Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelFor(array, 2);
        Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelForDegreeOfParallelism(array, 2);
        Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallel(array, 2);
        Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMax(array, 2);
        Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMaxHalfParallelism(array, 2);
        Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelFalseSharing(array, 2);
        Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
    }
}

static void Serial(double[] array, double factor)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i] * factor;
    }
}

static void ParallelFor(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, i => { array[i] = array[i] * factor; });
}

static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i => { array[i] = array[i] * factor; });
}

static void CustomParallel(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMax(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount / 2;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelFalseSharing(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    int i = -1;

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                int j = Interlocked.Increment(ref i);
                while (j < array.Length)
                {
                    array[j] = array[j] * factor;
                    j = Interlocked.Increment(ref i);
                }
            });
    }

    Task.WaitAll(tasks);
}

Sample output:

Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
