Disappointing performance with Parallel.For

Question

I am trying to speed up my calculation times by using Parallel.For. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for loop. Is this as good as it can get with Parallel.For, or can the method call be fine-tuned to increase performance?

Here is my test code, sequential:

var loops = 200;
var perloop = 10000000;

var sum = 0.0;
for (var k = 0; k < loops; ++k)
{
    var sumk = 0.0;
    for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
    sum += sumk;
}

And parallel:

sum = 0.0;
Parallel.For(0, loops,
                k =>
                    {
                        var sumk = 0.0;
                        for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
                        sum += sumk;
                    });

The loop that I am parallelizing involves computation with a "globally" defined variable, sum, but this should only amount to a tiny fraction of the total time spent within the parallelized loop.

In a Release build ("optimize code" flag set) the sequential for loop takes 33.7 s on my computer, whereas the Parallel.For loop takes 8.4 s, a performance ratio of only 4.0.
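The question does not show how the timings were taken. A minimal sketch of one way to measure them, assuming System.Diagnostics.Stopwatch is used; the names Timing, Seconds and Speedup are hypothetical and not part of the original code:

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Runs the action once and returns the elapsed wall-clock time in seconds.
    public static double Seconds(Action work)
    {
        var watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed.TotalSeconds;
    }

    // Performance ratio as used in the question: sequential time / parallel time.
    public static double Speedup(double sequentialSeconds, double parallelSeconds)
        => sequentialSeconds / parallelSeconds;

    static void Main()
    {
        // The two timings quoted above (33.7 s sequential, 8.4 s parallel).
        Console.WriteLine(Speedup(33.7, 8.4));
    }
}
```

Running each loop once before timing it keeps JIT compilation out of the measurement, which matters more for short runs than for the multi-second loops here.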

In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set

ParallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount

but to no avail. It is not clear to me why not all of the CPU power is assigned to the parallel calculation.
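For reference, MaxDegreeOfParallelism is an instance property, so the setting has to be placed on a ParallelOptions object that is then passed to Parallel.For. A minimal sketch of that call shape; OptionsDemo and Run are hypothetical names, and the loop body only counts iterations:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class OptionsDemo
{
    // Runs `loops` iterations with an explicit upper bound on concurrency
    // and returns how many iterations were executed.
    public static int Run(int loops)
    {
        var options = new ParallelOptions
        {
            // An upper bound on concurrent tasks, not a guarantee.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        var done = 0;
        Parallel.For(0, loops, options, k =>
        {
            // Atomic increment so concurrent iterations do not lose updates.
            Interlocked.Increment(ref done);
        });
        return done;
    }

    static void Main() => Console.WriteLine(Run(200));
}
```

The value only caps the number of concurrent operations, so setting it to the processor count would not be expected to raise utilization on its own.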

I have noticed that a similar question has been raised on SO before (http://stackoverflow.com/questions/7734295/how-to-get-max-performance-using-parallel-for-foreach-performance-timings-incl), with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.

Update

It was pointed out to me in some of the comments that the CPU I am using has only 4 physical cores, which appear to the system as 8 cores when hyper-threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.

With hyper-threading disabled, my calculations are now faster, both the parallel loop and the (what I thought was) sequential for loop. CPU utilization is up to approx. 45% (!!!) during the for loop and 100% during the Parallel.For loop.

Computation time is 15.6 s for the for loop (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For (about 25% better than with hyper-threading enabled). The performance ratio with Parallel.For is now only 2.5, running on 4 real cores.

So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand, it is intriguing that CPU utilization is so high during the for loop. Could there be some kind of internal parallelization going on in this loop as well?

Recommended answer

Using a global variable can introduce significant synchronization problems, even when you are not using locks. When you assign a value to the variable, each core has to get access to the same place in system memory, or wait for another core to finish before accessing it. You can avoid corruption without locks by using the lighter Interlocked.Add method to add a value to the sum atomically, but you will still get delays due to contention.
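One caveat for the code in the question: sum is a double, and Interlocked.Add only has integer overloads. A minimal sketch of a lock-free add for a double, assuming a retry loop built on Interlocked.CompareExchange; AtomicDouble, Add and SumHalves are hypothetical names used for illustration:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class AtomicDouble
{
    // Adds `value` to `target` atomically: retries until no other thread
    // changed `target` between our read and our write.
    public static double Add(ref double target, double value)
    {
        double current = Volatile.Read(ref target);
        while (true)
        {
            double updated = current + value;
            double seen = Interlocked.CompareExchange(ref target, updated, current);
            // Compare bit patterns so a NaN in `target` cannot spin forever.
            if (BitConverter.DoubleToInt64Bits(seen) == BitConverter.DoubleToInt64Bits(current))
                return updated;
            current = seen; // another thread won the race; retry with its value
        }
    }

    // Adds 0.5 from many parallel iterations into one shared variable.
    public static double SumHalves(int loops)
    {
        double sum = 0.0;
        Parallel.For(0, loops, k => { Add(ref sum, 0.5); });
        return sum;
    }

    static void Main() => Console.WriteLine(SumHalves(1000));
}
```

This removes the lost updates that a plain sum += sumk can suffer, but every iteration still competes for the same memory location, so the contention described above remains.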

The proper way to do this is to update a thread-local variable to create the partial sums and add all of them to a single global sum at the end. Parallel.For has an overload that does just this. MSDN even has an example using summation at How To: Write a Parallel.For Loop that has Thread Local Variables

        int[] nums = Enumerable.Range(0, 1000000).ToArray();
        long total = 0;

        // Use type parameter to make subtotal a long, not an int
        Parallel.For<long>(0, nums.Length, () => 0, (j, loop, subtotal) =>
        {
            subtotal += nums[j];
            return subtotal;
        },
            (x) => Interlocked.Add(ref total, x)
        );

Each thread updates its own subtotal value and updates the global total using Interlocked.Add when it finishes.
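Applied to the loop from the question, the same overload might look like the sketch below. It is an adaptation, not the asker's code: LocalSums and Sum are hypothetical names, the final merge uses a lock because Interlocked.Add has no double overload, and the inner loop starts at 1 instead of 0.

```csharp
using System;
using System.Threading.Tasks;

static class LocalSums
{
    public static double Sum(int loops, int perLoop)
    {
        var gate = new object();
        var sum = 0.0;

        Parallel.For<double>(0, loops,
            () => 0.0,                      // localInit: each task starts from zero
            (k, state, local) =>
            {
                var sumk = 0.0;
                // Starts at 1: with i = 0, (1.0 / i) * i evaluates to NaN.
                for (var i = 1; i <= perLoop; ++i) sumk += (1.0 / i) * i;
                return local + sumk;        // no shared state touched here
            },
            local =>
            {
                // localFinally: runs once per task, so the lock is rarely taken.
                lock (gate) { sum += local; }
            });

        return sum;
    }

    static void Main() => Console.WriteLine(Sum(200, 1000000));
}
```

The shared variable is now written once per task rather than once per iteration of the outer loop, which is the point of the thread-local overload.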
