Why are 50 threads faster than 4?

Question

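// MAX_THREADS is the number of threads: 4 for test1, 50 for test2.
// The 800000000 iterations in total are split evenly across the threads.
DWORD WINAPI MyThreadFunction(LPVOID lpParam) {
    volatile auto x = 1;  // volatile keeps the loop from being optimised away
    for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
        x += i / 3;
    }
    return 0;
}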

This function is run in MAX_THREADS threads.
I have run the tests on Intel Core 2 Duo, Windows 7, MS Visual Studio 2012 using Concurrency Visualizer with MAX_THREADS=4 and MAX_THREADS=50.
test1 (4 threads) completed in 7.1 seconds, but test2 (50 threads) completed in 5.8 seconds while test1 has more context switches than test2.
I have run the same tests on Intel Core i5, Mac OS 10.7.5 and got the same results.
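The question does not show the code that starts and times the threads. A minimal portable sketch of such a harness follows, assuming `std::thread` and `std::chrono` in place of the Win32 calls the asker presumably used. The names `worker` and `run_test` are hypothetical, and the accumulator is widened to 64 bits so that the sum cannot overflow.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Same busy loop as MyThreadFunction, with the iteration count passed in.
// (Hypothetical helper; the original reads the count from MAX_THREADS.)
void worker(std::int64_t iterations) {
    volatile std::int64_t x = 1;  // volatile keeps the compiler from deleting the loop
    for (std::int64_t i = 0; i < iterations; ++i) {
        x = x + i / 3;
    }
}

// Splits total_iterations evenly over thread_count threads and returns the
// wall-clock seconds that pass until the last thread has finished.
double run_test(int thread_count, std::int64_t total_iterations) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(thread_count);
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back(worker, total_iterations / thread_count);
    }
    for (auto& th : threads) {
        th.join();  // the test is only done when every thread is done
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Under these assumptions, `run_test(4, 800000000)` and `run_test(50, 800000000)` would correspond to test1 and test2.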

Recommended answer

I decided to benchmark this myself on my 4-core machine. I directly compared 4 threads with 50 threads by interleaving 100 tests of each. I used my own numbers so that I had a reasonable execution time for each task.
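The answer's benchmark code is not shown, so the following is only a sketch of what "interleaving" could look like: the two configurations are run alternately, so that a burst of background activity affects both sample sets rather than just one. `Samples`, `time_once` and `interleave` are hypothetical names.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Timings, in seconds, collected for the two configurations being compared.
struct Samples {
    std::vector<double> a;
    std::vector<double> b;
};

// Returns the wall-clock seconds taken by one call of task.
double time_once(const std::function<void()>& task) {
    const auto start = std::chrono::steady_clock::now();
    task();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Runs a, b, a, b, ... for the given number of rounds instead of all of a
// followed by all of b.
Samples interleave(const std::function<void()>& a,
                   const std::function<void()>& b,
                   int rounds) {
    Samples samples;
    for (int r = 0; r < rounds; ++r) {
        samples.a.push_back(time_once(a));
        samples.b.push_back(time_once(b));
    }
    return samples;
}
```

Here `a` would be a callable that runs the 4-thread version and `b` one that runs the 50-thread version, with `rounds` set to 100 as in the answer.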

The result was as you described. The 50-thread version is marginally faster. Here is a box plot of my results:

Why? I think this comes down to thread scheduling. The task is not complete until all threads have finished their work, and in the 4-thread test each thread must do a quarter of the job. Because your process shares the machine with other processes, if any single thread is switched out in favour of another process, the entire task is delayed. While we wait for the last thread to finish, all the other cores sit idle. Note how the time distribution of the 4-thread test is much wider than that of the 50-thread test, which is what we might expect.

When you use 50 threads, each thread has less to do. Because of this, any delays in a single thread will have a less significant effect on the total time. When the scheduler is busy rationing cores out to lots of short threads, a delay on one core can be compensated by giving these threads time on another core. The total effect of latency on one core is not as much of a show-stopper.
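This argument can be illustrated with a toy model. It is a deliberate simplification and not how the Windows scheduler actually works: each piece of work runs to completion on whichever core frees up first, and core 0 is assumed to be busy with some other process for the first `delay` time units. The function name and all the numbers are made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy model: total_work is cut into `chunks` near-equal pieces (one per
// thread). Each piece goes to the core that becomes free earliest. Core 0
// only becomes available at time `delay`. Returns the time at which the
// last piece finishes.
std::int64_t makespan(std::int64_t total_work, int chunks, int cores,
                      std::int64_t delay) {
    std::vector<std::int64_t> free_at(cores, 0);
    free_at[0] = delay;  // the one delayed core
    for (int i = 0; i < chunks; ++i) {
        // Spread any remainder over the first few pieces.
        const std::int64_t size =
            total_work / chunks + (i < total_work % chunks ? 1 : 0);
        auto next = std::min_element(free_at.begin(), free_at.end());
        *next += size;
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

With 800 units of work, 4 cores and a delay of 40, `makespan(800, 4, 4, 40)` gives 240: the delayed core still has to do a full quarter of the work, so the whole delay lands on the total. `makespan(800, 50, 4, 40)` gives 216, because the other three cores pick up some of the small pieces while core 0 is unavailable.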

So it would seem that in this case the extra context-switching is not the biggest factor. While the gain is small, it appears to be beneficial to swamp the thread scheduler a little bit, given that the processing is much more significant than the context-switching. As with everything, you must find the correct balance for your application.

[edit] Out of curiosity I ran a test overnight while my computer wasn't doing much else. This time I used 200 samples per test. Again, tests were interleaved to reduce the impact of any localised background tasks.

The first plot of these results is for low thread-counts (up to 3 times the number of cores). You can see how some choices of thread count are quite poor... That is, anything that is not a multiple of the number of cores, and especially odd values.
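The answer does not explain why multiples of the core count do better. One possible reading, offered here as an assumption, is that equal-sized threads fill the cores in whole "passes", and a thread count that is not a multiple of the core count leaves cores empty in the final pass. The sketch below counts those empty slots under the crude assumption that each core runs one thread to completion per pass. A real scheduler time-slices, so this overstates the penalty, and it does not account for the extra penalty seen for odd counts.

```cpp
// Number of passes needed if each core runs one thread to completion per pass.
int passes(int threads, int cores) {
    return (threads + cores - 1) / cores;  // ceiling of threads / cores
}

// Core slots left empty in the final pass. Zero means every core has a
// thread to run right up to the end.
int idle_slots(int threads, int cores) {
    return passes(threads, cores) * cores - threads;
}
```

On 4 cores, `idle_slots(4, 4)` and `idle_slots(8, 4)` are 0, while `idle_slots(5, 4)` is 3: the fifth thread runs in a second pass on its own while three cores wait.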

The second plot is for higher thread-counts (from 3 times the number of cores up to 60).

Above, you can see a definite downward trend as the thread-count increases. You can also see the spread of results narrow as the thread-count increases.

In this test, it's interesting to note that the performance of the 4-thread and 50-thread tests was about the same, and the spread of results in the 4-thread test was not as wide as in my original test. Because the computer wasn't doing much else, it could dedicate its time to the tests. It would be interesting to repeat the test while placing one core under 75% load.

And just to keep things in perspective, consider this:

[Another edit] After posting my last lot of results, I noticed that the jumbled box plot showed a trend for those tests that were multiples of 4, but the data was a little hard to see.

I decided to do a test with only multiples of four, and thought I may as well find the point of diminishing returns at the same time. So I used thread counts that are powers of 2, up to 1024. I would have gone higher, but Windows bugged out at around 1400 threads.

The result is rather nice, I think. In case you wonder what the little circles are, those are the median values. I chose them instead of the red line that I used previously because they show the trend more clearly.
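For reference, the median of a set of timing samples can be computed as in the sketch below. It assumes the samples are held in a `std::vector<double>`; the answer does not say what tool produced the plots. The median suits this kind of data because a single run disturbed by a background task barely moves it.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Median of the samples: the middle value once sorted, or the mean of the
// two middle values when the count is even. Takes a copy so that the
// caller's data keeps its original order.
double median(std::vector<double> samples) {
    if (samples.empty()) {
        throw std::invalid_argument("median of no samples");
    }
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    if (samples.size() % 2 == 1) {
        return samples[mid];
    }
    return (samples[mid - 1] + samples[mid]) / 2.0;
}
```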

It seems that in this particular case, the pay dirt lies somewhere between 50 and 150 threads. After that, the benefit quickly drops away, and we're entering the territory of excessive thread management and context-switching.

The results might vary significantly with a longer or shorter task. In this case, it was a task involving a lot of pointless arithmetic which took approximately 18 seconds to compute on a single core.

By tuning only the number of threads, I was able to shave an extra 1.5% to 2% off the median execution time of the 4-thread version.
