Why is Having More Threads than Cores Faster?

Question

I've implemented a multithreaded version of PageRank and I'm running it on a 4-core Q6600. When I run it with 4 threads, I get:

real    6.968s
user   26.020s
sys     0.050s

When I run with 128 threads I get:

real    0.545s
user    1.330s
sys     0.040s

This makes no sense to me. The basic algorithm is a sum-reduce:

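  1. All threads sum a portion of the input;
  2. Synchronize;
  3. Each thread then accumulates a portion of the results from the other threads;
  4. The main thread sums an intermediate value from all threads and then determines whether to continue.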
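
For readers unfamiliar with the pattern, the steps above can be sketched roughly as follows. This is not the asker's code, which was not posted; `parallel_sum` is a hypothetical helper, and step 3 is simplified so that the main thread combines the partial results directly after the join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): sums `data` with
// `num_threads` worker threads in a sum-reduce style.
double parallel_sum(const std::vector<double>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Ceiling division so every element belongs to exactly one slice.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, t, chunk] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            double local = 0.0;  // step 1: sum this thread's slice locally
            for (std::size_t i = begin; i < end; ++i) {
                local += data[i];
            }
            partial[t] = local;  // publish the partial result once
        });
    }

    // Step 2: synchronize by waiting for every worker to finish.
    for (auto& w : workers) {
        w.join();
    }

    // Steps 3-4 (simplified): the main thread combines the partial sums.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `parallel_sum(values, 4)` and `parallel_sum(values, 128)` on the same input should give the same total; only the way the work is sliced changes.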

Profiling hasn't helped. I'm not sure what data would be helpful to understand my code - please just ask.

This is really confusing me.

Recommended answer

Deliberately creating more threads than processors is a standard technique for making use of "spare cycles": whenever a thread is blocked waiting for something (I/O, a mutex, or anything else), the processor has other useful work it can run in the meantime.

If your threads are doing I/O then this is a strong contender for the speed-up: as each thread blocks waiting for the I/O, the processor can run the other threads until they too block for I/O, hopefully by which time the data for the first thread is ready, and so forth.
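
The effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: `fake_blocking_io` is a hypothetical stand-in that sleeps for 20 ms instead of performing real I/O, and `run_io_tasks` is an invented helper that returns the wall-clock time needed to finish a fixed number of such tasks.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a blocking read: the thread sleeps and
// uses no CPU, much like a thread waiting on real I/O.
void fake_blocking_io() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}

// Runs `num_tasks` blocking tasks on `num_threads` threads and returns
// the elapsed wall-clock time in seconds.
double run_io_tasks(std::size_t num_threads, std::size_t num_tasks) {
    std::atomic<std::size_t> next{0};
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&next, num_tasks] {
            // Each worker claims the next unclaimed task until none remain.
            while (next.fetch_add(1) < num_tasks) {
                fake_blocking_io();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With 32 tasks, `run_io_tasks(4, 32)` has to wait through roughly eight rounds of blocking, whereas `run_io_tasks(16, 32)` needs about two, even on a 4-core machine, because sleeping threads do not occupy a core.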

Another possible cause of the speed-up is that your threads are experiencing false sharing. If two threads write to different values on the same cache line (e.g. adjacent elements of an array), the cores stall while the cache line is transferred back and forth between them. Adding more threads decreases the likelihood that they are operating on adjacent elements, and thus reduces the chance of false sharing. You can easily test this by padding your data elements so that each is at least 64 bytes in size (the typical cache line size). If your 4-thread code speeds up, this was the problem.
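
A minimal sketch of that padding test might look like the following. The names `PackedCounter`, `PaddedCounter`, and `run_counters` are hypothetical, the 64-byte line size is an assumption that should be checked for the target CPU, and `volatile` is used only to keep the compiler from collapsing the loop into a single addition.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache line size; 64 bytes is typical but not universal.
constexpr std::size_t kAssumedCacheLine = 64;

// Adjacent elements of this type share a cache line.
struct PackedCounter {
    volatile long value = 0;
};

// Padding makes each element 64 bytes, so consecutive values are
// 64 bytes apart and never land on the same 64-byte line.
struct PaddedCounter {
    volatile long value = 0;
    char pad[kAssumedCacheLine - sizeof(long)];
};
static_assert(sizeof(PaddedCounter) >= kAssumedCacheLine,
              "padded element should fill an assumed cache line");

// Each thread increments only its own array element `iterations` times;
// the return value is the sum of all counters.
template <typename Counter>
long run_counters(std::size_t num_threads, long iterations) {
    assert(num_threads > 0);
    std::vector<Counter> counters(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (long i = 0; i < iterations; ++i) {
                counters[t].value = counters[t].value + 1;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    long total = 0;
    for (const auto& c : counters) {
        total += c.value;
    }
    return total;
}
```

Timing `run_counters<PackedCounter>(4, n)` against `run_counters<PaddedCounter>(4, n)` for a large `n` gives a rough indication of how much false sharing costs on a given machine; both calls compute the same total.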
