Efficiency of Multithreaded Loops


Problem Description



Greetings noble community,

I want to have the following loop:

for(i = 0; i < MAX; i++)
    A[i] = B[i] + C[i];

This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.

(for simplicity, assume MAX is a multiple of 4)

Option 1:

for(i = tid; i < MAX; i += 4)
    A[i] = B[i] + C[i];

Option 2:

for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
    A[i] = B[i] + C[i];
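
(For reference, a minimal pthreads harness along these lines could drive either option; the harness is hypothetical, since thread creation isn't shown above. Here it runs Option 2:)

#include <pthread.h>

#define MAX 1024      /* assume a multiple of 4 */

static int A[MAX], B[MAX], C[MAX];

/* Worker for Option 2: each thread handles one contiguous quarter. */
static void *worker(void *arg)
{
    long tid = (long)arg;           /* thread id: 0..3 */
    for (long i = tid * (MAX / 4); i < (tid + 1) * (MAX / 4); i++)
        A[i] = B[i] + C[i];
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long tid = 0; tid < 4; tid++)
        pthread_create(&t[tid], NULL, worker, (void *)tid);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}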

My question is: is one more efficient than the other, and why?

Solution

The second one is better than the first. Simple answer: the second one minimizes false sharing.

Modern CPUs don't load bytes into the cache one by one; they read memory in batches called cache lines. When two threads modify different variables that sit on the same cache line, each core must reload the whole line after the other one writes to it.
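
To make that concrete for the two loops above: assuming a (typical, but here hypothetical) 64-byte cache line and 4-byte int, 16 consecutive elements of A share one line. A quick sketch of which line each thread writes:

#include <stdio.h>

#define CACHE_LINE 64                        /* assumed line size, bytes */
#define PER_LINE  (CACHE_LINE / sizeof(int)) /* 16 ints per line         */
#define MAX 1024

int main(void)
{
    /* Option 1: in the same iteration, the four threads write
     * A[i], A[i+1], A[i+2], A[i+3] -- always the same cache line. */
    for (long tid = 0; tid < 4; tid++)
        printf("option 1, thread %ld writes A[%ld] -> line %zu\n",
               tid, tid, tid / PER_LINE);

    /* Option 2: the threads start MAX/4 = 256 elements apart, so they
     * can only share a line right at the block boundaries. */
    for (long tid = 0; tid < 4; tid++)
        printf("option 2, thread %ld writes A[%ld] -> line %zu\n",
               tid, tid * (MAX / 4), tid * (MAX / 4) / PER_LINE);
    return 0;
}

Under Option 1 all four threads fight over every line of A on every iteration; under Option 2 they can only collide on the single line straddling each block boundary.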

When would this happen?

Basically, elements that are nearby in memory end up in the same cache line. So neighboring elements of an array will share a cache line, since an array is just a contiguous chunk of memory. Likewise, foo1 and foo2 below might end up in the same cache line, since they are declared next to each other in the same class.

class Foo {
    private int foo1;
    private int foo2;
}

How bad is false sharing?

I'll borrow Example 6 from the Gallery of Processor Cache Effects:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done. On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
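
(A rough C/pthreads translation of that experiment, for anyone who wants to try it; the original is C#, and the timing harness here is my own sketch:)

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* volatile keeps the compiler from collapsing the loop into one store */
static volatile int s_counter[1024];

static void *update_counter(void *arg)
{
    long position = (long)arg;
    for (int j = 0; j < 100000000; j++)
        s_counter[position] = s_counter[position] + 3;
    return NULL;
}

static double run(const long positions[4])
{
    pthread_t t[4];
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, update_counter, (void *)positions[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    const long same_line[4] = {0, 1, 2, 3};      /* one cache line  */
    const long spread[4]    = {16, 32, 48, 64};  /* separate lines  */

    printf("positions 0,1,2,3:     %.2f s\n", run(same_line));
    printf("positions 16,32,48,64: %.2f s\n", run(spread));
    return 0;
}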

How to detect false sharing?

Linux perf can be used to detect cache misses and therefore help you analyze such problems.

Following the analysis in CPU Cache Effects and Linux Perf, which uses perf to measure L1 cache misses on almost the same code example as above:

Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses     #    1.54% of all L1-dcache hits   [51.24%]

Performance counter stats for './cache_line_test 16 32 48 64':
  36,992 L1-dcache-load-misses     #    0.01% of all L1-dcache hits   [50.51%]
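
(Counter stats like these come from perf stat; an invocation along the following lines should reproduce them, though event names vary across kernels and CPUs:)

perf stat -e L1-dcache-load-misses,L1-dcache-loads ./cache_line_test 0 1 2 3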

This shows that total L1 cache misses drop from 10,055,747 to 36,992 once false sharing is gone. And the performance overhead isn't in the L1 misses themselves; after false sharing, it's in the cascade of L2 and L3 cache loads and, ultimately, loads from main memory.

Are there good practices in industry?

LMAX Disruptor is a high-performance inter-thread messaging library, and it's the default messaging system for intra-worker communication in Apache Storm. The underlying data structure is a simple ring buffer, but to make it fast, it uses a lot of tricks to reduce false sharing.

For example, it defines the superclass RingBufferPad to create padding between elements in the RingBuffer:

abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

Also, when it allocates memory for the buffer, it creates padding both at the front and at the tail, so that the buffer won't be affected by data in adjacent memory:

this.entries   = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];

source (RingBuffer.java#L76 in the Disruptor repository)
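
(The same padding trick can be sketched in C; names here are hypothetical, and a 64-byte cache line is assumed. Pad each hot slot out to its own line:)

#include <stdio.h>

#define CACHE_LINE 64  /* assumed line size */

/* Pad each slot to a full cache line so counters[i] and counters[i+1]
 * can never share a line, no matter how threads are assigned to them. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* aligned() is a GCC/Clang extension; C11 offers _Alignas instead. */
static struct padded_counter counters[4] __attribute__((aligned(CACHE_LINE)));

int main(void)
{
    printf("sizeof(struct padded_counter) = %zu\n",
           sizeof(struct padded_counter));  /* prints 64 */
    return 0;
}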

You probably want to learn more about all of these magic tricks. Take a look at one of the author's posts: Dissecting the Disruptor: Why it's so fast.
