Use of OpenMP chunk to break cache

Problem Description

I've been trying to increase the performance of my OpenMP solution, which often has to deal with nested loops over arrays. Although I've managed to bring it down from the serial implementation's 59 seconds to 37 seconds (on an ageing dual-core Intel T6600), I'm worried that cache synchronization gets a lot of CPU attention (when the CPU should be solving my problem!). I've been fighting to set up the profiler, so I haven't verified that claim, but my question stands regardless. According to this lecture on load balancing:

Instead of doing work, the CPUs are busy fighting over the only used cache line in the program. You can fix this with a very strange technique: move the CPUs' data apart in memory, further than one cache line. For example, here we move the integers accessed by each thread 20 units apart.

The lecture then goes on to provide the relevant source code (it is supposed to run on a quad-core, hence the %4):

#pragma omp parallel for schedule(static, 1)
for (unsigned int i = 0; i < n; i++) {
    arr[(i % 4) * 20]++;
}

That said, I have an intuition about what the "chunk" is, but the above implementation seems to completely ignore it, leading me to believe my intuition is wrong.

My question is this: does setting a reasonably large chunk value move data further down the cache lines? I.e., wouldn't the code above be equivalent to:

#pragma omp parallel for schedule(static, 20)
for (unsigned int i = 0; i < n; i++) {
    arr[i]++;
}

Solution

The two code snippets you've given are not equivalent, as the first one would keep iterating over the same four elements for n larger than 4. The proper way to process such arrays is to ensure that sizeof(arr[0]) * n / #cores is a multiple of the cache line size. Modern x86 CPUs have a cache line size of 64 bytes. If arr is an array of int or single-precision floating point values, then sizeof(arr[0]) == 4 and a single cache line holds 16 elements. For data types of double that size a single cache line holds 8 elements. The best loop schedule chunk size is then a multiple of 16 in the former case or of 8 in the latter case.
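
To make that rule concrete, here is a minimal sketch (the 64-byte line and 4-byte int are assumptions about the target machine; the array size and the names N, ELEMS_PER_LINE and increment_all are made up for illustration) in which the chunk size is a whole number of cache lines:

#define N                1024
#define ELEMS_PER_LINE   16    // assumed: 64-byte cache line / 4-byte int

int arr[N];

void increment_all(void)
{
   // the chunk covers whole cache lines, so two threads never touch
   // the same cache line at the boundary between their blocks
   #pragma omp parallel for schedule(static, ELEMS_PER_LINE)
   for (int i = 0; i < N; i++)
      arr[i]++;
}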

When dealing with statically scheduled loops one tries to maximise the chunk size in order to reduce the number of separate iteration blocks run by each thread. For example, if there are 4 threads, n is 64 and the chunk size is set to 8, the following schedule will be used:

thread    iterations (from-to)
------    --------------------
  0          0- 7, 32-39
  1          8-15, 40-47
  2         16-23, 48-55
  3         24-31, 56-63

This is far from optimal since each thread has to work through two separate blocks of iterations. A much better solution would be to have the chunk size set to 16 (which is a multiple of 8):

thread    iterations (from-to)
------    --------------------
  0             0-15
  1            16-31
  2            32-47
  3            48-63

Note that the default chunk size for statically scheduled loops is #iterations / #threads.
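
If you want to verify the mapping on your own machine, a small test program such as the following sketch prints which thread runs which iteration (num_threads(4) is only there to reproduce the 4-thread tables above):

#include <stdio.h>
#include <omp.h>

int main(void)
{
   // with 4 threads and a chunk size of 16, each thread should report
   // one contiguous block of 16 iterations, as in the second table
   #pragma omp parallel for schedule(static, 16) num_threads(4)
   for (int i = 0; i < 64; i++)
      printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
   return 0;
}

The lines are printed in whatever order the threads happen to reach printf, but the iteration-to-thread mapping itself follows the schedule.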

Sometimes one has to process in parallel data that cannot be spread among non-overlapping cache lines. For example, arr[] might simply be an array of 4 elements that all fit into a single cache line. In this case one should insert padding between the array elements in order to make sure that data being processed by different threads lies in different cache lines. For example:

int arr[4];

#pragma omp parallel for
for (int i = 0; i < 4; i++)
   arr[i]++;

int arr[4] results in the following memory layout:

|<-------- a single cache line ---------->|
| arr[0] | arr[1] | arr[2] | arr[3] | ... |

If core 0 updates arr[0] and core 1 updates arr[1], then the cache line would constantly bounce between the two cores - false sharing and bad performance. Therefore one has to insert padding between arr[0] and arr[1] of size CLS - sizeof(arr[0]) bytes or CLS/sizeof(arr[0]) - 1 array elements, where CLS is the size of the cache line in bytes. With CLS == 64 and sizeof(arr[0]) == 4 this makes 15 elements of padding. The resulting layout would be:

|<----- one cache line ------>|<--- another cache line ---->|<-- yet another ...
| arr[0] | 15 unused elements | arr[1] | 15 unused elements | arr[2] | ...

The code snippet should be modified as follows:

// cache line size in number of int elements
#define CLS    (64/sizeof(int))

int arr[4*CLS];

#pragma omp parallel for
for (int i = 0; i < 4; i++)
   arr[i*CLS]++;

Another option that would somewhat simplify the code would be to wrap each data element in a structure and put the padding inside the structure:

// cache line size in number of bytes
#define CLS    (64)

typedef struct _item
{
   int data;
   int padding[CLS/sizeof(int)-1];
} item;

item arr[4];

#pragma omp parallel for
for (int i = 0; i < 4; i++)
   arr[i].data++;

No matter which method you use, bear in mind that such code becomes non-portable as various architectures have differing cache line sizes.
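
On Linux with glibc you could, for instance, query the line size at run time with sysconf and derive the padding from that instead of hard-coding 64. This is only a sketch of that idea: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension rather than portable C, and a run-time value only helps if the padded layout is built at run time (e.g. for dynamically allocated arrays):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
   // glibc-specific query; returns 0 or -1 if the value is unknown
   long cls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
   if (cls <= 0)
      cls = 64;   // fall back to a common cache line size
   printf("padding data to %ld-byte cache lines\n", cls);
   return 0;
}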
