OpenMP: poor performance of heap arrays (stack arrays work fine)


Question


I am a fairly experienced OpenMP user, but I have just run into a puzzling problem, and I am hopeful that someone here could help. The problem is that a simple hashing algorithm performs well for stack-allocated arrays, but poorly for arrays on the heap.


Example below uses i%M (i modulus M) to count every M-th integer in respective array element. For simplicity, imagine N=1000000, M=10. If N%M==0, then the result should be that every element of bins[] is equal to N/M:

#pragma omp for
  for (int i=0; i<N; i++) 
    bins[ i%M ]++;


Array bins[] is private to each thread (I sum results of all threads in a critical section afterwards).


When bins[] is allocated on the stack, the program works great, with performance scaling proportionally to the number of cores.


However, if bins[] is on the heap (pointer to bins[] is on the stack), performance drops drastically. And that is a major problem!


I want to parallelize binning (hashing) of certain data into heap arrays with OpenMP, and this is a major performance hit.


It is definitely not something silly like all threads trying to write into the same area of memory. That is because each thread has its own bins[] array, results are correct with both heap- and stack-allocated bins, and there is no difference in performance for single-thread runs. I reproduced the problem on different hardware (Intel Xeon and AMD Opteron), with GCC and Intel C++ compilers. All tests were on Linux (Ubuntu and RedHat).


There seems to be no reason why good performance of OpenMP should be limited to stack arrays.


Any guesses? Maybe access of threads to the heap goes through some kind of shared gateway on Linux? How do I fix that?

The full program is below:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(const int argc, const char* argv[])
{
  const int N=1024*1024*1024;
  const int M=4;
  double t1, t2;
  int checksum=0;

  printf("OpenMP threads: %d\n", omp_get_max_threads());

  //////////////////////////////////////////////////////////////////
  // Case 1: stack-allocated array
  t1=omp_get_wtime();
  checksum=0;
#pragma omp parallel
  { // Each openmp thread should have a private copy of 
    // bins_thread_stack on the stack:
    int bins_thread_stack[M];
    for (int j=0; j<M; j++) bins_thread_stack[j]=0;
#pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_stack[j]++;
      }
#pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_stack[j];
  }
  t2=omp_get_wtime();
  printf("Time with stack array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  //////////////////////////////////////////////////////////////////
  // Case 2: heap-allocated array
  t1=omp_get_wtime();
  checksum=0;
  #pragma omp parallel 
  { // Each openmp thread should have a private copy of 
    // bins_thread_heap on the heap:
    int* bins_thread_heap=(int*)malloc(sizeof(int)*M); 
    for (int j=0; j<M; j++) bins_thread_heap[j]=0;
  #pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_heap[j]++;
      }
  #pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
    free(bins_thread_heap);
  }
  t2=omp_get_wtime();
  printf("Time with heap  array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  return 0;
}


Sample outputs of the program are below:


for OMP_NUM_THREADS=1

OpenMP threads: 1
Time with stack array: 2.973 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 3.091 sec, checksum=1073741824 (must be 1073741824).


and for OMP_NUM_THREADS=10

OpenMP threads: 10
Time with stack array: 0.329 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 2.150 sec, checksum=1073741824 (must be 1073741824).

Many thanks for your help!

Answer


This is a cute problem: with the code as above (gcc4.4, Intel i7) with 4 threads I get

OpenMP threads: 4
Time with stack array:        1.696 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array:        5.413 sec, checksum=1073741824 (must be 1073741824).


but if I change the malloc line to

    int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);

(UPDATE: or even

    int* bins_thread_heap=(int*)malloc(sizeof(int)*16);

)

then I get

OpenMP threads: 4
Time with stack array:        1.578 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array:        1.574 sec, checksum=1073741824 (must be 1073741824).


The problem here is false sharing. The default malloc is being very (space-) efficient, and putting the requested small allocations all in one block of memory, next to each other; but since the allocations are so small that multiple fit in the same cache line, that means every time one thread updates its values, it dirties the cache line of the values in neighbouring threads. By making the requested memory large enough, this is no longer an issue.


Incidentally, it should be clear why the stack-allocated case does not see this problem; different threads - different stacks - memory far enough apart that false sharing isn't an issue.


As a side point -- it doesn't really matter for M of the size you're using here, but if your M (or number of threads) were larger, the omp critical would be a big serial bottleneck; you can use OpenMP reductions to sum the checksum more efficiently

#pragma omp parallel reduction(+:checksum)
    { // Each openmp thread should have a private copy of 
        // bins_thread_heap on the heap:
        int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
        for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
        for (int i=0; i<N; i++)
        { // Accumulating every M-th number in respective array element
            const int j=i%M;
            bins_thread_heap[j]++;
        }
        for (int j=0; j<M; j++)
            checksum+=bins_thread_heap[j];
        free(bins_thread_heap);
}

