OpenMP code stalling


Problem description

Hello everyone,

I have a computer vision algorithm in which the convolution step must run in parallel on multicore processors using OpenMP. The code is of the form shown below.

void convolve(Bitmap &src, Kernel &kernel, Bitmap &dst)
{
   // width and height are the image dimensions
   #pragma omp parallel for
   for (int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
         kernel.response(src, x, y, dst);
      }
   }
}





The kernel is an interface

class Kernel {
public:
    virtual ~Kernel() {}
    virtual void response(Bitmap &src, int x, int y, Bitmap &dst) = 0;
};





The problem is that the implementations of Kernel are not known to the compiler, and they can be complex. So is this code capable of being parallelised using OpenMP?

If I run the code, it actually runs slower than the serial version, and it visibly stalls when running real-time image recognition. NOTE: this is not the case when OpenMP is disabled. I'm using Visual Studio Express 2013 with "/openmp" enabled.



EDIT:

The example below shows a Kernel implementation for an algorithm that converts a bitmap to gray scale. The convolution operator is an inherently parallel problem, as are most computer vision algorithms, yet parallelizing such algorithms with shared memory is hard, especially because of race conditions. Maybe I should consider using GPUs for hardware acceleration of my algorithms. NOTE: the code runs in real time even on a single core, but I wanted to speed things up for mobile platforms, as the code is to be run on mobile devices for Augmented Reality apps, so multicore is the way to go.

Mostly, if I use "omp parallel for shared(dst)" the code runs fast with minimal stalling, but I feel like the race conditions are still there. Is there a hardware implementation for avoiding false sharing and race conditions without using the expensive "critical"? I tried "atomic", but it only works for primitive operations such as addition. And why is the assignment operator not supported by "atomic"?
I also just recently came across OpenMP and was excited about it until these issues came around :-(

// Kernel for changing bitmap to gray scale
class GrayScale : public Kernel {
public:
    void response(Bitmap &src, int x, int y, Bitmap &dst)
    {
        Pixel &sp = src.pData[x + y * src.width];
        Pixel &dp = dst.pData[x + y * dst.width];

        unsigned char gray = static_cast<unsigned char>(0.3 * sp.red + 0.5 * sp.green + 0.2 * sp.blue);

        #pragma omp critical
        {
            dp.red = dp.green = dp.blue = gray;
        }
    }
};





where

// Pixel data for ARGB_8888 bitmap format
struct Pixel {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
    unsigned char alpha;
};
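
One observation on the gray-scale example above: each (x, y) iteration reads and writes only its own pixel, so there is no shared write for "critical" to protect, and the critical section serializes every store. Below is a minimal sketch of the same kernel without any synchronization directive (the class name GrayScaleNoSync is just for illustration; it assumes the Bitmap and Pixel types shown above):

// Minimal sketch (assumption): each iteration writes a distinct dst pixel,
// so plain stores are safe and no critical/atomic is needed.
class GrayScaleNoSync : public Kernel {
public:
    void response(Bitmap &src, int x, int y, Bitmap &dst)
    {
        Pixel &sp = src.pData[x + y * src.width];
        Pixel &dp = dst.pData[x + y * dst.width];

        // Same weighting as the original example.
        unsigned char gray = static_cast<unsigned char>(0.3 * sp.red + 0.5 * sp.green + 0.2 * sp.blue);

        // No other thread touches this pixel, so unsynchronized stores are fine.
        dp.red = dp.green = dp.blue = gray;
    }
};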

Recommended answer

Your for loop is probably too fine-grained.

Try breaking the work up into chunks.

void convolve_inner(Bitmap &src, Kernel &kernel, Bitmap &dst, int low, int high)
{
   // Process one stripe or band of image data (rows [low, high)).
   for (int y = low; y < high; ++y)
   {
      for (int x = 0; x < width; ++x)
      {
         kernel.response(src, x, y, dst);
      }
   }
}

void convolve(Bitmap &src, Kernel &kernel, Bitmap &dst)
{
   // Split the rows into 8 coarse chunks; the bounds are the image height,
   // since convolve_inner walks rows.
   int chunk = (height + 7) / 8;
   #pragma omp parallel for
   for (int low = 0; low < height; low += chunk)
   {
      int high = low + chunk;
      if (high > height) high = height;
      convolve_inner(src, kernel, dst, low, high);
   }
}
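
As a sketch of an alternative (this is an assumption of mine, not part of the original answer): the same coarsening can usually be expressed by keeping the original row loop and giving it an explicit static schedule with a large chunk size, which hands each thread a band of rows at a time:

// Sketch: coarse static scheduling on the original loop.
// The chunk size of roughly height/8 is an arbitrary illustrative choice.
void convolve(Bitmap &src, Kernel &kernel, Bitmap &dst)
{
   int chunk = (height + 7) / 8;
   #pragma omp parallel for schedule(static, chunk)
   for (int y = 0; y < height; ++y)
   {
      for (int x = 0; x < width; ++x)
      {
         kernel.response(src, x, y, dst);
      }
   }
}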

