Several arithmetic operations parallelized in C++ AMP

Question

I am trying to parallelize a convolution filter using C++ AMP. I would like to get the following function working, but I don't know how to do it properly:

float* pixel_color = new float[16];

concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array); 
concurrency::array_view<float, 1> pixel(16, pixel_color); // I don't know which data structure to use here

parallel_for_each(
      pixels.extent, [=](concurrency::index<2> idx) restrict(amp)
  {
      int row=idx[0];
      int col=idx[1];

      pixels(row, col) = taps(row, col) * pixels(row, col); 
      pixel[0] += pixels(row, col); 
     });
pixel.synchronize(); // synchronize the array_view, not the raw pointer

pixels_.at<Pixel>(j, i) = pixel_color 

}

The main problem is that I don't know how to use the pixel structure properly: which concurrent data structure should I use here, given that I don't need all 16 elements? I also don't know whether I can safely add the values this way. The code above doesn't work; it does not add the appropriate values to pixel[0]. I would also like to define

concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array); 

outside the method (for example in the header file) and initialize it in the constructor or another function, because this is a bottleneck: copying the data between the CPU and the GPU takes a lot of time. Does anybody know how to do this?

Solution

You're on the right track, but doing in-place manipulations of arrays on a GPU is tricky, as you cannot guarantee the order in which different elements are updated.
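For the 4x4 case in the question, one way to sidestep the race on pixel[0] is to let every index write its product into its own output slot and to sum the 16 products afterwards in a single place. The following is a minimal portable C++ sketch of that data flow, not C++ AMP code: ConvolveWindow and products are hypothetical names, and the plain loop stands in for the parallel_for_each lambda. In a C++ AMP version, products would be an array_view that is synchronized before the sum is taken.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Hypothetical helper, not part of the original post.
// Step 1: each index writes taps * pixels into its OWN slot of `products`,
//         so no two iterations ever touch the same element.
// Step 2: the 16 products are summed in one place after step 1 has finished.
float ConvolveWindow(const std::array<float, 16>& pixels,
                     const std::array<float, 16>& taps)
{
    std::array<float, 16> products{};

    // On the GPU this loop body would be the parallel_for_each lambda,
    // with i standing in for the flattened index<2>.
    for (std::size_t i = 0; i < products.size(); ++i)
        products[i] = taps[i] * pixels[i];

    // Serial reduction; it only reads `products`, so there is nothing to race on.
    return std::accumulate(products.begin(), products.end(), 0.0f);
}
```

The input pixels are never modified here, which also removes the in-place update from the original loop.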

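Here's an example of something very similar. The ApplyColorSimplifierTiledHelper method contains an AMP-restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. SimplifyIndexTiled calculates a new value for each pixel in destFrame based on the values of the pixels surrounding the corresponding pixel in srcFrame. Because each thread only reads from srcFrame and writes only its own element of destFrame, this avoids the race condition present in your code. Note that the per-pixel function shown below is SimplifyIndex, which takes a plain index<2> rather than a tiled_index.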

This code comes from the CodePlex site for the C++ AMP book. The Cartoonizer case study includes several examples of these sorts of image processing problems implemented in C++ AMP using arrays, textures, tiled and untiled kernels, and multiple GPUs. The C++ AMP book discusses the implementation in some detail.

void ApplyColorSimplifierTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame, UINT neighborWindow)
{
    const float_3 W(ImageUtils::W);

    assert(neighborWindow <= FrameProcessorAmp::MaxNeighborWindow);

    tiled_extent<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize>     
        computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain, [=, &srcFrame, &destFrame]
        (tiled_index<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize> idx) 
        restrict(amp)
    {
        SimplifyIndexTiled(srcFrame, destFrame, idx, neighborWindow, W);
    });
}

void SimplifyIndex(const array<ArgbPackedPixel, 2>& srcFrame, array<ArgbPackedPixel,
                   2>& destFrame, index<2> idx, 
                   UINT neighborWindow, const float_3& W) restrict(amp)
{
    const int shift = neighborWindow / 2;
    float sum = 0;
    float_3 partialSum;
    const float standardDeviation = 0.025f;
    const float k = -0.5f / (standardDeviation * standardDeviation);

    const int idxY = idx[0] + shift;         // Corrected index for border offset.
    const int idxX = idx[1] + shift;
    const int y_start = idxY - shift;
    const int y_end = idxY + shift;
    const int x_start = idxX - shift;
    const int x_end = idxX + shift;

    RgbPixel orgClr = UnpackPixel(srcFrame(idxY, idxX));

    for (int y = y_start; y <= y_end; ++y)
        for (int x = x_start; x <= x_end; ++x)
        {
            if (x != idxX || y != idxY) // don't apply filter to the requested index, only to the neighbors
            {
                RgbPixel clr = UnpackPixel(srcFrame(y, x));
                float distance = ImageUtils::GetDistance(orgClr, clr, W);
                float value = concurrency::fast_math::pow(float(M_E), k * distance * distance);
                sum += value;
                partialSum.r += clr.r * value;
                partialSum.g += clr.g * value;
                partialSum.b += clr.b * value;
            }
        }

    RgbPixel newClr;
    newClr.r = static_cast<UINT>(clamp(partialSum.r / sum, 0.0f, 255.0f));
    newClr.g = static_cast<UINT>(clamp(partialSum.g / sum, 0.0f, 255.0f));
    newClr.b = static_cast<UINT>(clamp(partialSum.b / sum, 0.0f, 255.0f));
    destFrame(idxY, idxX) = PackPixel(newClr);
}
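Read as a formula, the loop in SimplifyIndex computes a normalized weighted average over the neighbors of a pixel. The notation below is introduced here for illustration and does not appear in the original code: N is the set of positions in the (2·shift+1) by (2·shift+1) window around the pixel, excluding the pixel itself; d is whatever ImageUtils::GetDistance returns for the center color and the neighbor color; and c_ch is one color channel of the neighbor.

```latex
w_{y,x} = e^{\,k\,d_{y,x}^{2}}, \qquad k = -\frac{1}{2\sigma^{2}}, \qquad \sigma = 0.025

c'_{\mathrm{ch}} = \operatorname{clamp}\!\left(
    \frac{\sum_{(y,x)\in N} w_{y,x}\, c_{\mathrm{ch}}(y,x)}
         {\sum_{(y,x)\in N} w_{y,x}},\ 0,\ 255\right),
\qquad \mathrm{ch} \in \{r, g, b\}
```

Here the denominator corresponds to sum, the numerator to the matching component of partialSum, and the weight to value in the code.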

The code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long, as C++ AMP does not support char. If your problem is small enough to fit into a texture, you may want to use one instead of an array: pack/unpack is implemented in hardware on the GPU and so is effectively "free", whereas here you have to pay for it with additional compute. There is also an example of this implementation on CodePlex.

typedef unsigned long ArgbPackedPixel;

struct RgbPixel 
{
    unsigned int r;
    unsigned int g;
    unsigned int b;
};

const int fixedAlpha = 0xFF;

inline ArgbPackedPixel PackPixel(const RgbPixel& rgb) restrict(amp) 
{
    return (rgb.b | (rgb.g << 8) | (rgb.r << 16) | (fixedAlpha << 24));
}


inline RgbPixel UnpackPixel(const ArgbPackedPixel& packedArgb) restrict(amp) 
{
    RgbPixel rgb;
    rgb.b = packedArgb & 0xFF;
    rgb.g = (packedArgb & 0xFF00) >> 8;
    rgb.r = (packedArgb & 0xFF0000) >> 16;
    return rgb;
}
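The packing layout can be checked on the host without a GPU. The following is a sketch in portable C++ that mirrors PackPixel and UnpackPixel under a few assumptions: Rgb, PackArgb, UnpackArgb and kOpaqueAlpha are hypothetical names, std::uint32_t replaces unsigned long so that the packed value is 32 bits on every platform, and each channel is masked to 8 bits, which the original code does not do.

```cpp
#include <cstdint>

// Hypothetical host-side mirror of RgbPixel; not the original type.
struct Rgb
{
    std::uint32_t r;
    std::uint32_t g;
    std::uint32_t b;
};

constexpr std::uint32_t kOpaqueAlpha = 0xFFu;

// Layout, most significant byte first: A R G B.
// The & 0xFFu masks are an addition that keeps an out-of-range channel
// from spilling into its neighbor's byte.
inline std::uint32_t PackArgb(const Rgb& c)
{
    return (c.b & 0xFFu)
         | ((c.g & 0xFFu) << 8)
         | ((c.r & 0xFFu) << 16)
         | (kOpaqueAlpha << 24);
}

// Inverse of PackArgb; the alpha byte is discarded.
inline Rgb UnpackArgb(std::uint32_t packed)
{
    return Rgb{ (packed >> 16) & 0xFFu, (packed >> 8) & 0xFFu, packed & 0xFFu };
}
```

Packing r = 0x12, g = 0x34, b = 0x56 gives 0xFF123456, and unpacking that value returns the same three channels.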
