How do you iterate through a pitched CUDA array?

Problem description

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.

CUDA by Example is a good place to start.

Page 43 shows this snippet:

__global__ void add( int *a, int *b, int *c ) {
  int tid = blockIdx.x; // handle the data at this index
  if (tid < N)
     c[tid] = a[tid] + b[tid];
}

Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and number of threads in <<<...>>>) to run it sufficient times to iterate through your array, using a thread ID number as an iterator. In other words you can have a CUDA kernel always run 10,000 times which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
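
For reference, a minimal host-side launch for that snippet might look like the following. This is only a sketch: the device pointers dev_a, dev_b, dev_c and the copies into them are my assumptions, not part of the book excerpt, and one block per element is launched to match tid = blockIdx.x in the kernel.

#define N 10000

int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
// ... fill dev_a and dev_b with cudaMemcpy from host arrays ...

// One block per element, so blockIdx.x runs from 0 to N-1:
add<<<N, 1>>>(dev_a, dev_b, dev_c);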

For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:

// Host code
int width = 64, height = 64; 
float* devPtr; size_t pitch; 
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

MyKernel<<<100, 512>>>(devPtr, pitch, width, height); 

// Device code 
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height) 
{ 
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch); // pitch is in bytes
        for (int c = 0; c < width; ++c) { 
            float element = row[c]; 
        }
    }
}

This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).

According to this answer, when there are blocks of threads involved, the iterator will be

int tid = (blockIdx.x * blockDim.x) + threadIdx.x;

So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.
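
For illustration, that wasteful "cover everything including the padding" version might look roughly like this (a sketch only, reusing the devPtr/pitch allocation from the guide's example above; the kernel name is hypothetical, and the threads that land on padding elements are simply masked out):

__global__ void FlatKernel(float* devPtr, size_t pitch, int width, int height)
{
    int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
    int elemsPerRow = (int)(pitch / sizeof(float));  // row length including padding
    if (tid < elemsPerRow * height) {
        int r = tid / elemsPerRow;
        int c = tid % elemsPerRow;                   // c can fall inside the padding
        if (c < width) {                             // skip the padding elements
            float* row = (float*)((char*)devPtr + r * pitch);
            float element = row[c];                  // ... do something with element ...
        }
    }
}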

So how do I iterate through a pitched array without going through the padding elements?

In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).

Recommended answer

After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.

In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
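
For concreteness, the defines might look something like this. Only the cufftComplex line is given in the answer; the rest is my assumption following the standard cuFFT type names.

// Single precision:
#define CFPtype cufftComplex   // complex floating point
#define FPtype  float          // real floating point

// Double precision would instead be:
// #define CFPtype cufftDoubleComplex
// #define FPtype  double

Switching to double precision would also mean using CUFFT_Z2Z and cufftExecZ2Z in place of CUFFT_C2C and cufftExecC2C further down.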

I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.

The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.

So given an image of dimensions width x height in pixels, this is how it comes together:

Allocating the memory

I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case the dimensions are all the same (complex to complex transform) but I have arrays that are real numbers so I store a complexPitch and a realPitch. The pitched memory is done like this:

cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
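
The cudaHostAlloc call itself is not shown in the answer; a minimal sketch of it (inputPinned is an assumed name, and the host buffer is unpadded) would be:

CFPtype *inputPinned;
cudaHostAlloc((void**)&inputPinned, (size_t)width * height * sizeof(CFPtype),
              cudaHostAllocDefault);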

To copy memory to/from pitched arrays you cannot use cudaMemcpy; you need cudaMemcpy2D:

cudaMemcpy2D(inputGPU, complexPitch,               // destination and destination pitch
             inputPinned, width * sizeof(CFPtype), // source and source pitch (= width in bytes because it's not padded)
             width * sizeof(CFPtype), height,      // width in bytes and height in rows to copy
             cudaMemcpyKind::cudaMemcpyHostToDevice);
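
Copying results back works the same way in the other direction. For example, pulling the magnitude array back into an unpadded pinned host buffer might look like this (magnitudeGPU and magnitudePinned are assumed names; realPitch is the pitch stored for the real-valued arrays):

cudaMemcpy2D(magnitudePinned, width * sizeof(FPtype), // destination and destination pitch (unpadded)
             magnitudeGPU, realPitch,                 // source and source pitch
             width * sizeof(FPtype), height,          // width in bytes and height in rows to copy
             cudaMemcpyKind::cudaMemcpyDeviceToHost);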

FFT plans for pitched arrays

JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:

int n[] = {height, width};
int nembed[] = {height, (int)(complexPitch/sizeof(CFPtype))};
result = cufftPlanMany(
    &plan, 
    2, n, //transform rank and dimensions
    nembed, 1, //input array physical dimensions and stride
    1, //input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1, //output array physical dimensions and stride
    1, //output distance to next batch
    cufftType::CUFFT_C2C, 1);
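
Since the return value is stored in result, the plan creation can be checked before going any further (a small sketch, assuming result has type cufftResult and <cstdio> is included):

if (result != CUFFT_SUCCESS) {
    fprintf(stderr, "cufftPlanMany failed with error %d\n", (int)result);
    // bail out / clean up here
}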

Executing the FFT is trivial:

cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);

So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.

void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));

    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
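
A call to this wrapper might then look like the following; magnitudeGPU, phaseGPU and the block size of 16 (giving 16 x 16 = 256 threads per block) are my assumptions, while outputGPU, width, height and the two pitches come from the code above:

GPUCalcMagPhase(outputGPU, complexPitch, width, height,
                magnitudeGPU, phaseGPU, realPitch, 16);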

Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.

Looping through the pitched array in parallel

The kernel uses the standard pointer arithmetic from the documentation:

__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width) 
        return;

    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;

    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];

    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);

    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);

    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}

The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.
