How to verify wavefront/warp size in OpenCL?


Problem Description

I am using an AMD Radeon HD 7700 GPU. I want to use the following kernel to verify that the wavefront size is 64.

/* T is a placeholder type, assumed (not shown in the question) to be
   defined at build time, e.g. with -DT=uint */
__kernel
void kernel__test_warpsize(
        __global T* dataSet,
        uint size
        )
{
    size_t idx = get_global_id(0);

    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx + 1] = value;
}

In the main program, I pass an array with 128 elements, initialized to dataSet[i] = i. After the kernel, I expect the following values:

dataSet[0] = 0
dataSet[1] = 0
dataSet[2] = 1
...
dataSet[63] = 62
dataSet[64] = 63
dataSet[65] = 63
dataSet[66] = 65
...
dataSet[127] = 126

However, I found that dataSet[65] is 64, not 63, which is not what I expected.

My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront executes, thread #64 should read 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
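(For reference, the main program is not shown in the question; the following is a hypothetical host-side sketch of such a setup. The -DT=uint build option, all identifiers, and the omitted error checking are assumptions for illustration only.)

/* Hypothetical host program (not from the original question): builds the
 * kernel with -DT=uint and launches 128 work-items over a 128-element
 * buffer initialised to dataSet[i] = i. Error checking omitted for brevity. */
#include <CL/cl.h>
#include <stdio.h>

static const char *source =
    "__kernel void kernel__test_warpsize(__global T* dataSet, uint size) {\n"
    "    size_t idx = get_global_id(0);\n"
    "    T value = dataSet[idx];\n"
    "    if (idx < size - 1)\n"
    "        dataSet[idx + 1] = value;\n"
    "}\n";

int main(void)
{
    enum { N = 128 };
    cl_uint data[N];
    for (cl_uint i = 0; i < N; ++i)
        data[i] = i;                                  /* dataSet[i] = i */

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
    clBuildProgram(prog, 1, &device, "-DT=uint", NULL, NULL);   /* T defined here */
    cl_kernel kernel = clCreateKernel(prog, "kernel__test_warpsize", NULL);

    cl_uint size = N;
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);
    clSetKernelArg(kernel, 1, sizeof(size), &size);

    size_t global = N;                                /* one work-item per element */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("dataSet[64] = %u, dataSet[65] = %u\n", data[64], data[65]);
    return 0;
}

With a setup like this, whether dataSet[65] ends up as 63 or 64 depends entirely on how the wavefronts happen to be scheduled, which is the point of the answer below.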

Solution

You are invoking undefined behaviour. If you want to access memory that another work-item in the work-group is writing, you must use barriers.

In addition, suppose the GPU runs two wavefronts at once. Then dataSet[65] does contain the correct value, because the first wavefront simply had not finished writing dataSet[64] when the second wavefront read it.

According to the spec, every element coming out as 0 would also be a valid result, because the work-items could just as well be executed completely serially (thread 0 writes 0 into dataSet[1], thread 1 then reads that 0 and writes it into dataSet[2], and so on). That's why you need barriers.
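Not part of the original answer, but to make the pattern concrete: below is a minimal sketch of the same kernel with the read and write phases separated by a work-group barrier. Note that the barrier synchronises the whole work-group (not just one wavefront), so the result becomes well-defined but no longer reveals wavefront boundaries.

__kernel
void kernel__test_with_barrier(      /* hypothetical variant of the question's kernel */
        __global T* dataSet,
        uint size
        )
{
    size_t idx = get_global_id(0);

    T value = dataSet[idx];          /* every work-item reads its element first */
    barrier(CLK_GLOBAL_MEM_FENCE);   /* wait until the whole work-group has read */
    if (idx < size - 1)
        dataSet[idx + 1] = value;    /* only then overwrite the neighbour */
}

With the barrier in place, every work-item inside one work-group sees the original values, so within that group the result is simply dataSet[i + 1] = i; across work-group boundaries the ordering is still undefined.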

Based on your comments I edited this part:

Install CodeXL: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD hardware schedules the wavefronts within a work-group and how it hides memory latency by interleaving the execution of wavefronts (within a work-group). Branching also affects the execution of the whole work-group, because the effective time to run it is basically the time to execute the single longest-running wavefront (the work-group cannot free its local memory etc. until everything in it has finished, so another work-group cannot be scheduled in its place). But this also depends on your local memory and register usage, etc. To see what actually happens, just grab CodeXL and run a GPU profiling session; that will show exactly what happens on the device.

And even this applies only to the hardware of the current generation. That's why the concept is not part of the OpenCL specification itself: these properties vary a lot and depend heavily on the hardware.

But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It is 64 for all GCN devices, which make up their entire current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for NVIDIA it is 32 in general).
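One addition not in the original answer: if you want the runtime to report this value rather than inferring it experimentally, OpenCL 1.1+ exposes CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which on AMD GCN parts typically equals the wavefront size (64) and on NVIDIA the warp size (32). A minimal sketch, assuming a compiled cl_kernel and the cl_device_id from the host-side sketch earlier:

size_t wg_multiple = 0;   /* typically the wavefront/warp size on GPU devices */
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(wg_multiple), &wg_multiple, NULL);
printf("preferred work-group size multiple: %zu\n", wg_multiple);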
