Understanding work-items and work-groups

Question

Based on my previous question:

I'm still trying to copy an image (no practical reason, just to start with an easy one):

The image contains 200 * 300 == 60000 pixels.

The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE.

kernel1:

std::string kernelCode =
            "void kernel copy(global const int* image, global int* result)"
            "{"
                "result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];"
            "}";

Queue:

for (int offset = 0; offset < 30; ++offset)
        queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000));
queue.finish();

This produces a segfault. What's wrong?

With the last parameter cl::NDRange(20000) it doesn't crash, but it gives back only part of the image.

Also, I don't understand why I can't use this kernel:

kernel2:

std::string kernelCode =
            "void kernel copy(global const int* image, global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

Looking at the 31st slide of this presentation:

Why can't I simply use the global_id?

EDIT1

Platform: AMD Accelerated Parallel Processing

Device: AMD Athlon(tm) II P320 Dual-Core Processor

EDIT2

The result based on huseyin tugrul buyukisik's answer:

EDIT3

With the last parameter cl::NDRange(20000):

The kernel is the first one (kernel1).

EDIT4

std::string kernelCode =
                "void kernel copy(global const int* image, global int* result)"
                "{"
                    "result[get_global_id(0)] = image[get_global_id(0)];"
                "}";
//...
cl_int err;
    err = queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(59904), cl::NDRange(128));

    if (err == 0)
        qDebug() << "success";
    else
    {
        qDebug() << err;
        exit(1);
    }

This prints success.

Maybe this is wrong?

int size = _originalImage.width() * _originalImage.height();
int* result = new int[size];
//...
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size);
//...
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, size, result);

The culprit was:

cl::Buffer imageBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(int) * size, _originalImage.bits());
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * size);
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(int) * size, result);

I used size instead of sizeof(int) * size.

Recommended answer

Edit 2:

Please try the kernel without the const qualifier (const may not be compatible with your CPU):

std::string kernelCode =
            "__kernel void copy(__global int* image, __global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

You may also need to change the buffer options.

Edit

You have left out the '__' prefix in three places, before the 'kernel' and 'global' qualifiers, so please try:

std::string kernelCode =
            "__kernel void copy(__global const int* image, __global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

There are 60000 elements in total, but each launch covers offset + 60000 work-items, which runs past the end of the buffers and reads/writes memory you do not own.

With the OpenCL 1.2 C++ bindings, the usual NDRange call should look like this:

cl_int err;
err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange);

Then check err for the actual error code. 0 means success.
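One error code that such a check commonly surfaces is CL_INVALID_WORK_GROUP_SIZE: in OpenCL 1.x, when an explicit localRange is passed, globalRange has to be a multiple of it. The sketch below only does the host-side arithmetic and does not call OpenCL; the helper names roundUpToMultiple and roundDownToMultiple are hypothetical, not part of any OpenCL API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest multiple of localSize that is >= workItems.
// Assumes localSize > 0.
constexpr std::size_t roundUpToMultiple(std::size_t workItems, std::size_t localSize) {
    return ((workItems + localSize - 1) / localSize) * localSize;
}

// Hypothetical helper: largest multiple of localSize that is <= workItems.
// Assumes localSize > 0.
constexpr std::size_t roundDownToMultiple(std::size_t workItems, std::size_t localSize) {
    return (workItems / localSize) * localSize;
}
```

For 60000 pixels and a local size of 128, rounding down gives the 59904 used in the question's EDIT4, which leaves the last 96 pixels uncopied. Rounding up gives 60032, which covers every pixel but only works if the kernel skips indices at or beyond 60000, for example through an extra element-count argument that the kernels shown here do not have.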

If you want to divide the work into smaller parts, you should cap the range of each launch at 60000/N.

If you divide it into 30 parts, then:

for (int offset = 0; offset < 30; ++offset)
        queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000/30));
queue.finish();
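To see why the per-launch range matters, the following host-only sketch simulates which element indices a series of offset launches would touch. It assumes the kernel indexes with get_global_id(0), which includes the launch offset; Coverage, simulateLaunches and everyElementHitOnce are hypothetical names used only for this illustration, and nothing here calls OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of a simulated series of launches over `total` elements.
struct Coverage {
    std::vector<int> hits;        // how often each valid index was touched
    std::size_t outOfBounds = 0;  // how many indices fell at or past `total`
};

// Simulates `chunks` launches; launch k starts at offset k * stride and
// runs `perLaunch` work-items, each touching index offset + i.
Coverage simulateLaunches(std::size_t total, std::size_t chunks,
                          std::size_t stride, std::size_t perLaunch) {
    Coverage c;
    c.hits.assign(total, 0);
    for (std::size_t k = 0; k < chunks; ++k) {
        const std::size_t offset = k * stride;
        for (std::size_t i = 0; i < perLaunch; ++i) {
            const std::size_t gid = offset + i;
            if (gid < total) {
                ++c.hits[gid];
            } else {
                ++c.outOfBounds;  // a real kernel would access foreign memory here
            }
        }
    }
    return c;
}

// True if every element was touched exactly once and nothing went out of bounds.
bool everyElementHitOnce(const Coverage& c) {
    if (c.outOfBounds != 0) return false;
    for (int h : c.hits) {
        if (h != 1) return false;
    }
    return true;
}
```

Calling simulateLaunches(60000, 30, 2000, 2000) models the corrected loop, while simulateLaunches(60000, 30, 2000, 60000) models the original one, where every launch after the first runs past the end of the buffers.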

Also double-check the size of each buffer. It should be sizeof(cl_int) * arrElementNumber,

because the size of a host integer may not be the same as the size of a device integer. You need 60000 elements? Then you need to pass 240000 bytes as the size when creating the buffer.
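The byte arithmetic can be checked with a small host-only sketch. It uses std::int32_t as a stand-in for cl_int (the OpenCL headers define cl_int as a 32-bit signed integer); DeviceInt, bufferBytes and elementsThatFit are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for cl_int, assumed here so the sketch compiles without OpenCL headers.
using DeviceInt = std::int32_t;

// Number of bytes to request for a buffer holding elementCount integers.
constexpr std::size_t bufferBytes(std::size_t elementCount) {
    return sizeof(DeviceInt) * elementCount;
}

// Number of whole integers that fit into a buffer created with `bytes` bytes.
constexpr std::size_t elementsThatFit(std::size_t bytes) {
    return bytes / sizeof(DeviceInt);
}
```

bufferBytes(60000) yields the 240000 bytes mentioned above. Conversely, elementsThatFit(60000) shows that passing the element count as the byte size leaves room for only 15000 integers, a quarter of the image.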

For compatibility, you should check the size of an integer before creating the buffers if you plan to run this code on another machine.

You may know this already, but I'm going to tell you anyway:

CL_DEVICE_MAX_WORK_GROUP_SIZE

is the maximum number of threads (work-items) in one work-group, that is, threads that can share local memory on a compute unit. You don't need to divide your work by hand just for this. OpenCL does it automatically: it gives each thread a global ID that is unique across the whole range and a local ID that is unique within its work-group. If CL_DEVICE_MAX_WORK_GROUP_SIZE is 4100, then up to 4100 threads can share the same local variables in a compute unit. You can still process all 60000 elements in a single sweep; the only addition is that multiple work-groups are created for this, and each group has a group ID.

  // this should work without a problem
  queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(60000));
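The relationship between the three IDs can be sketched on the host. For a one-dimensional range launched with a zero offset, the global ID equals group ID times local size plus local ID, which is why the two kernels in the question index the same elements. The work-group size of 100 used in the usage note is a hypothetical value (the runtime picks its own when no local range is given), and WorkItemIds, decompose, recompose and identityHolds are illustrative names, not OpenCL functions.

```cpp
#include <cassert>
#include <cstddef>

// Group and local ID of one work-item in a 1-D range with zero offset.
struct WorkItemIds {
    std::size_t groupId;
    std::size_t localId;
};

// Splits a global ID the way a 1-D NDRange with work-groups of localSize does.
constexpr WorkItemIds decompose(std::size_t globalId, std::size_t localSize) {
    return WorkItemIds{globalId / localSize, globalId % localSize};
}

// Rebuilds the global ID from its parts: group * localSize + local.
constexpr std::size_t recompose(WorkItemIds ids, std::size_t localSize) {
    return ids.groupId * localSize + ids.localId;
}

// Checks that recompose(decompose(g)) == g for every work-item in the range.
bool identityHolds(std::size_t globalSize, std::size_t localSize) {
    for (std::size_t g = 0; g < globalSize; ++g) {
        if (recompose(decompose(g, localSize), localSize) != g) return false;
    }
    return true;
}
```

With 60000 work-items and the assumed group size of 100, the last work-item (global ID 59999) sits in group 599 with local ID 99. Note that this identity assumes a zero offset: with a non-zero launch offset, get_global_id(0) includes the offset while the group and local IDs do not.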

If you have an AMD GPU or CPU and you are using MSVC, you can install CodeXL from the AMD site and choose system info from the drop-down menu to look at the relevant numbers.

Which device is yours? I couldn't find any device with a max work-group size of 4100! My CPU has 1024, my GPU has 256. Is that a Xeon Phi?

For example, the total number of work-items here can be as large as 256*256 times the work-group size.

CodeXL has other nice features, such as performance profiling and code tracing, if you need maximum performance and bug fixing.
