OpenCL kernel performing very poorly?

Question

My application takes 5200 ms to process a data set using OpenCL on the GPU and 330 ms for the same data using OpenCL on the CPU, while the same processing done on the CPU without OpenCL, using multiple threads, takes 110 ms. The OpenCL timing covers kernel execution only: it starts just before clEnqueueNDRangeKernel and ends just after clFinish. A Windows gadget tells me that I am using only 19% of the GPU. Even if I could get that to 100%, it would still take ~1000 ms, which is much slower than my CPU.

The work-group size is a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and I am using all compute units (6 on the GPU and 4 on the CPU). Here is my kernel:

__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
    size_t gid = get_global_id(0);

    myreal pCoef = coef[gid];
    myreal pRow = row[gid];

    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}

I am getting similarly poor performance for another kernel:

__kernel void calcURatios(__global myreal *ratios, __global myreal *rhs, myreal c, myreal r)
{
    size_t gid = get_global_id(0);

    myreal pRatios = ratios[gid];
    myreal pRHS = rhs[gid];

    pRatios = pRatios / c;
    ratios[gid] = pRatios;

    //pRatios = pRatios * r;
    pRHS = pRHS - (pRatios * r);
    rhs[gid] = pRHS;
}

Questions:

  1. Why is my GPU performing so poorly compared to the CPU on OpenCL?
  2. Why is the CPU on OpenCL 3x slower than the multithreaded CPU code without OpenCL?

Recommended answer

Maybe you could add some information about how you enqueue this kernel. Is it perhaps enqueued with an inappropriate local work size? (When in doubt, just pass NULL as the local work size and OpenCL will choose an appropriate one.)
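To make the local work size point concrete: passing NULL as the local_work_size argument of clEnqueueNDRangeKernel leaves the choice of work-group size to the implementation. If you set it explicitly instead, OpenCL 1.x requires the global size to be a multiple of it. The following is only a sketch in C. The helper round_up_global_size is a hypothetical name, not part of OpenCL, and the variables in the comments (queue, kernel, n, global, local) are assumed to exist in your host code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): round global_size up to the
   next multiple of local_size, as OpenCL 1.x requires when an explicit
   local work size is passed. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}

/* Option 1: let the implementation choose the work-group size.
 *   size_t global = n;
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
 *                          0, NULL, NULL);
 *
 * Option 2: explicit local size with a padded global size.
 *   size_t local  = 64;   // assumed value, e.g. a multiple of
 *                         // CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
 *   size_t global = round_up_global_size(n, local);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
 *                          0, NULL, NULL);
 *
 * With a padded global size, the kernels shown in the question would need
 * a bounds check (or padded buffers), because some work-items get a
 * gid >= n.
 */
```

For example, round_up_global_size(1000, 64) gives 1024, so 24 work-items would fall outside a 1000-element buffer unless the kernel guards against it.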

But even in the best case, it's unlikely that you will see a speedup here. The computation that you are doing is heavily memory-bound. In the first kernel, you read two elements from global memory, perform a trivial multiplication and subtraction, and then write one element back to global memory (the second kernel is not much different). The bottleneck here is simply not the computation, but the data transfer.
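A rough back-of-the-envelope estimate shows how memory-bound this is. The sketch below, in C, counts the global-memory traffic per element for the first kernel (two loads and one store) and derives the arithmetic intensity and a bandwidth-only lower bound on the run time. The function names are hypothetical, and the element size and bandwidth you plug in are assumptions: the question does not say whether myreal is float or double, nor what the device bandwidth is.

```c
#include <stddef.h>

/* Bytes moved through global memory per element by reduceURatios:
   two loads (coef[gid], row[gid]) and one store (coef[gid]). */
double bytes_per_element(size_t elem_size)
{
    return 3.0 * (double)elem_size;
}

/* Arithmetic intensity in floating-point operations per byte moved. */
double arithmetic_intensity(double flops_per_elem, double bytes_per_elem)
{
    return flops_per_elem / bytes_per_elem;
}

/* Lower bound on kernel time in milliseconds if memory bandwidth
   (given in GB/s, with 1 GB = 1e9 bytes) is the only limit. */
double min_transfer_time_ms(size_t n, double bytes_per_elem,
                            double bandwidth_gb_per_s)
{
    double total_bytes = (double)n * bytes_per_elem;
    return total_bytes / (bandwidth_gb_per_s * 1e9) * 1e3;
}
```

Assuming myreal is an 8-byte double, each element moves 24 bytes for one multiplication and one subtraction, which is about 0.08 operations per byte. The arithmetic units spend almost all of their time waiting for memory, whichever device runs the kernel.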

(BTW: I recently wrote a few general words about this in https://stackoverflow.com/a/22868938 ).

Maybe the new developments around Unified Memory, HSA, AMD Kaveri etc. will come to the rescue here, but they are still at an early stage.

Maybe you could also describe the context in which you are performing these computations. If you have further computations (kernels) that work on the results of this kernel, maybe they could be combined in order to improve the memory/computation ratio.
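To illustrate what combining could look like, here is a sketch in plain C rather than OpenCL C, so the loops stand in for work-items. The first function does the same arithmetic as reduceURatios. The follow-up scaling step is purely hypothetical, since the question does not say what happens to the results, and myreal is assumed to be double.

```c
#include <stddef.h>

typedef double myreal; /* assumption: the question does not define myreal */

/* Pass 1: same arithmetic as the reduceURatios kernel. */
void reduce_u_ratios(myreal *coef, const myreal *row, myreal ratio, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        coef[i] = coef[i] - row[i] * ratio;
}

/* Pass 2: hypothetical follow-up step that reads and rewrites coef again. */
void scale(myreal *coef, myreal s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        coef[i] = coef[i] * s;
}

/* Fused version: the intermediate value stays in a local variable, so each
   element costs two loads and one store in total, instead of three loads
   and two stores across the two separate passes. */
void reduce_and_scale_fused(myreal *coef, const myreal *row,
                            myreal ratio, myreal s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        myreal v = coef[i] - row[i] * ratio;
        coef[i] = v * s;
    }
}
```

The fused version does twice the arithmetic per element for the same memory traffic as the first pass alone, which is the kind of improvement in the memory/computation ratio meant here. In OpenCL it would also save one kernel launch.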
