Elementwise operations in OpenCL (CUDA)


Question


I built a kernel for elementwise multiplication of two matrices, but at least with my configuration my OpenCL kernel is only faster when each matrix is larger than 2GB. So I was wondering whether that is because of my naive kernel (see below) or because of the nature of elementwise operations, meaning that elementwise operations don't gain from using GPUs.

Thanks for your input!

Kernel:

KERNEL_CODE = """
// elementwise multiplication: C = A .* B.
__kernel void matrixMul(
        __global float* C,
        __global float* A,
        __global float* B,
        int width, int height)
{
    // ID
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Multiplying
    C[y * height + x ] = A[y * height + x] * B[y * height + x];
}
"""


P.S. I have read that some experts think CUDA is too different from OpenCL to answer for both in the same question, so feel free to remove it from the title and tags.

Answer


That sort of operation has N FLOPs, but 3N memory transactions, so it will be completely memory bandwidth bound. There is no scope for data re-use, so the upper bound of speed up over the reference CPU version is the ratio of GPU to CPU bandwidth. That number is rarely more than 10 times, and can get eroded pretty quickly by the cost of moving the data to and from GPU memory. Generally speaking, this sort of operation is best "fused" with other O(N) operations to improve performance. You would usually never just compute the Hadamard product in a single kernel, you would do it as part of a series of O(N) operations within one kernel. So, no, this is not a great candidate for speed up, even if the kernel were optimal.
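The bandwidth argument above can be put into numbers with a quick back-of-the-envelope sketch. The bandwidth figures below are illustrative placeholders, not measurements; substitute your own hardware's numbers:

```python
# Rough roofline estimate for C = A .* B on N float32 elements.
N = 256 * 1024 * 1024          # elements per matrix
bytes_moved = 3 * N * 4        # read A, read B, write C; 4 bytes per float
flops = N                      # one multiply per element -> memory bound

# Assumed bandwidths (bytes/s) -- placeholder values for illustration.
cpu_bw = 20e9                  # host memory bandwidth
gpu_bw = 200e9                 # device memory bandwidth
pcie_bw = 8e9                  # host <-> device transfer bandwidth

cpu_time = bytes_moved / cpu_bw
gpu_time = bytes_moved / gpu_bw

# If the data has to cross PCIe both ways (A and B down, C back up),
# the transfers dominate the kernel time entirely:
transfer_time = (2 * N * 4) / pcie_bw + (N * 4) / pcie_bw

print(f"kernel-only speed-up:     {cpu_time / gpu_time:.1f}x")
print(f"speed-up incl. transfers: {cpu_time / (gpu_time + transfer_time):.2f}x")
```

With these placeholder numbers the kernel-only speed-up is exactly the 10x bandwidth ratio, but once PCIe transfers are included the GPU version comes out slower than the CPU (about 0.38x), which is the erosion the answer describes.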


And your kernel definitely isn't. You are doing 3 IOPs for every FLOP, which is a huge penalty. You could definitely do things to improve this, but what things will depend completely on what sort of hardware this is going to run on.
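To illustrate the "fusion" point from the answer, here is a hedged sketch of what a fused kernel might look like, written in the same Python-kernel-string convention as the question. The kernel name, the `alpha`/bias operation, and the 1D indexing are all made-up illustrations: the idea is simply that a scale, the Hadamard product, and an addition happen in one pass, so no intermediate result ever makes a round trip through global memory. The NumPy code below only checks that the fused expression matches the two-pass version:

```python
import numpy as np

# Hypothetical fused kernel: D = alpha * (A .* B) + C in a single pass.
FUSED_KERNEL_CODE = """
__kernel void fusedMulAdd(
        __global float* D,
        __global const float* A,
        __global const float* B,
        __global const float* C,
        const float alpha,
        const int n)
{
    int i = get_global_id(0);   // 1D indexing: one work-item per element
    if (i < n)                  // guard against padded global sizes
        D[i] = alpha * A[i] * B[i] + C[i];
}
"""

# NumPy reference: the fused single pass vs. two separate O(N) passes.
def fused_ref(A, B, C, alpha):
    return alpha * A * B + C

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal(1000, dtype=np.float32) for _ in range(3))
tmp = A * B                             # pass 1: a whole kernel on its own
two_pass = np.float32(2.0) * tmp + C    # pass 2: another full read/write of N
print(np.allclose(fused_ref(A, B, C, np.float32(2.0)), two_pass))  # True
```

The two-pass version moves roughly twice the data of the fused one (it writes and re-reads `tmp`), which is exactly why fusing O(N) operations pays off on a bandwidth-bound problem.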
