In CUDA programming, are atomic functions faster than reducing after calculating the intermediate results?

Question

Atomic functions (such as atomicAdd) are widely used for counting or for summation/aggregation in CUDA programming. However, I cannot find information about the speed of atomic functions compared with ordinary global memory reads and writes.

Consider the following task: we want to compute a floating-point array with 256K elements. Each element is the sum of 1,000 intermediate values, which are computed first. One approach is to accumulate with atomicAdd; another is to store the intermediate results in a temporary array of 256K*1000 elements and then reduce that array by summation.

Is the first approach, using atomic functions, faster than the second?
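For concreteness, the two approaches described above might look roughly like the following minimal sketch. The question gives no code, so everything here is an assumption: the function `intermediate` is a hypothetical stand-in for the unspecified computation, and the kernel and helper names are made up for illustration. The reduction in the second approach is deliberately naive.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the intermediate computation, which the question
// does not specify. Small integer values keep float sums exact in any order.
__host__ __device__ inline float intermediate(int i, int k) {
    return (float)((i + k) % 7);
}

// Approach 1: one thread per intermediate value, accumulated with atomicAdd.
__global__ void sumWithAtomics(float* out, int numOutputs, int numTerms) {
    long long t = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= (long long)numOutputs * numTerms) return;
    int i = (int)(t / numTerms);
    int k = (int)(t % numTerms);
    atomicAdd(&out[i], intermediate(i, k));
}

// Approach 2, step 1: store every intermediate value in a temporary array.
__global__ void fillTemp(float* temp, int numOutputs, int numTerms) {
    long long t = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= (long long)numOutputs * numTerms) return;
    temp[t] = intermediate((int)(t / numTerms), (int)(t % numTerms));
}

// Approach 2, step 2: naive reduction, one thread per output element.
__global__ void reduceTemp(const float* temp, float* out,
                           int numOutputs, int numTerms) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numOutputs) return;
    float acc = 0.0f;
    for (int k = 0; k < numTerms; ++k)
        acc += temp[(long long)i * numTerms + k];
    out[i] = acc;
}

// Runs both approaches and reports whether they produce identical results.
bool approachesAgree(int numOutputs, int numTerms) {
    long long total = (long long)numOutputs * numTerms;
    const int threads = 256;
    unsigned int blocksAll = (unsigned int)((total + threads - 1) / threads);
    unsigned int blocksOut = (unsigned int)((numOutputs + threads - 1) / threads);
    size_t outBytes = (size_t)numOutputs * sizeof(float);

    float *outA = nullptr, *outB = nullptr, *temp = nullptr;
    if (cudaMalloc(&outA, outBytes) != cudaSuccess) return false;
    if (cudaMalloc(&outB, outBytes) != cudaSuccess) return false;
    if (cudaMalloc(&temp, (size_t)total * sizeof(float)) != cudaSuccess) return false;

    cudaMemset(outA, 0, outBytes);  // the atomics accumulate into zeros
    sumWithAtomics<<<blocksAll, threads>>>(outA, numOutputs, numTerms);
    fillTemp<<<blocksAll, threads>>>(temp, numOutputs, numTerms);
    reduceTemp<<<blocksOut, threads>>>(temp, outB, numOutputs, numTerms);

    std::vector<float> a(numOutputs), b(numOutputs);
    bool ok = cudaMemcpy(a.data(), outA, outBytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(b.data(), outB, outBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(outA); cudaFree(outB); cudaFree(temp);
    return ok && a == b;
}
```

Calling `approachesAgree(1024, 1000)` exercises both paths on a small problem. At the full size in the question, the temporary array alone would hold 256K*1000 floats, about 1 GB.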

Recommended answer

In your specific case, even without a concrete program, one does not need to know anything about the difference in latency or bandwidth between atomic and non-atomic operations to rule out both of your approaches: they are both quite inefficient.

You should have each block handle a single output variable (or a small number of output variables), so that the sum of each group of 1,000 intermediate variables is not performed through global memory.

You may want to read the "classic" presentation by Mark Harris to get the basics. There have been improvements on it in recent years, thanks to newer hardware capabilities. For a more recent actual implementation, see the CUB library's block reduction primitive.
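As a rough illustration of the one-block-per-output idea, here is a minimal sketch that sums the intermediates in registers and shared memory and writes each output exactly once. It is a plain textbook tree reduction, not the optimized kernels from the presentation and not CUB's implementation. The function `intermediate`, the kernel name, the host helper, and the block size of 256 are all assumptions made for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

// Hypothetical stand-in for the intermediate computation, which the question
// does not specify.
__host__ __device__ inline float intermediate(int i, int k) {
    return (float)((i + k) % 7);
}

// One block per output element: no intermediate value touches global memory.
__global__ void sumPerBlock(float* out, int numTerms) {
    __shared__ float partial[kBlockSize];
    int i = blockIdx.x;

    // Each thread accumulates a strided subset of the terms in a register.
    float acc = 0.0f;
    for (int k = threadIdx.x; k < numTerms; k += blockDim.x)
        acc += intermediate(i, k);
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // A single global-memory write per output element.
    if (threadIdx.x == 0) out[i] = partial[0];
}

// Host helper: launches the kernel and returns the results
// (an empty vector signals a CUDA error).
std::vector<float> sumPerBlockHost(int numOutputs, int numTerms) {
    size_t bytes = (size_t)numOutputs * sizeof(float);
    float* dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return {};
    sumPerBlock<<<numOutputs, kBlockSize>>>(dOut, numTerms);
    std::vector<float> result(numOutputs);
    bool ok = cudaMemcpy(result.data(), dOut, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);
    if (!ok) result.clear();
    return result;
}
```

With 256K outputs this launches 256K blocks, which is within the x-dimension grid limit of current GPUs. Handling a few outputs per block, as the answer also suggests, would be a straightforward variation.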

Also relevant: if you implement it this way, each output element is written only once. And even if the computation of the 1,000 intermediates somehow needs to be distributed among multiple blocks, for a reason you have not shared in the question, you should still distribute it over far fewer than 1,000 blocks. The global-memory writes of the results then take up such a small fraction of the total computation time that it is not worth bothering with anything other than an atomic addition.
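If the terms of one output do have to be spread over several blocks, a sketch along the following lines is one way to do it: each block reduces its own chunk in shared memory and then issues a single atomicAdd. The 2D grid layout (x for the output index, y for the chunk), the names, and the `intermediate` function are assumptions for illustration only.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

// Hypothetical stand-in for the intermediate computation, which the question
// does not specify.
__host__ __device__ inline float intermediate(int i, int k) {
    return (float)((i + k) % 7);
}

// blockIdx.x selects the output element, blockIdx.y selects a chunk of terms.
__global__ void sumChunksWithAtomic(float* out, int numTerms) {
    __shared__ float partial[kBlockSize];
    int i = blockIdx.x;
    int chunkLen = (numTerms + gridDim.y - 1) / gridDim.y;
    int begin = blockIdx.y * chunkLen;
    int end = min(begin + chunkLen, numTerms);

    // Per-thread partial sum over this block's chunk of terms.
    float acc = 0.0f;
    for (int k = begin + threadIdx.x; k < end; k += blockDim.x)
        acc += intermediate(i, k);
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block, so only gridDim.y atomics per output element.
    if (threadIdx.x == 0) atomicAdd(&out[i], partial[0]);
}

// Host helper: zeroes the output, launches the kernel, returns the results
// (an empty vector signals a CUDA error).
std::vector<float> sumChunksHost(int numOutputs, int numTerms, int numChunks) {
    size_t bytes = (size_t)numOutputs * sizeof(float);
    float* dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return {};
    cudaMemset(dOut, 0, bytes);  // atomicAdd accumulates into zeros
    dim3 grid(numOutputs, numChunks);
    sumChunksWithAtomic<<<grid, kBlockSize>>>(dOut, numTerms);
    std::vector<float> result(numOutputs);
    bool ok = cudaMemcpy(result.data(), dOut, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);
    if (!ok) result.clear();
    return result;
}
```

With, say, `numChunks = 4`, each output receives four atomic additions instead of 1,000, which is the sense in which the atomic cost becomes a small fraction of the total work.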
