Efficiency of CUDA scalar and SIMD video instructions


Question

The throughput of SIMD video instructions is lower than that of 32-bit integer arithmetic. On SM 2.0 (the scalar-instruction-only versions) it is 2 times lower. On SM 3.0 it is 6 times lower.

In which cases is it suitable to use them?

Solution

If your data is already packed in a format that is handled natively by a SIMD video instruction, then it would require multiple steps to unpack it so that it can be handled by an ordinary instruction.

Furthermore, the throughput of a SIMD video instruction should also be multiplied by the number of actual operations performed when comparing it with ordinary arithmetic operations.

For example, the instruction vadd4 performs 4 integer adds on a packed 32-bit quantity (four byte-sized integers). To duplicate this behavior with ordinary integer adds, a fairly complicated sequence of instructions would be needed to unpack the data into 4 int quantities, do 4 integer adds, and then re-pack the result. If you attempted to do it with a single integer add, a carry out of one byte could corrupt the neighboring result. vadd4 also offers clamping and other behavior not available with an ordinary integer add.
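To make the carry problem concrete, here is a minimal host-side C++ sketch. It only emulates byte-wise addition on a packed 32-bit word; it is not the GPU instruction itself. The function names (add4_naive, add4_unpacked, add4_saturated) are hypothetical, and the clamping shown is assumed to be unsigned saturation at 255, chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Naive attempt: a single 32-bit add over the whole packed word.
// A carry out of one byte lane spills into the lane above it.
inline std::uint32_t add4_naive(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Unpack / add / re-pack: four independent adds, each wrapped to 8 bits.
inline std::uint32_t add4_unpacked(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 8 * lane;
        const std::uint32_t x = (a >> shift) & 0xFFu;  // unpack lane of a
        const std::uint32_t y = (b >> shift) & 0xFFu;  // unpack lane of b
        const std::uint32_t sum = (x + y) & 0xFFu;     // add, discard carry
        result |= sum << shift;                        // re-pack
    }
    return result;
}

// Same as above, but each lane is clamped to 255 instead of wrapping
// (assumed unsigned saturation, for illustration only).
inline std::uint32_t add4_saturated(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 8 * lane;
        const std::uint32_t x = (a >> shift) & 0xFFu;
        const std::uint32_t y = (b >> shift) & 0xFFu;
        const std::uint32_t sum = x + y;
        result |= (sum > 0xFFu ? 0xFFu : sum) << shift;
    }
    return result;
}

// Small self-check showing where the three variants differ.
inline void check_examples() {
    // 0xFF + 0x01 in the lowest lane: the naive add leaks a carry into lane 1.
    assert(add4_naive(0x000000FFu, 0x00000001u) == 0x00000100u);
    assert(add4_unpacked(0x000000FFu, 0x00000001u) == 0x00000000u);
    assert(add4_saturated(0x000000FFu, 0x00000001u) == 0x000000FFu);
}
```

Each lane in the unpacked versions costs several shifts, masks, and an OR on top of the add itself, which is the overhead the answer refers to when it compares against a single packed instruction.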

On SM 2.0, the ratio alone makes vadd4 attractive: it performs 4 operations in one instruction, versus the 4 separate integer adds needed on unpacked data. On SM 3.0, vadd4 looks attractive once the cost of unpacking and re-packing is added to the ordinary integer add routine. The situation becomes even more attractive with compute capability (cc) 5.0.
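As a rough worked calculation, take the throughput ratios quoted in the question at face value (they are the asker's figures, not verified here) and let T be the throughput of an ordinary 32-bit integer add. Multiplying by the 4 byte-wise operations that one vadd4 performs gives the effective per-operation throughput:

```latex
\begin{aligned}
\text{SM 2.0:}\quad & 4 \times \tfrac{T}{2} = 2T \\
\text{SM 3.0:}\quad & 4 \times \tfrac{T}{6} = \tfrac{2}{3}\,T
\end{aligned}
```

On SM 2.0 the result is already above T. On SM 3.0 it is below T, so under these figures vadd4 comes out ahead only because the scalar path also pays for unpacking and re-packing: one vadd4 occupies roughly the issue slots of 6 ordinary integer instructions, so it wins whenever the scalar sequence for all four lanes takes more than 6 instructions.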
