Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

Question

In CUDA 9, NVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, `__ballot()` is now (as of CUDA 9) deprecated in favor of `__ballot_sync()`. Is that an alias, or have the semantics changed?

... similar question for other builtins which now have `_sync()` added to their names.

Answer

No, the semantics are not the same. The function calls themselves are different: one is not an alias for the other, new functionality has been exposed, and the implementation behavior now differs between the Volta architecture and previous architectures.

First of all, to lay the groundwork, it's necessary to be aware that Volta introduced the possibility of independent thread scheduling, by introducing a per-thread program counter and other changes. As a result, Volta can execute in a non-warp-synchronous manner for extended periods of time, including during periods of execution in which previous architectures might still be warp-synchronous.

Most of the warp intrinsics deliver the expected results only for threads that are actually participating (i.e. are actually active for the issue of that instruction, in that cycle). The programmer can now be explicit about which threads are expected to participate, via the new mask parameter. However, there are some requirements, in particular on Pascal and previous architectures. From the programming guide:

Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask.
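As a minimal sketch of the mask parameter in the simplest case: the kernel below is launched with exactly one full warp, every lane reaches the same `__ballot_sync()` call, and the mask names all 32 lanes, so the requirement quoted above is met on Pascal and earlier as well as on Volta. The kernel name `count_even`, the constant `FULL_MASK`, and the test data are made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper constant: a mask naming all 32 lanes of a warp.
constexpr unsigned FULL_MASK = 0xffffffffu;

// Counts how many lanes of the warp hold an even value.
__global__ void count_even(const int *in, int *out)
{
    int lane = threadIdx.x % warpSize;

    // Every lane of the warp reaches this call, so naming all lanes in
    // the mask is valid. Bit i of the result is lane i's predicate.
    unsigned votes = __ballot_sync(FULL_MASK, (in[threadIdx.x] % 2) == 0);

    if (lane == 0) {
        *out = __popc(votes);  // number of lanes whose predicate was true
    }
}

int main()
{
    int h_in[32];
    int h_out = -1;
    for (int i = 0; i < 32; ++i) h_in[i] = i;  // 0..31: 16 even values

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    count_even<<<1, 32>>>(d_in, d_out);  // exactly one full warp

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("even count = %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The deprecated `__ballot(pred)` took no mask and simply reported on whichever lanes happened to be active at that instruction; the `_sync` form makes the intended participants part of the call.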

On Volta, however, the warp execution engine will bring about the necessary synchronization/participation amongst the threads indicated in the mask, in order to make the desired operation valid (assuming the appropriate `_sync` version of the intrinsic is used). To be clear, the warp execution engine will reconverge threads that have diverged on Volta in order to match the mask; however, it will not overcome programmer-induced errors, such as preventing a thread named in the mask from participating in a `_sync()` intrinsic via conditional statements.
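The kind of programmer-induced error mentioned above can be illustrated with a hedged sketch. Here only lanes 0..15 enter a branch, so the mask used inside the branch is computed beforehand, at a point every lane reaches, and names only the lanes that will actually execute the inner intrinsic. The names `vote_in_branch`, `FULL_MASK`, and `branch_mask` are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper constant: a mask naming all 32 lanes of a warp.
constexpr unsigned FULL_MASK = 0xffffffffu;

__global__ void vote_in_branch(const int *in, unsigned *out)
{
    int lane = threadIdx.x % warpSize;
    bool takes_branch = (lane < 16);

    // Every lane reaches this call, so FULL_MASK is valid here. The
    // result has one bit set for each lane that will enter the branch.
    unsigned branch_mask = __ballot_sync(FULL_MASK, takes_branch);

    if (takes_branch) {
        // Only lanes named in branch_mask execute this intrinsic, so the
        // mask matches the set of participating lanes.
        unsigned votes = __ballot_sync(branch_mask, in[threadIdx.x] > 0);
        if (lane == 0) {
            *out = votes;
        }
        // Passing FULL_MASK here instead would name lanes 16..31, which
        // never reach this call: the error the hardware cannot repair.
    }
}

int main()
{
    int h_in[32];
    unsigned h_out = 0;
    for (int i = 0; i < 32; ++i) h_in[i] = (i % 2) ? 1 : -1;

    int *d_in = nullptr;
    unsigned *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(unsigned));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    vote_in_branch<<<1, 32>>>(d_in, d_out);  // exactly one full warp

    cudaMemcpy(&h_out, d_out, sizeof(unsigned), cudaMemcpyDeviceToHost);
    printf("votes = 0x%08x\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Deriving the mask with a full-warp ballot before the branch, rather than hard-coding it, keeps the mask and the branch condition from drifting apart if the condition later changes.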

This related question discusses the mask parameter. This answer is not intended to address all possible questions that may arise from independent thread scheduling and its impact on warp-level intrinsics. For that, I encourage reading the programming guide.
