CUB (CUDA UnBound) equivalent of thrust::gather
Question
Due to some performance issues with the Thrust library (see this page for more details), I am planning to refactor a CUDA application to use CUB instead of Thrust, specifically to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At one point in my application I need to sort 3 arrays by key. This is how I did it with Thrust:
thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather(indices, indices + numKeys,
    thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
    thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);
where
- key_iter is a thrust::device_ptr that points to the keys I want to sort by
- indices points to a sequence (from 0 to numKeys-1) in device memory
- values{1,2,3}Ptr are device_ptrs to the values I want to sort
- values{1,2,3}OutPtr are device_ptrs to the sorted values
With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. Problem is I don't see any CUB "gather-like" utilities. Suggestions?
Edit:
I suppose I could implement my own gather kernel, but is there any better way to do this other than:
template <typename Index, typename Value>
__global__ void gather_kernel(const unsigned int N, const Index * map,
                              const Value * src, Value * dst)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        dst[i] = src[map[i]];
    }
}
The non-coalesced loads and stores make me cringe, but they are probably unavoidable without a known structure on map.
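One modest refinement of the kernel above can be sketched as follows, assuming all three value arrays share the same map. Note that in a gather only the reads src[map[i]] are scattered; the reads of map[i] and the writes to dst[i] are already coalesced. The names gather3_kernel and gather3 are hypothetical, and the __restrict__ qualifiers assume the output arrays do not overlap any input.

```cuda
#include <cuda_runtime.h>

// Hypothetical fused gather: each thread reads its index once and reuses it
// for all three value arrays. Assumes dst arrays do not alias map or src arrays.
template <typename Index, typename Value>
__global__ void gather3_kernel(unsigned int n, const Index* __restrict__ map,
                               const Value* __restrict__ src1,
                               const Value* __restrict__ src2,
                               const Value* __restrict__ src3,
                               Value* __restrict__ dst1,
                               Value* __restrict__ dst2,
                               Value* __restrict__ dst3)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
    {
        const Index j = map[i];   // coalesced read, done once per element
        dst1[i] = src1[j];        // scattered read, coalesced write
        dst2[i] = src2[j];
        dst3[i] = src3[j];
    }
}

// Hypothetical host-side launcher; the block size of 256 is an arbitrary choice.
template <typename Index, typename Value>
void gather3(unsigned int n, const Index* map,
             const Value* src1, const Value* src2, const Value* src3,
             Value* dst1, Value* dst2, Value* dst3,
             cudaStream_t stream = 0)
{
    if (n == 0) return;
    const unsigned int block = 256;
    const unsigned int grid = (n + block - 1) / block;
    gather3_kernel<<<grid, block, 0, stream>>>(n, map, src1, src2, src3,
                                               dst1, dst2, dst3);
}
```

This does not make the scattered reads any cheaper; it only avoids loading the index array three times and launching three kernels.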
Recommended answer
It seems what you want to achieve depends on thrust::zip_iterator. You could either
- only replace thrust::sort_by_key with cub::DeviceRadixSort::SortPairs and keep thrust::gather, or
- zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs
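The first option can be sketched roughly as below. This is an illustration under assumptions rather than the asker's actual code: the helper name sort3_by_key, the unsigned int key type, the float value type, and the use of thrust::device_vector for all buffers (including CUB's temporary storage) are choices made for the example.

```cuda
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Hypothetical helper: sorts keys with CUB while carrying an index sequence
// as the value, then reorders three value arrays with a single thrust::gather.
void sort3_by_key(thrust::device_vector<unsigned int>& keys,
                  const thrust::device_vector<float>& v1,
                  const thrust::device_vector<float>& v2,
                  const thrust::device_vector<float>& v3,
                  thrust::device_vector<float>& o1,
                  thrust::device_vector<float>& o2,
                  thrust::device_vector<float>& o3)
{
    const int n = static_cast<int>(keys.size());
    thrust::device_vector<unsigned int> keys_out(n);
    thrust::device_vector<int> idx_in(n), idx_out(n);
    thrust::sequence(idx_in.begin(), idx_in.end());   // 0, 1, ..., n-1

    const unsigned int* d_keys_in  = thrust::raw_pointer_cast(keys.data());
    unsigned int*       d_keys_out = thrust::raw_pointer_cast(keys_out.data());
    const int*          d_idx_in   = thrust::raw_pointer_cast(idx_in.data());
    int*                d_idx_out  = thrust::raw_pointer_cast(idx_out.data());

    // First call only queries the required temporary storage size.
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(nullptr, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_idx_in, d_idx_out, n);
    thrust::device_vector<unsigned char> temp(temp_bytes);

    // Second call performs the sort.
    cub::DeviceRadixSort::SortPairs(thrust::raw_pointer_cast(temp.data()), temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_idx_in, d_idx_out, n);

    // idx_out[i] is the original position of the i-th smallest key.
    thrust::gather(idx_out.begin(), idx_out.end(),
        thrust::make_zip_iterator(thrust::make_tuple(v1.begin(), v2.begin(), v3.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(o1.begin(), o2.begin(), o3.begin())));

    keys.swap(keys_out);   // leave the sorted keys in the caller's vector
}
```

In a real application the temporary buffers would probably be allocated once and reused across calls, since repeated allocation can easily outweigh the gain from switching sort implementations.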
Update
After reading the implementation of thrust::gather in
$CUDA_HOME/include/thrust/system/detail/generic/gather.inl
you can see it is only a naive kernel like
__global__ void gather(int* index, float* in, float* out, int len) {
    int i = ...;  // global thread index
    if (i < len) { out[i] = in[index[i]]; }
}
Then I think your code above can be replaced by a single kernel without too much effort.
In this kernel, you could first use the CUB block-wise primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive re-order copy, as thrust::gather does, to fill values{1,2,3}Out.
Using SortBlockedToStriped rather than Sort makes coalesced writes possible.
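A minimal sketch of such a kernel is given below, under several assumptions that are not in the original answer: all keys fit in a single tile of one thread block (BlockRadixSort only sorts within a block, so larger inputs would still need a device-wide sort), keys are unsigned int, values are float, and every real key is smaller than the padding value 0xFFFFFFFF. The kernel name and the tile constants are hypothetical.

```cuda
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS    = 128;  // assumed block size
constexpr int ITEMS_PER_THREAD = 4;    // assumed items per thread (tile = 512 keys)

// Hypothetical single-block kernel: sorts (key, index) pairs in registers,
// then gathers three value arrays using the sorted indices.
// Launch as <<<1, BLOCK_THREADS>>> with n <= BLOCK_THREADS * ITEMS_PER_THREAD.
__global__ void sort_and_gather_tile(const unsigned int* keys, int n,
                                     const float* v1, const float* v2, const float* v3,
                                     float* o1, float* o2, float* o3)
{
    using BlockSort = cub::BlockRadixSort<unsigned int, BLOCK_THREADS,
                                          ITEMS_PER_THREAD, int>;
    __shared__ typename BlockSort::TempStorage temp;

    unsigned int k[ITEMS_PER_THREAD];
    int idx[ITEMS_PER_THREAD];

    // Load in blocked arrangement: thread t owns items t*IPT .. t*IPT + IPT-1.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        const int g = threadIdx.x * ITEMS_PER_THREAD + i;
        k[i]   = (g < n) ? keys[g] : 0xFFFFFFFFu;  // pad so extras sort to the end
        idx[i] = g;
    }

    // After this call the data is striped: item i of thread t has rank
    // i * BLOCK_THREADS + t in the sorted order.
    BlockSort(temp).SortBlockedToStriped(k, idx);

    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        const int out = i * BLOCK_THREADS + threadIdx.x;  // consecutive across threads
        const int s   = idx[i];
        if (out < n && s < n)   // s < n guards against reading through a padded slot
        {
            o1[out] = v1[s];    // scattered read, coalesced write
            o2[out] = v2[s];
            o3[out] = v3[s];
        }
    }
}
```

With the striped arrangement, neighbouring threads write neighbouring output elements in each iteration of the final loop, which is what makes the stores coalesced; with plain Sort each thread would hold a contiguous run of outputs instead.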