CUB (CUDA UnBound) equivalent of thrust::gather


Problem description

Due to some performance issues with the Thrust library (see this page for more details), I am planning to refactor a CUDA application to use CUB instead of Thrust, specifically to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At one point in my application I need to sort 3 arrays by key. This is how I did it with Thrust:

thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather(indices, indices + numKeys,
      thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
      thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);

where

  • key_iter is a thrust::device_ptr that points to the keys I want to sort by
  • indices points to a sequence (from 0 to numKeys-1) in device memory
  • values{1,2,3}Ptr are device_ptrs to the values I want to sort
  • values{1,2,3}OutPtr are device_ptrs to the sorted values

With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. The problem is that I don't see any CUB "gather-like" utilities. Any suggestions?

Edit:

I suppose I could implement my own gather kernel, but is there a better way to do this than:

template <typename Index, typename Value> 
__global__ void  gather_kernel(const unsigned int N, const Index * map, 
const Value * src, Value * dst) 
{ 
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; 
    if (i < N) 
    { 
        dst[i] = src[map[i]]; 
    } 
} 

The non-coalesced loads and stores make me cringe, but they are probably unavoidable without a known structure on map.

Recommended answer

It seems that what you want to achieve depends on thrust::zip_iterator. You could either


  1. replace only thrust::sort_by_key with cub::DeviceRadixSort::SortPairs and keep thrust::gather, or
  2. zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs
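
As a rough illustration of the first option, the sketch below sorts the keys together with an index sequence using cub::DeviceRadixSort::SortPairs and then reorders the three value arrays with a single thrust::gather over zip iterators. The helper name sort3_by_key, the unsigned int key type, and the float value type are assumptions made for this example and do not come from the question.

```cuda
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Hypothetical helper: sorts `keys` in place and writes the three value
// arrays, reordered by the same permutation, to out1..out3.
void sort3_by_key(thrust::device_vector<unsigned int>& keys,
                  const thrust::device_vector<float>& v1,
                  const thrust::device_vector<float>& v2,
                  const thrust::device_vector<float>& v3,
                  thrust::device_vector<float>& out1,
                  thrust::device_vector<float>& out2,
                  thrust::device_vector<float>& out3)
{
    const int n = static_cast<int>(keys.size());
    thrust::device_vector<unsigned int> keys_sorted(n);
    thrust::device_vector<int> idx_in(n), idx_out(n);
    thrust::sequence(idx_in.begin(), idx_in.end());  // 0, 1, ..., n-1

    const unsigned int* d_keys_in  = thrust::raw_pointer_cast(keys.data());
    unsigned int*       d_keys_out = thrust::raw_pointer_cast(keys_sorted.data());
    const int*          d_idx_in   = thrust::raw_pointer_cast(idx_in.data());
    int*                d_idx_out  = thrust::raw_pointer_cast(idx_out.data());

    // First call with a null pointer only reports the temporary storage size.
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(nullptr, temp_bytes, d_keys_in, d_keys_out,
                                    d_idx_in, d_idx_out, n);
    thrust::device_vector<unsigned char> temp(temp_bytes);

    // Second call performs the sort; idx_out receives the sorting permutation.
    cub::DeviceRadixSort::SortPairs(thrust::raw_pointer_cast(temp.data()), temp_bytes,
                                    d_keys_in, d_keys_out, d_idx_in, d_idx_out, n);

    out1.resize(n);
    out2.resize(n);
    out3.resize(n);
    // One gather moves all three value arrays through the zip iterators.
    thrust::gather(idx_out.begin(), idx_out.end(),
                   thrust::make_zip_iterator(thrust::make_tuple(v1.begin(), v2.begin(), v3.begin())),
                   thrust::make_zip_iterator(thrust::make_tuple(out1.begin(), out2.begin(), out3.begin())));
    keys.swap(keys_sorted);
}
```

The temporary storage is held in a device_vector here only so that it is released automatically; a cudaMalloc/cudaFree pair would work just as well.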



Update

After reading the implementation of thrust::gather,

$CUDA_HOME/include/thrust/system/detail/generic/gather.inl

you can see it is only a naive kernel like

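// Naive gather: each thread copies one element, out[i] = in[index[i]].
__global__ void gather(const int* index, const float* in, float* out, int len) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < len) { out[i] = in[index[i]]; }
}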

So I think your code above can be replaced by a single kernel without too much effort.

In this kernel, you could first use the CUB block-wise primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive reorder copy, as thrust::gather does, to fill values{1,2,3}Out.

Using SortBlockedToStriped rather than Sort allows the writes to be coalesced when copying the values.
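
A minimal sketch of such a kernel is given below, under a strong assumption: cub::BlockRadixSort only sorts the items held by one thread block, so this version is valid only when all numKeys items fit in a single tile of BLOCK_THREADS * ITEMS_PER_THREAD elements. The kernel name sort_gather3_block, the key and value types, and the padding scheme are illustrative choices, not part of the original answer.

```cuda
#include <cub/cub.cuh>

// Single-block sketch: sorts (key, index) pairs in registers, then gathers
// three value arrays. Assumes n <= BLOCK_THREADS * ITEMS_PER_THREAD and a
// launch with exactly one block of BLOCK_THREADS threads.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void sort_gather3_block(const unsigned int* keys, int n,
                                   const float* v1, const float* v2, const float* v3,
                                   float* out1, float* out2, float* out3)
{
    using BlockSort = cub::BlockRadixSort<unsigned int, BLOCK_THREADS, ITEMS_PER_THREAD, int>;
    __shared__ typename BlockSort::TempStorage temp;

    unsigned int k[ITEMS_PER_THREAD];
    int idx[ITEMS_PER_THREAD];

    // Blocked arrangement: thread t owns consecutive items starting at
    // t * ITEMS_PER_THREAD. Slots past n are padded with the maximum key.
    for (int j = 0; j < ITEMS_PER_THREAD; ++j) {
        int i = threadIdx.x * ITEMS_PER_THREAD + j;
        k[j]   = (i < n) ? keys[i] : 0xFFFFFFFFu;
        idx[j] = i;
    }

    // After this call item j of thread t holds the pair with sorted rank
    // j * BLOCK_THREADS + t (striped arrangement).
    BlockSort(temp).SortBlockedToStriped(k, idx);

    for (int j = 0; j < ITEMS_PER_THREAD; ++j) {
        int r = j * BLOCK_THREADS + threadIdx.x;  // output position, coalesced across threads
        int s = idx[j];                           // source position, arbitrary
        if (r < n && s < n) {                     // skip padding slots
            out1[r] = v1[s];
            out2[r] = v2[s];
            out3[r] = v3[s];
        }
    }
}
```

For example, sort_gather3_block<128, 4><<<1, 128>>>(...) would handle up to 512 keys. Larger inputs would still need a device-wide sort such as cub::DeviceRadixSort::SortPairs followed by a separate gather.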
