CUDA: how to use thrust::sort_by_key directly on the GPU?
Question
The Thrust library can be used to sort data. The call might look like this (with a keys vector and a values vector):
thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_values.begin());
called on the CPU, with d_keys and d_values being in CPU memory; and the bulk of the execution happens on the GPU.
However, what if my data is already on the GPU? How can I use the Thrust library to perform efficient sorting directly on the GPU, i.e., can I call the sort_by_key function from a kernel?
Also, my data consists of keys that are either unsigned long long int or unsigned int, and data that is always unsigned int. How should I make the Thrust call for these types?
Answer
As stated in the question Talonmies linked, you cannot call Thrust from a CUDA function (e.g. __device__ or __global__). However, this doesn't mean you can't use data you already have in device memory with Thrust. Rather, you call the desired Thrust functions from the host, using Thrust wrappers around your raw device data. For example:
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// raw pointers to device memory
unsigned int *raw_data;
unsigned int *raw_keys;
// allocate device memory for data and keys
cudaMalloc((void **) &raw_data, N_data * sizeof(unsigned int));
cudaMalloc((void **) &raw_keys, N_keys * sizeof(unsigned int));
// populate your device pointers in your kernel
kernel<<<...>>>(raw_data, raw_keys, ...);
...
// wrap the raw pointers with a device_ptr to use with Thrust functions
thrust::device_ptr<unsigned int> dev_data_ptr(raw_data);
thrust::device_ptr<unsigned int> dev_keys_ptr(raw_keys);
// use the device memory with a Thrust call
thrust::sort_by_key(dev_keys_ptr, dev_keys_ptr + N_keys, dev_data_ptr);
The memory pointed to by raw_data and raw_keys is still in device memory when you wrap it with thrust::device_ptr, so while you're calling the Thrust function from the host, it doesn't have to copy any memory from host to device or vice versa. That is, you're sorting directly on the GPU, using device memory; the only overhead is launching the Thrust kernel(s) and wrapping the raw device pointers.
And of course, you can get your raw pointers back if you need to use them in a regular CUDA kernel afterward:
unsigned int * raw_ptr = thrust::raw_pointer_cast(dev_data_ptr);
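For instance, a follow-up kernel can consume the sorted arrays through those raw pointers. A minimal sketch (the kernel name, its body, and the launch configuration are placeholders, not part of the original answer):

```
// hypothetical follow-up kernel that reads the sorted arrays in place
__global__ void consume_sorted(const unsigned int *keys,
                               const unsigned int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... operate on keys[i] / data[i], now in sorted order ...
    }
}

// unwrap the device_ptrs and pass the raw pointers to the kernel
unsigned int *sorted_keys = thrust::raw_pointer_cast(dev_keys_ptr);
unsigned int *sorted_data = thrust::raw_pointer_cast(dev_data_ptr);
consume_sorted<<<(N_keys + 255) / 256, 256>>>(sorted_keys, sorted_data, N_keys);
```

No copy happens here either; raw_pointer_cast simply recovers the address the device_ptr was constructed from.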
As for using either unsigned long long int or unsigned int as your keys, with data that's always unsigned int, this isn't a problem, as Thrust is templated. That is, the signature for sort_by_key is:
template<typename RandomAccessIterator1, typename RandomAccessIterator2>
void thrust::sort_by_key(
    RandomAccessIterator1 keys_first,
    RandomAccessIterator1 keys_last,
    RandomAccessIterator2 values_first);
meaning that you can have different types for the keys and the values. As long as all of your key types are homogeneous for a given call, Thrust should be able to infer the types automatically and you won't have to do anything special. Hopefully that makes sense.
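To make that concrete for the key/value types in the question, here is a minimal sketch with 64-bit keys and 32-bit values (the pointer names and N are placeholders; the arrays are assumed already allocated and populated on the device):

```
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// raw device pointers, assumed already allocated and populated
unsigned long long int *raw_keys; // 64-bit keys
unsigned int *raw_data;           // 32-bit values

// the wrapped element types just have to match the underlying arrays
thrust::device_ptr<unsigned long long int> keys_ptr(raw_keys);
thrust::device_ptr<unsigned int> data_ptr(raw_data);

// Thrust deduces both iterator types from the arguments;
// mixing key and value types needs no extra work
thrust::sort_by_key(keys_ptr, keys_ptr + N, data_ptr);
```

The same call works unchanged with unsigned int keys; only the device_ptr's template argument changes.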