快速CUDA推力定制比较运算符 [英] fast CUDA thrust custom comparison operator
问题描述
我正在评估CUDA,目前使用Thrust库对数字进行排序。
我想为thrust :: sort创建自己的比较器,下来drammically!
我只需从 functional.h 中复制代码即可创建自己的较少实现。
- 默认比较器:thrust :: less() - 94 ms
- 我自己的比较器:less() - 906
我使用Visual Studio 2010.我应该怎么做才能获得与选项1相同的性能?
完成代码:
#include< stdio.h>
#include< cuda.h>
#include< thrust / host_vector.h>
#include< thrust / device_vector.h>
#include< thrust / generate.h>
#include< thrust / sort.h>
int myRand()
{
static int counter = 0;
if(counter ++%10000 == 0)
srand(time(NULL)+ counter);
return(rand()<< 16)| rand();
}
template< typename T>
struct less:public thrust :: binary_function< T,T,bool>
{
__host__ __device__ bool operator()(const T& lhs,const T& rhs)const {
return lhs& rhs;
}
};
int main()
{
thrust :: host_vector< int> h_vec(10 * 1000 * 1000);
thrust :: generate(h_vec.begin(),h_vec.end(),myRand);
thrust :: device_vector< int> d_vec = h_vec;
int clc = clock();
thrust :: sort(d_vec.begin(),d_vec.end(),less< int>());
printf(%dms\\\
,(clock() - clc)* 1000 / CLOCKS_PER_SEC);
return 0;
}
解决方案性能上的差异是因为Thrust根据提供给
thrust :: sort
的参数实现不同算法的排序。
在情况1中,Thrust可以证明该排序可以在具有基数排序的线性时间中实现。这是因为要排序的数据类型是内置的数字类型(
int
),比较函数是内置的小于操作 - Thrust识别thrust :: less< int>
将产生与x <
在情况2中,Thrust不了解您的用户提供的
less< int>
,并且必须使用基于具有不同渐近复杂度的比较排序的更保守的算法,即使事实上更少< int>
code> thrust :: less< int> 。
一般来说,用户定义的比较运算符不能与更多限制性,更快的排序,其操纵数据的二进制表示,例如基数排序。在这些情况下,Thrust可以退回更一般的,但排序较慢。
I'm evaluating CUDA and currently using Thrust library to sort numbers.
I'd like to create my own comparer for thrust::sort, but it slows down drammatically! I created my own less implemetation by just copying code from functional.h. However it seems to be compiled in some other way and works very slowly.
- default comparer: thrust::less() - 94ms
- my own comparer: less() - 906ms
I'm using Visual Studio 2010. What should I do to get the same performance as at option 1?
Complete code:
#include <stdio.h> #include <cuda.h> #include <thrust/host_vector.h> #include <thrust/device_vector.h> #include <thrust/generate.h> #include <thrust/sort.h> int myRand() { static int counter = 0; if ( counter++ % 10000 == 0 ) srand(time(NULL)+counter); return (rand()<<16) | rand(); } template<typename T> struct less : public thrust::binary_function<T,T,bool> { __host__ __device__ bool operator()(const T &lhs, const T &rhs) const { return lhs < rhs; } }; int main() { thrust::host_vector<int> h_vec(10 * 1000 * 1000); thrust::generate(h_vec.begin(), h_vec.end(), myRand); thrust::device_vector<int> d_vec = h_vec; int clc = clock(); thrust::sort(d_vec.begin(), d_vec.end(), less<int>()); printf("%dms\n", (clock()-clc) * 1000 / CLOCKS_PER_SEC); return 0; }
解决方案The reason you are observing a difference in performance is because Thrust is implementing the sort with different algorithms depending on the arguments provided to
thrust::sort
.In case 1., Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data to sort is a built-in numeric type (
int
), and the comparison function is the built-in less than operation -- Thrust recognizes thatthrust::less<int>
will produce the equivalent result asx < y
.In case 2., Thrust knows nothing about your user-provided
less<int>
, and has to use a more conservative algorithm based on a comparison sort which has different asymptotic complexity, even though in truth yourless<int>
is equivalent tothrust::less<int>
.In general, user-defined comparison operators can't be used with more restrictive, faster sorts which manipulate the binary representation of data such as radix sort. In these cases, Thrust falls back on a more general, but slower sort.
这篇关于快速CUDA推力定制比较运算符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!