快速CUDA推力定制比较运算符 [英] fast CUDA thrust custom comparison operator

查看:223
本文介绍了快速CUDA推力定制比较运算符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在评估CUDA,目前使用Thrust库对数字进行排序。



我想为thrust :: sort创建自己的比较器,下来drammically!
我只需从 functional.h 中复制代码即可创建自己的较少实现。


  1. 默认比较器:thrust :: less() - 94 ms


  2. 我自己的比较器:less() - 906

    我使用Visual Studio 2010.我应该怎么做才能获得与选项1相同的性能?



    完成代码:

      #include< stdio.h> 

    #include< cuda.h>

    #include< thrust / host_vector.h>
    #include< thrust / device_vector.h>
    #include< thrust / generate.h>
    #include< thrust / sort.h>

    int myRand()
    {
    static int counter = 0;
    if(counter ++%10000 == 0)
    srand(time(NULL)+ counter);
    return(rand()<< 16)| rand();
    }

    template< typename T>
    struct less:public thrust :: binary_function< T,T,bool>
    {
    __host__ __device__ bool operator()(const T& lhs,const T& rhs)const {
    return lhs& rhs;
    }
    };

    int main()
    {
    thrust :: host_vector< int> h_vec(10 * 1000 * 1000);
    thrust :: generate(h_vec.begin(),h_vec.end(),myRand);

    thrust :: device_vector< int> d_vec = h_vec;

    int clc = clock();
    thrust :: sort(d_vec.begin(),d_vec.end(),less< int>());
    printf(%dms\\\
    ,(clock() - clc)* 1000 / CLOCKS_PER_SEC);

    return 0;
    }


    解决方案

    性能上的差异是因为Thrust根据提供给 thrust :: sort 的参数实现不同算法的排序。



    在情况1中,Thrust可以证明该排序可以在具有基数排序的线性时间中实现。这是因为要排序的数据类型是内置的数字类型( int ),比较函数是内置的小于操作 - Thrust识别 thrust :: less< int> 将产生与 x <



    在情况2中,Thrust不了解您的用户提供的 less< int> ,并且必须使用基于具有不同渐近复杂度的比较排序的更保守的算法,即使事实上更少< int> code> thrust :: less< int>



    一般来说,用户定义的比较运算符不能与更多限制性,更快的排序,其操纵数据的二进制表示,例如基数排序。在这些情况下,Thrust可以退回更一般的,但排序较慢。


    I'm evaluating CUDA and currently using Thrust library to sort numbers.

    I'd like to create my own comparer for thrust::sort, but it slows down drammatically! I created my own less implemetation by just copying code from functional.h. However it seems to be compiled in some other way and works very slowly.

    1. default comparer: thrust::less() - 94ms
    2. my own comparer: less() - 906ms

    I'm using Visual Studio 2010. What should I do to get the same performance as at option 1?

    Complete code:

    #include <stdio.h>
    
    #include <cuda.h>
    
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    
    int myRand()
    {
            static int counter = 0;
            if ( counter++ % 10000 == 0 )
                    srand(time(NULL)+counter);
            return (rand()<<16) | rand();
    }
    
    template<typename T>
    struct less : public thrust::binary_function<T,T,bool>
    {
      __host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
         return lhs < rhs;
      }
    }; 
    
    int main()
    {
        thrust::host_vector<int> h_vec(10 * 1000 * 1000);
        thrust::generate(h_vec.begin(), h_vec.end(), myRand);
    
        thrust::device_vector<int> d_vec = h_vec;
    
        int clc = clock();
        thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
        printf("%dms\n", (clock()-clc) * 1000 / CLOCKS_PER_SEC);
    
        return 0;
    }
    

    解决方案

    The reason you are observing a difference in performance is because Thrust is implementing the sort with different algorithms depending on the arguments provided to thrust::sort.

    In case 1., Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data to sort is a built-in numeric type (int), and the comparison function is the built-in less than operation -- Thrust recognizes that thrust::less<int> will produce the equivalent result as x < y.

    In case 2., Thrust knows nothing about your user-provided less<int>, and has to use a more conservative algorithm based on a comparison sort which has different asymptotic complexity, even though in truth your less<int> is equivalent to thrust::less<int>.

    In general, user-defined comparison operators can't be used with more restrictive, faster sorts which manipulate the binary representation of data such as radix sort. In these cases, Thrust falls back on a more general, but slower sort.

    这篇关于快速CUDA推力定制比较运算符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆