快速 CUDA 推力自定义比较运算符 [英] fast CUDA thrust custom comparison operator

查看：59 发布时间：2022/1/10 15:54:21 cuda thrust

本文介绍了快速 CUDA 推力自定义比较运算符的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在评估 CUDA，目前正在使用 Thrust 库对数字进行排序.

I'm evaluating CUDA and currently using Thrust library to sort numbers.

我想为推力::排序创建自己的比较器，但它的速度大大减慢！我通过从 functional.h 复制代码创建了自己的 less 实现.但是，它似乎以其他方式编译并且运行速度非常慢.

I'd like to create my own comparer for thrust::sort, but it slows down drammatically! I created my own less implemetation by just copying code from functional.h. However it seems to be compiled in some other way and works very slowly.

默认比较器:thrust::less() - 94ms
我自己的比较器:less() - 906ms

我使用的是 Visual Studio 2010.我应该怎么做才能获得与选项 1 相同的性能?

I'm using Visual Studio 2010. What should I do to get the same performance as at option 1?

完整代码:

#include <stdio.h>

#include <cuda.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>

int myRand()
{
        static int counter = 0;
        if ( counter++ % 10000 == 0 )
                srand(time(NULL)+counter);
        return (rand()<<16) | rand();
}

template<typename T>
struct less : public thrust::binary_function<T,T,bool>
{
  __host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
     return lhs < rhs;
  }
}; 

int main()
{
    thrust::host_vector<int> h_vec(10 * 1000 * 1000);
    thrust::generate(h_vec.begin(), h_vec.end(), myRand);

    thrust::device_vector<int> d_vec = h_vec;

    int clc = clock();
    thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
    printf("%dms
", (clock()-clc) * 1000 / CLOCKS_PER_SEC);

    return 0;
}

推荐答案

您观察到性能差异的原因是 Thrust 使用不同的算法实现排序，具体取决于提供给 thrust::sort.


The reason you are observing a difference in performance is because Thrust is implementing the sort with different algorithms depending on the arguments provided to thrust::sort.
在情况 1. 中，Thrust 可以证明可以使用基数排序在线性时间内实现排序.这是因为要排序的数据类型是内置的数值类型(int)，比较函数是内置的小于运算——Thrust识别出thrust::less<int> 将产生与 x < 相同的结果.是的.
In case 1., Thrust can prove that the sort can be implemented in linear time with a radix sort.  This is because the type of the data to sort is a built-in numeric type (int), and the comparison function is the built-in less than operation -- Thrust recognizes that thrust::less<int> will produce the equivalent result as x < y.
在情况 2. 中，Thrust 对用户提供的 less<int> 一无所知，并且必须使用基于具有不同渐近复杂度的比较排序的更保守算法，即使在事实上，你的 less<int> 等价于 thrust::less<int>.
In case 2., Thrust knows nothing about your user-provided less<int>, and has to use a more conservative algorithm based on a comparison sort which has different asymptotic complexity, even though in truth your less<int> is equivalent to thrust::less<int>.
一般来说，用户定义的比较运算符不能与更严格、更快的排序一起使用，这些排序操作数据的二进制表示，例如基数排序.在这些情况下，Thrust 会使用更通用但更慢的排序方式.
In general, user-defined comparison operators can't be used with more restrictive, faster sorts which manipulate the binary representation of data such as radix sort. In these cases, Thrust falls back on a more general, but slower sort.

                        这篇关于快速 CUDA 推力自定义比较运算符的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

快速 CUDA 推力自定义比较运算符 [英] fast CUDA thrust custom comparison operator

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

快速 CUDA 推力自定义比较运算符 [英] fast CUDA thrust custom comparison operator

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭