如何快速推力::排序和什么是最快的基数排序实现 [英] how fast is thrust::sort and what is the fastest radix sort implementation

查看:390
本文介绍了如何快速推力::排序和什么是最快的基数排序实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是GPU编程的新手。最近,我试图实现基于教程的gpu bvh构建算法: http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/ 。在该算法的第一步中,计算和排序每个基元的密码(unsigned int)。本教程给出了12K对象的morton代码计算和排序的参考时间成本:

I'm a newbie to GPU programming. Recently, I'm trying to implement the gpu bvh construction algorithm based on an tutorial: http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/. In the first step of this algorithm, the morton code(unsigned int) of every primitive is computed and sorted. The tutorial gives a reference time cost of the morton code computing and sorting for 12K objects:


  1. 0.02 ms,每个对象一个线程:计算边界

  2. 0.18 ms,平行基数排序:根据对象的Morton代码对对象进行排序。
    ...



在我的实现中,第一步的成本为0.1ms,排序步骤的成本为1.8ms。我使用推力做排序。那么什么是GPU上基数排序的最快实现?

In my implementation, the first step cost 0.1ms and the sorting step costs 1.8ms. I'm using thrust to do the sorting. So what is the fastest implementation of radix sort on GPU?

我使用的Geforce Titan GPU应该比本教程作者使用的Geforce GTX690更快。
这是我的测试代码排序,即使大小为10,它花费大约1.5ms。

I'm using a Geforce Titan GPU which should faster than the Geforce GTX690 used by the author of the tutorial. Here is my test code for sorting, it costs about 1.5ms even when the size is 10.

void testSort()
{
    int sz = 10;
    thrust::host_vector<unsigned int> h_keys(sz);
    for(int i=0; i<sz; i++)
        h_keys[i] = rand();
    thrust::device_ptr<unsigned int> keys = thrust::device_malloc<unsigned int>(sz);
    thrust::copy(h_keys.begin(),h_keys.end(),keys);
    cudaEvent_t estart, estop;
    cudaEventCreate( &estart );
    cudaEventCreate( &estop );
    cudaEventRecord( estart, 0 );
    thrust::stable_sort(keys,keys+sz);
    cudaEventRecord( estop, 0 ) ;
    cudaEventSynchronize( estop );
    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime,
        estart, estop ) ;
    printf( "Time to sort: %3.1f ms\n", elapsedTime );
    cudaEventDestroy( estart ) ;
    cudaEventDestroy( estop ) ;
}


推荐答案

for GPGPU by back40computing 。他们提供了一个性能比较图表,他们声称他们的实施是最快的。

There is a Radix sort implementation for GPGPU by back40computing. They provide a performance comparison chart with which they claim that their implementation is the fastest.

这篇关于如何快速推力::排序和什么是最快的基数排序实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆