How to use Thrust to sort the rows of a matrix?


Question


I have a 5000x500 matrix and I want to sort each row separately with CUDA. I can use ArrayFire, but that is just a for loop over thrust::sort, which should not be efficient.

https://github.com/arrayfire/arrayfire/blob/devel/src/backend/cuda/kernel/sort.hpp

for(dim_type w = 0; w < val.dims[3]; w++) {
    dim_type valW = w * val.strides[3];
    for(dim_type z = 0; z < val.dims[2]; z++) {
        dim_type valWZ = valW + z * val.strides[2];
        for(dim_type y = 0; y < val.dims[1]; y++) {

            dim_type valOffset = valWZ + y * val.strides[1];

            if(isAscending) {
                thrust::sort(val_ptr + valOffset, val_ptr + valOffset + val.dims[0]);
            } else {
                thrust::sort(val_ptr + valOffset, val_ptr + valOffset + val.dims[0],
                             thrust::greater<T>());
            }
        }
    }
}


Is there a way to fuse operations in thrust so as to have the sorts run in parallel? Indeed, what I am looking for is a generic way to fuse for-loop iterations into a single parallel operation.

Answer


I can think of 2 possibilities, one of which is suggested already by @JaredHoberock. I don't know of a general methodology to fuse for-loop iterations in thrust, but the second method is the more general approach. My guess is that the first method would be the faster of the two approaches, in this case.


  1. Use a vectorized sort. If the regions to be sorted by your nested for-loops are non-overlapping, you can do a vectorized sort using 2 back-to-back stable-sort operations as discussed here (a small worked example of why this works follows this list).


  2. Thrust v1.8 (available with CUDA 7 RC, or via direct download from the thrust github repository) includes support for nesting thrust algorithms, by including a thrust algorithm call within a custom functor passed to another thrust algorithm. If you use the thrust::for_each operation to select the individual sorts you need to perform, you can run those sorts with a single thrust algorithm call, by including the thrust::sort operation in the functor you pass to thrust::for_each.
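
To see why the two back-to-back stable sorts of method 1 yield a per-row sort, consider a small illustration (mine, not part of the original answer): two rows of 3 values each, with one segment key per element marking the row it belongs to:

    values:   [3 1 2 | 2 9 1]    segments: [0 0 0 | 1 1 1]

    stable-sort by value, carrying the segment keys along:
    values:   [1 1 2 2 3 9]      segments: [0 1 0 1 0 1]

    stable-sort by segment key, carrying the values along:
    segments: [0 0 0 1 1 1]      values:   [1 2 3 | 1 2 9]

Because the second sort is stable, the values within each segment keep the ascending order established by the first sort, so every row ends up individually sorted.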



Here's a fully worked comparison between 3 methods:


  1. the original sort-in-a-loop method

  2. vectorized / batched sort

  3. nested sort

In each case, we're sorting the same 16000 sets of 1000 ints each.

$ cat t617.cu
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>

#define NSORTS 16000
#define DSIZE 1000

int my_mod_start = 0;
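// host-side generator for segment keys: element i belongs to segment i/DSIZE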
int my_mod(){
  return (my_mod_start++)/DSIZE;
}

bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2){
  return thrust::equal(d1.begin(), d1.end(), d2.begin());
}


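// functor for the nested-sort approach: each call sorts one DSIZE-element segment in place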
struct sort_functor
{
  thrust::device_ptr<int> data;
  int dsize;
  __host__ __device__
  void operator()(int start_idx)
  {
    thrust::sort(thrust::device, data+(dsize*start_idx), data+(dsize*(start_idx+1)));
  }
};



#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

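// wall-clock timer: returns microseconds elapsed since 'start'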
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main(){
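  // enlarge the device heap: the nested method's device-side sorts allocate temporary storage on the device (see note 4 below)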
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16*DSIZE*NSORTS));
  thrust::host_vector<int> h_data(DSIZE*NSORTS);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  thrust::device_vector<int> d_data = h_data;

  // method 1: time the naive sort-in-a-loop approach
  thrust::device_vector<int> d_result1 = d_data;
  thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
  unsigned long long mytime = dtime_usec(0);
  for (int i = 0; i < NSORTS; i++)
    thrust::sort(r1ptr+(i*DSIZE), r1ptr+((i+1)*DSIZE));
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "loop time: " << mytime/(float)USECPSEC << "s" << std::endl;

  // method 2: vectorized / batched sort (two back-to-back stable sorts)
  thrust::device_vector<int> d_result2 = d_data;
  thrust::host_vector<int> h_segments(DSIZE*NSORTS);
  thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
  thrust::device_vector<int> d_segments = h_segments;
  mytime = dtime_usec(0);
  thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
  thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "vectorized time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;
  // method 3: nested sort (thrust::sort inside a thrust::for_each functor)
  thrust::device_vector<int> d_result3 = d_data;
  sort_functor f = {d_result3.data(), DSIZE};
  thrust::device_vector<int> idxs(NSORTS);
  thrust::sequence(idxs.begin(), idxs.end());
  mytime = dtime_usec(0);
  thrust::for_each(idxs.begin(), idxs.end(), f);
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "nested time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
  return 0;
}
$ nvcc -arch=sm_20 -std=c++11 -o t617 t617.cu
$ ./t617
loop time: 8.51577s
vectorized time: 0.068802s
nested time: 0.567959s
$


Notes:



  1. These results will vary significantly from GPU to GPU.
  2. The "nested" time/method may vary significantly on a GPU that can support dynamic parallelism, as this will affect how thrust runs the nested sort functions. To test with dynamic parallelism, change the compile switches from -arch=sm_20 to -arch=sm_35 -rdc=true -lcudadevrt.
  3. This code requires CUDA 7 RC. I used Fedora 20.
  4. The nested sort method will also allocate from the device side; therefore, we must substantially increase the device allocation heap using cudaDeviceSetLimit.
  5. If you are using dynamic parallelism, and depending on the type of GPU you are running on, the amount of memory reserved with cudaDeviceSetLimit may need to be increased perhaps by an additional factor of 8.
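
As a follow-up to the vectorized method: for the OP's 5000x500 row-major matrix, the segment keys can also be built directly on the device, avoiding the host-side generate-and-copy used in the benchmark above. Here is a minimal sketch, assuming a row-major layout; the row_index functor and sort_rows wrapper are my own illustrative names, not from the original answer:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// hypothetical helper (not from the answer above): maps a linear index
// to its row number in a row-major rows x cols matrix
struct row_index {
  int cols;
  row_index(int c) : cols(c) {}
  __host__ __device__ int operator()(int i) const { return i / cols; }
};

// sort each row of a row-major matrix held in d_data
void sort_rows(thrust::device_vector<int> &d_data, int rows, int cols) {
  // one segment key per element, generated entirely on the device
  thrust::device_vector<int> d_segments(rows * cols);
  thrust::transform(thrust::make_counting_iterator(0),
                    thrust::make_counting_iterator(rows * cols),
                    d_segments.begin(), row_index(cols));
  // back-to-back stable sorts, exactly as in the vectorized method above
  thrust::stable_sort_by_key(d_data.begin(), d_data.end(), d_segments.begin());
  thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_data.begin());
}

This avoids staging the keys in host memory and copying them to the device.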

