Sorting (small) arrays by key in CUDA

Question

I'm trying to write a function that takes a block of unsorted key/value pairs such as

<7, 4>
<2, 8>
<3, 1>
<2, 2>
<1, 5>
<7, 1>
<3, 8>
<7, 2>

and sorts them by key while reducing the values of pairs with the same key:

<1, 5>
<2, 10>
<3, 9>
<7, 7>
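
To check a device implementation against this specification, a host-side reference can be handy. The following is a minimal sketch in plain C++, not a GPU implementation; the function name `sortAndReduceReference` is hypothetical and chosen only for illustration. It relies on `std::map` iterating in ascending key order.

```cpp
#include <map>
#include <utility>
#include <vector>

// Host-side reference (hypothetical helper, not a GPU implementation):
// sums the values of equal keys and returns the pairs ordered by key.
std::vector<std::pair<int, int>> sortAndReduceReference(
    const std::vector<std::pair<int, int>>& records) {
  std::map<int, int> sums;  // std::map keeps its keys in ascending order
  for (const auto& r : records) {
    sums[r.first] += r.second;  // a missing key starts from 0
  }
  // Copy the (key, sum) entries out in key order.
  return std::vector<std::pair<int, int>>(sums.begin(), sums.end());
}
```

For the eight pairs listed above, this returns <1, 5>, <2, 10>, <3, 9>, <7, 7>, which matches the expected output.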

Currently, I'm using a __device__ function like the one below. It is essentially a bitonic sort that combines the values of pairs with the same key and overwrites the stale entry with an infinitely large value (just 99 for now), so that a subsequent bitonic sort sifts those entries to the bottom and the array can be truncated by the count stored in *removed.

__device__ void interBitonicSortReduce(int2 *sdata, int tid, int recordNum, int *removed) {
  int n = MIN(DEFAULT_DIMBLOCK, recordNum);
  for (int k = 2; k <= n; k *= 2) {
    for (int j = k / 2; j > 0; j /= 2) {
      int ixj = tid ^ j;
      if (ixj > tid) {
        if (sdata[tid].x == sdata[ixj].x && sdata[tid].x < 99) {
          atomicAdd(&sdata[tid].y, sdata[ixj].y);
          sdata[ixj].x = 99; 
          sdata[ixj].y = 99; 
          atomicAdd(removed, 1); 
        }   
        if ((tid & k) == 0 && sdata[tid].x > sdata[ixj].x)
          swapData2(sdata[tid], sdata[ixj]);
        if ((tid & k) != 0 && sdata[tid].x < sdata[ixj].x)
          swapData2(sdata[tid], sdata[ixj]);
        __syncthreads();
      }   
    }   
  }
}

This works just fine for small sets of data but with larger sets (though still within the size of a single block) a single call just won't do it.

Is it wise to try to combine the sorting and the reduction in the same function? Obviously the function would need to be called more than once but is it possible to determine exactly how many times it needs to be called to exhaust all the data based on its size?

Or should I perform the reduction separately with something like this:

__device__ int interReduce(int2 *sdata, int tid) {
  int index = tid;
  while (sdata[index].x == sdata[tid].x) {
    index--;
    if (index < 0)
      break;
  }
  if (index+1 != tid) {
    atomicAdd(&sdata[index+1].y, sdata[tid].y);
    sdata[tid].x = 99;
    sdata[tid].y = 99;
    return 1;
  }
  return 0;
}

I'm trying to come up with the most efficient solution, but my experience with CUDA and parallel algorithms is limited.

Recommended answer

You can use thrust to do this.

Use thrust::sort_by_key followed by thrust::reduce_by_key. Here is an example:

#include <iostream>
#include <iterator>  // std::ostream_iterator
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/pair.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

#define N 12
typedef thrust::device_vector<int>::iterator dintiter;
int main(){

  thrust::device_vector<int> keys(N);
  thrust::device_vector<int> values(N);
  thrust::device_vector<int> new_keys(N);
  thrust::device_vector<int> new_values(N);
  thrust::sequence(keys.begin(), keys.end());
  thrust::sequence(values.begin(), values.end());

  keys[3] = 1;
  keys[9] = 1;
  keys[8] = 2;
  keys[7] = 4;

  thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
  thrust::pair<dintiter, dintiter> new_end;
  new_end = thrust::reduce_by_key(keys.begin(), keys.end(), values.begin(), new_keys.begin(), new_values.begin());

  std::cout << "results  values:" << std::endl;
  thrust::copy(new_values.begin(), new_end.second, std::ostream_iterator<int>( std::cout, " "));
  std::cout << std::endl << "results keys:" << std::endl;
  thrust::copy(new_keys.begin(), new_end.first, std::ostream_iterator<int>( std::cout, " "));
  std::cout << std::endl;

  return 0;
}
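
The sort has to come first because thrust::reduce_by_key only merges runs of consecutive equal keys. As a rough illustration of that run-collapsing step, here is a sequential host-side sketch in plain C++. It is not Thrust's implementation, and the name `reduceAdjacentByKey` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sequential sketch (hypothetical helper) of collapsing runs of adjacent
// equal keys: each run becomes one output key with the sum of its values.
// Returns the number of output pairs.
std::size_t reduceAdjacentByKey(const std::vector<int>& keys,
                                const std::vector<int>& values,
                                std::vector<int>& outKeys,
                                std::vector<int>& outValues) {
  outKeys.clear();
  outValues.clear();
  for (std::size_t i = 0; i < keys.size() && i < values.size(); ++i) {
    if (!outKeys.empty() && outKeys.back() == keys[i]) {
      outValues.back() += values[i];  // same run: accumulate
    } else {
      outKeys.push_back(keys[i]);     // a new run starts here
      outValues.push_back(values[i]);
    }
  }
  return outKeys.size();
}
```

In the example above, the sorted keys are 0 1 1 1 2 2 4 4 5 6 10 11, so the reduction yields keys 0 1 2 4 5 6 10 11 with values 0 13 10 11 5 6 10 11. On unsorted input such as keys 2 1 2, the two 2s are not adjacent and would stay separate, which is why the sort is not optional.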
