制作关键出现的次数用大小完全相同CUDA /推力 [英] Making the number of key occurances equal using CUDA / Thrust

查看：207 发布时间：2016/6/3 21:31:17 c++ arrays cuda gpu thrust

本文介绍了制作关键出现的次数用大小完全相同CUDA /推力的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

有没有采取排序的键/值对数组，并确保每个按键都有使用CUDA推力库元素相等数量的有效途径？

例如，假设我们有下面的两个数组：

  ID：1 2 2 3 3 3
VN：6 7 8 5 7 8

如果我们想有两个每个键的出现，这将是结果是：

  ID：2 2 3 3
VN：7 8 5 7

实际的阵列会大很多，包含数百万元以上。我能够做到这一点使用嵌套的for循环很容易，但我想知道是否有对使用GPU的转换阵列更有效的方式。推力好像它可能是有用的，但我没有看到任何明显的功能使用。

感谢您的帮助！

解决方案

警告：如果这是您计划在GPU上做的唯一的操作，我不会推荐它。将数据复制到/从GPU将有可能从使用GPU大于任何可能的效率/性能优势的成本。

编辑：的基础上的评论，该序列门槛可能比2更长的时间，我会建议一个替代方法（方法2）应该是更有效的比for循环或者穷举法（方法1）。

在一般我会放这个问题在一个名为流压实类别。流压实一般指的是把数据序列并将其减少至较小的数据序列。

如果我们在推流压实面积看，可以作出一种算法对于这个问题是工作的推力:: copy_if（）的（特别是，为方便起见，将采用一个模版阵列版本）。

方法1：

要想想并行这个问题，我们必须扪心自问，在什么条件下应该给定元素从输入到输出复制？如果我们能正式这样的逻辑，我们可以构造一个推力函子，我们可以传递给推力:: copy_if 来指导它作为复制的元素。

对于给定的元素，序列长度= 2的情况下，我们可以构造一个完整的逻辑，如果我们知道：

元素

元素一个位置向右

元素一个地方向左

元素两地向左

根据上述情况，我们将需要拿出那些其中任何物品2,3或4以上都是不确定的因素特殊情况的逻辑。

忽略的特殊情况下，如果我们知道上面的4个项目，那么我们可以构造必要的逻辑如下：

如果该元素添加到我的左边是和我一样，但是元件两个地方的左边是不同的，那么我属于在输出

如果该元素添加到我的左边是比我不同，但该元素在我右边的是和我一样，我属于在输出

否则，我不输出属于

我将让你来构建必要的逻辑的特殊情况。（或从我所提供的code反向工程）。

方法二：

有关长序列，方法1或方法1中的逻辑的一个for循环变体将产生至少1读出的数据每序列长度的元件设置的。对于长序列（例如，2000），这将是低效率的。因此，另一种可行的方法是如下：

生成一个<一个href=\"http://thrust.github.io/doc/group__segmented$p$pfixsums.html#ga115ffd69c3e45736772cbdf2b74b9097\"相对=nofollow> exclusive_scan_by_key 在正向和反向的方向，使用ID值作为键和推力:: constant_iterator （值= 1）作为用于扫描的值。对于给定的数据集，创建这样的中间结果：
```
  ID：1 2 2 3 3 3
VN：6 7 8 5 7 8
FS：0 0 1 0 1 2
RS：0 1 0 2 1 0
 
```

其中Fs和RS是向前的结果和反向扫描逐键。我们产生反向扫描使用（RS） .rbegin（）和 .rend（）的反向迭代器的。注意，这有许多工作要做既为反向扫描的输入和输出，以便产生如上述的RS序列

<醇开始=2>

我们的推力:: copy_if的逻辑仿函数就变成非常简单。该元素的RS和FS的值对于给定的元件，若<青霉>总和的是大于或等于所需的最小序列长度（-1占独占扫描操作）的和的FS的值小于所需的最小序列长度，则该元素属于在输出中。

下面是两种方法的整个例子，使用给定的数据，序列长度2：

  $猫t1095.cu
＃包括LT＆;推力/ device_vector.h＆GT;
＃包括LT＆;推力/ copy.h＆GT;
＃包括LT＆;推力/迭代器/ counting_iterator.h＆GT;
＃包括LT＆;推力/迭代器/ zip_iterator.h＆GT;
＃包括LT＆;＆iostream的GT;＃包括LT＆;推力/ scan.h＆GT;
＃包括LT＆;推力/迭代器/ constant_iterator.h＆GT;结构copy_func
{
  INT * D;
  INT DSIZE，R，L，M，L2;
  copy_func为（int * _d，诠释_dsize）：D（_D），DSIZE（_dsize）{};
  __host__ __device__
  布尔运算符（）（INT IDX）
  {
    M = D [IDX];
    //处理典型案例
    //这个逻辑可以通过一个for循环被替换为任意长度的序列
    如果（（IDX→1）及及（IDX＆下; DSIZE-1））{
      R = D [idx的+ 1〕;
      L = D [idx的-1];
      L2 = D [idx的-2];
      如果（（R == M）放大器;＆安培;！（M = 1））返回true;
      如果（（L == M）放大器;及（M = 12）！）返回true;
      返回false;}
    //处理特殊情况
    如果（IDX == 0）{
      R = D [idx的+ 1〕;
      回归（R = = M）;}
    如果（IDX == 1）{
      R = D [idx的+ 1〕;
      L = D [idx的-1];
      如果（L == M）返回true;
      否则，如果（R == M）返回true;
      返回false;}
    如果（IDX == DSIZE-1）{
      L = D [idx的-1];
      L2 = D [idx的-2];
      如果（（M == 1）放大器;＆安培;！（M = 12））返回true;
      返回false;}
    //可以把断言（0）在这里，不应该到这里来
    返回false;
  }
};结构copy_func2
{
  INT反复做;
  copy_func2（INT _thresh）：脱粒（_thresh）{};
  模板＆LT; typename的T＆GT;
  __host__ __device__
  布尔运算符（）（T T）{
    收益率（（（推力::获得℃的＆GT;（T）+推力::获得＆LT; 1＆GT;（T））＆GT; =（脱粒-1））及及（推力::获得℃的＆GT;（T） ＆LT;脱粒））;
  }
};诠释主（）{  const int的length_threshold = 2;
  INT ID [] = {1,2,2,3,3,3};
  INT VN [] = {6,7,8,5,7,8};
  INT DSIZE = sizeof的（ID）/的sizeof（int）;在
  //我们假设DSIZE＆GT; 3
  推力:: device_vector＆LT;＆诠释GT;编号（ID，ID + DSIZE）;
  推力:: device_vector＆LT;＆诠释GT; VN（VN，VN + DSIZE）;  推力:: device_vector＆LT;＆诠释GT; RES_ID（DSIZE）;
  推力:: device_vector＆LT;＆诠释GT; res_vn（DSIZE）;
  推力:: counting_iterator＆LT;＆诠释GT; IDX（0）;  //方法1：2序列长度阈  INT RSIZE =推力:: copy_if（推力:: make_zip_iterator（推力:: make_tuple（id.begin（），vn.begin（））），推力:: make_zip_iterator（推力:: make_tuple（id.end（），VN。端（））），IDX，推力:: make_zip_iterator（推力:: make_tuple（res_id.begin（），res_vn.begin（））），copy_func（推力:: raw_pointer_cast（id.data（）），DSIZE）） - 推力:: make_zip_iterator（推力:: make_tuple（res_id.begin（），res_vn.begin（）））;  性病::法院LT＆;＆LT; ID： ;
  推力:: copy_n（res_id.begin（），RSIZE，的std :: ostream_iterator＆LT; INT＆GT;（STD ::法院））;
  性病::法院LT＆;＆LT;的std :: ENDL＆LT;＆LT; VN;
  推力:: copy_n（res_vn.begin（），RSIZE，的std :: ostream_iterator＆LT; INT＆GT;（STD ::法院））;
  性病::法院LT＆;＆LT;的std :: ENDL;  //方法2：任意序列长度阈值
  推力:: device_vector＆LT;＆诠释GT; res_fs（DSIZE）;
  推力:: device_vector＆LT;＆诠释GT; res_rs（DSIZE）;
  推力:: exclusive_scan_by_key（id.begin（），id.end（），推力:: constant_iterator＆所述; INT＆GT;（1），res_fs.begin（））;
  推力:: exclusive_scan_by_key（id.rbegin（），id.rend（），推力:: constant_iterator＆所述; INT＆GT;（1），res_rs.begin（））;
  RSIZE =推力:: copy_if（推力:: make_zip_iterator（推力:: make_tuple（id.begin（），vn.begin（））），推力:: make_zip_iterator（推力:: make_tuple（id.end（），vn.end （））），推力:: make_zip_iterator（推力:: make_tuple（res_fs.begin（），res_rs.rbegin（））），推力:: make_zip_iterator（推力:: make_tuple（res_id.begin（），res_vn.begin（） ）），copy_func2（length_threshold）） - 推力:: make_zip_iterator（推力:: make_tuple（res_id.begin（），res_vn.begin（）））;  性病::法院LT＆;＆LT; ID： ;
  推力:: copy_n（res_id.begin（），RSIZE，的std :: ostream_iterator＆LT; INT＆GT;（STD ::法院））;
  性病::法院LT＆;＆LT;的std :: ENDL＆LT;＆LT; VN;
  推力:: copy_n（res_vn.begin（），RSIZE，的std :: ostream_iterator＆LT; INT＆GT;（STD ::法院））;
  性病::法院LT＆;＆LT;的std :: ENDL;
  返回0;
}$ NVCC -o t1095 t1095.cu
$ ./t1095
编号：2 2 3 3
VN：7 8 5 7
编号：2 2 3 3
VN：7 8 5 7

注：

的
copy_func 实现为方法1，接收该元素的索引（通过模版）一个给定的元件的测试逻辑，以及一个指针在设备上的 ID 数据，并且该数据的大小，通过算符初始化参数。变量研究， M ，→和 12 引用单元我的权利，我的元素我左边，和元素两地在我左边，分别为。

我们传递一个指向 ID 数据到仿函数。这允许仿函数来检索（最多），用于测试逻辑4必需的元件。这避免了推力:: zip_iterator的凌乱建设提供所有这些值。注意，在仿这些元素的读取应该很好地聚结，因此，是相当有效的，并且还从缓存中受益。

我并不认为这是无缺陷。我想我得到了测试逻辑正确，但它是可能的，我没有。你应该核实code的那部分的逻辑正确性，至少。我的目的不是给你一个黑箱件code，但来演示如何思考这个问题你的方式。

这做法可能会得到比2.更长的密钥序列，在这种情况下，我建议的方法2.（如果你已经有循环顺序实现必要的逻辑，你可以放弃繁琐修改该版本到方法1仿函数更长的键序列。这样的for循环应该可能仍然从缓存中聚结访问和邻近的访问中获益。）

Is there an efficient way to take a sorted key/value array pair and ensure that each key has an equal number of elements using the CUDA Thrust library?

For instance, assume we have the following pair of arrays:

ID: 1 2 2 3 3 3
VN: 6 7 8 5 7 8

If we want to have two of each key appear, this would be the result:

ID: 2 2 3 3
VN: 7 8 5 7

The actual arrays will be much larger, containing millions of elements or more. I'm able to do this using nested for-loops easily, but I'm interested in knowing whether or not there's a more efficient way to convert the arrays using a GPU. Thrust seems as though it may be useful, but I don't see any obvious functions to use.

Thank you for your help!

解决方案

Caveat: If this is the only operation you plan to do on the GPU, I would not recommend it. The cost to copy the data to/from the GPU will likely outweigh any possible efficiency/performance benefit from using the GPU.

EDIT: based on the comments that the sequence threshold is likely to be much longer than 2, I'll suggest an alternate method (method 2) that should be more efficient than a for-loop or brute-force method (method 1).

In general I would place this problem in a category called stream compaction. Stream compaction generally refers to taking a sequence of data and reducing it to a smaller sequence of data.

If we look in the thrust stream compaction area, an algorithm that could be made to work for this problem is thrust::copy_if() (in particular, for convenience, the version that takes a stencil array).

method 1:

To think about this problem in parallel, we must ask ourselves under what condition should a given element be copied from the input to the output? If we can formalize this logic, we can construct a thrust functor which we can pass to thrust::copy_if to instruct it as to which elements to copy.

For a given element, for the sequence length = 2 case, we can construct a complete logic if we know:

the element
the element one place to the right
the element one place to the left
the element two places to the left

Based on the above, we will need to come up with "special case" logic for those elements for which any of the items 2,3, or 4 above are undefined.

Ignoring the special cases, if we know the above 4 items, then we can construct the necessary logic as follows:

If the element to my left is the same as me, but the element two places to the left is different, then I belong in the output
If the element to my left is different than me, but the element to my right is the same as me, I belong in the output
Otherwise, I don't belong in the output

I'll leave it to you to construct the necessary logic for the special cases. (Or reverse-engineer it from the code I've provided).

method 2:

For long sequences, method 1 or a for-loop variant of the logic in method 1 will generate at least 1 read of the data set per element of the sequence length. For a long sequence (e.g. 2000) this will be inefficient. Therefore another possible approach would be as follows:

Generate an exclusive_scan_by_key in both forward and reverse directions, using the ID values as the key, and a thrust::constant_iterator (value=1) as the values for the scan. For the given data set, that creates intermediate results like this:
```
ID: 1 2 2 3 3 3
VN: 6 7 8 5 7 8
FS: 0 0 1 0 1 2
RS: 0 1 0 2 1 0
```

where FS and RS are the results of the forward and reverse scan-by-key. We generate the reverse scan (RS) using .rbegin() and .rend() reverse iterators. Note that this has to be done both for the reverse scan input and output, in order to generate the RS sequence as above.

The logic for our thrust::copy_if functor then becomes fairly simple. For a given element, if the sum of the RS and FS value for that element is greater than or equal to the desired minimum sequence length (-1 to account for exclusive scan operation) and the FS value is less than the desired minimum sequence length, then that element belongs in the output.

Here's a fully worked example of both methods, using the given data, for sequence length 2:

$ cat t1095.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>

#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>

struct copy_func
{
  int *d;
  int dsize, r, l, m, l2;
  copy_func(int *_d, int _dsize) : d(_d),dsize(_dsize) {};
  __host__ __device__
  bool operator()(int idx)
  {
    m = d[idx];
    // handle typical case
    // this logic could be replaced by a for-loop for sequences of arbitrary length
    if ((idx > 1) && (idx < dsize-1)){
      r = d[idx+1];
      l = d[idx-1];
      l2 = d[idx-2];
      if ((r == m) && (m != l))  return true;
      if ((l == m) && (m != l2)) return true;
      return false;}
    // handle special cases
    if (idx == 0){
      r = d[idx+1];
      return (r == m);}
    if (idx == 1){
      r = d[idx+1];
      l = d[idx-1];
      if (l == m) return true;
      else if (r == m) return true;
      return false;}
    if (idx == dsize-1){
      l = d[idx-1];
      l2 = d[idx-2];
      if ((m == l) && (m != l2)) return true;
      return false;}
    // could put assert(0) here, should never get here
    return false;
  }
};

struct copy_func2
{
  int thresh;
  copy_func2(int _thresh) : thresh(_thresh) {};
  template <typename T>
  __host__ __device__
  bool operator()(T t){
    return (((thrust::get<0>(t) + thrust::get<1>(t))>=(thresh-1)) && (thrust::get<0>(t) < thresh));
  }
};

int main(){

  const int length_threshold = 2;
  int ID[] = {1,2,2,3,3,3};
  int VN[] = {6,7,8,5,7,8};
  int dsize = sizeof(ID)/sizeof(int);
  // we assume dsize > 3
  thrust::device_vector<int> id(ID, ID+dsize);
  thrust::device_vector<int> vn(VN, VN+dsize);

  thrust::device_vector<int> res_id(dsize);
  thrust::device_vector<int> res_vn(dsize);
  thrust::counting_iterator<int> idx(0);

  //method 1: sequence length threshold of 2

  int rsize = thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(id.begin(), vn.begin())), thrust::make_zip_iterator(thrust::make_tuple(id.end(), vn.end())), idx,  thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin())), copy_func(thrust::raw_pointer_cast(id.data()), dsize)) - thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin()));

  std::cout << "ID: ";
  thrust::copy_n(res_id.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl << "VN: ";
  thrust::copy_n(res_vn.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl;

  //method 2: for arbitrary sequence length threshold
  thrust::device_vector<int> res_fs(dsize);
  thrust::device_vector<int> res_rs(dsize);
  thrust::exclusive_scan_by_key(id.begin(), id.end(), thrust::constant_iterator<int>(1), res_fs.begin());
  thrust::exclusive_scan_by_key(id.rbegin(), id.rend(), thrust::constant_iterator<int>(1), res_rs.begin());
  rsize = thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(id.begin(), vn.begin())), thrust::make_zip_iterator(thrust::make_tuple(id.end(), vn.end())), thrust::make_zip_iterator(thrust::make_tuple(res_fs.begin(), res_rs.rbegin())),  thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin())), copy_func2(length_threshold)) - thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin()));

  std::cout << "ID: ";
  thrust::copy_n(res_id.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl << "VN: ";
  thrust::copy_n(res_vn.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl;
  return 0;
}

$ nvcc -o t1095 t1095.cu
$ ./t1095
ID: 2 2 3 3
VN: 7 8 5 7
ID: 2 2 3 3
VN: 7 8 5 7

Notes:

the copy_func implements the test logic for a given element for method 1. It receives the index of that element (via the stencil) as well as a pointer to the ID data on the device, and the size of the data, via functor initialization parameters. The variables r, m, l, and l2 refer to the element to my right, myself, the element to my left, and the element two places to my left, respectively.
we are passing a pointer to the ID data to the functor. This allows the functor to retrieve the (up to) 4 necessary elements for the test logic. This avoids a messy construction of a thrust::zip_iterator to provide all these values. Note that the reads of these elements in the functor should coalesce nicely, and therefore be fairly efficient, and also benefit from the cache.
I don't claim that this is defect-free. I think I got the test logic right, but it's possible I didn't. You should verify the logical correctness of that portion of the code, at least. My purpose is not to give you a black-box piece of code, but to demonstrate how to think your way through the problem.
This approach may get cumbersome for key sequences longer than 2. In that case I would suggest method 2. (If you already have a sequential for-loop that implements the necessary logic, you may be able to drop a modified version of that into the method 1 functor for longer key sequences. Such a for-loop should probably still benefit from coalesced access and adjacent accesses from the cache.)

这篇关于制作关键出现的次数用大小完全相同CUDA /推力的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

制作关键出现的次数用大小完全相同CUDA /推力 [英] Making the number of key occurances equal using CUDA / Thrust

问题描述

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

制作关键出现的次数用大小完全相同CUDA /推力 [英] Making the number of key occurances equal using CUDA / Thrust

问题描述

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭