CUDA Thrust: reduce_by_key on only some values in an array, based off values in a "key" array
Question
Let's say I have two device_vector<byte> arrays, d_keys and d_data.
If d_data is, for example, a flattened 2D 3x5 array (e.g. { 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3 }) and d_keys is a 1D array of size 5 (e.g. { 1, 0, 0, 1, 1 }), how can I do a reduction such that I'd end up only adding values on a per-row basis if the corresponding d_keys value is one (e.g. ending up with a result of { 10, 23, 14 })?
The sum_rows.cu example allows me to add every value in d_data, but that's not quite right.
Alternatively, I can, on a per-row basis, use a zip_iterator to combine d_keys with one row of d_data at a time, and do a transform_reduce, adding only if the key value is one, but then I'd have to loop through the d_data array row by row.
What I really need is some sort of transform_reduce_by_key functionality that isn't built-in, but surely there must be a way to make it!
Answer
Based on the additional comment that there are thousands of rows rather than 3, we can write a transform functor that sums an entire row. With thousands of rows, this should keep the machine pretty busy:
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>

#define ROW 20
#define COL 10

// Global device pointers, filled in from the host with cudaMemcpyToSymbol.
__device__ int *vals;
__device__ int *keys;

struct test_functor
{
  const int a;
  test_functor(int _a) : a(_a) {}
  __device__
  int operator()(int& x, int& y) {     // y is the row index
    int temp = 0;
    for (int i = 0; i < a; i++)        // masked row sum: a 0 key zeroes the term
      temp += vals[i + (y*a)] * keys[i];
    return temp;
  }
};

int main(){
  int *s_vals, *s_keys;
  thrust::host_vector<int> h_vals(ROW*COL);
  thrust::host_vector<int> h_keys(COL);
  thrust::sequence(h_vals.begin(), h_vals.end());
  thrust::fill(h_keys.begin(), h_keys.end(), 1);
  h_keys[0] = 0;                       // mask out the first column
  thrust::device_vector<int> d_vals = h_vals;
  thrust::device_vector<int> d_keys = h_keys;
  thrust::device_vector<int> d_sums(ROW);
  thrust::fill(d_sums.begin(), d_sums.end(), 0);
  s_vals = thrust::raw_pointer_cast(&d_vals[0]);
  s_keys = thrust::raw_pointer_cast(&d_keys[0]);
  cudaMemcpyToSymbol(vals, &s_vals, sizeof(int *));
  cudaMemcpyToSymbol(keys, &s_keys, sizeof(int *));
  thrust::device_vector<int> d_idx(ROW);
  thrust::sequence(d_idx.begin(), d_idx.end());   // one index per row
  thrust::transform(d_sums.begin(), d_sums.end(), d_idx.begin(), d_sums.begin(), test_functor(COL));
  thrust::host_vector<int> h_sums = d_sums;
  std::cout << "Results :" << std::endl;
  for (unsigned i = 0; i < ROW; i++)
    std::cout << "h_sums[" << i << "] = " << h_sums[i] << std::endl;
  return 0;
}
This approach has the drawback that, in general, accesses to the vals array will not be coalesced. However, for a few thousand rows the cache may offer significant relief. We can fix this problem by re-ordering the data to be stored in column-major form in the flattened array, and changing the indexing in the functor's loop to be like this:
for (int i = 0; i < a; i++)
  temp += vals[(i*ROW) + y] * keys[i];
If preferred, you can pass ROW as an additional parameter to the functor.