Conditional reduction in CUDA
Question
I need to sum about 100000 values stored in an array, but only those that satisfy a condition.
Can anyone post a small piece of code that does this?
Recommended answer
I think that, to perform a conditional reduction, you can fold the condition directly into the addends by multiplying each one by 0 (false) or 1 (true). For example, suppose the condition is that an addend be smaller than 10.f. Borrowing the first kernel from "Optimizing Parallel Reduction in CUDA" by M. Harris, this becomes:
// Note: as in the original reduce0, there is no bounds check, so the grid
// must cover the input exactly, and blockDim.x must be a power of two.
__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem;
    // elements that fail the condition are multiplied by 0
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i]*(g_idata[i]<10.f);
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
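The kernel above only produces one partial sum per block, so the host still has to size the dynamic shared memory, launch enough blocks, and add up the partial results. The following is a minimal sketch of one way to do that, not part of the original answer. The names `reduce_if_less_than` and `conditional_sum`, the `threshold` parameter, and the block size of 256 are all assumptions made for illustration. The sketch also adds a bounds check so that the element count need not be a multiple of the block size. CUDA error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of reduce0: sums only the elements below `threshold`
// and tolerates n not being a multiple of the block size.
__global__ void reduce_if_less_than(const int *g_idata, int *g_odata,
                                    unsigned int n, int threshold) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads and rejected elements both contribute 0.
    int v = (i < n) ? g_idata[i] : 0;
    sdata[tid] = v * (v < threshold);
    __syncthreads();
    // Same interleaved-addressing reduction as reduce0.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Assumed host helper: launches the kernel, then finishes the sum on the CPU.
int conditional_sum(const std::vector<int> &h_in, int threshold) {
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0;
    const unsigned int threads = 256;  // must be a power of two for this loop
    const unsigned int blocks = (n + threads - 1) / threads;

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reduce_if_less_than<<<blocks, threads, threads * sizeof(int)>>>(
        d_in, d_out, n, threshold);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int total = 0;
    for (int partial : h_out) total += partial;
    return total;
}
```

For the roughly 100000 values in the question, a call such as `conditional_sum(values, 10)` would launch 391 blocks and add 391 partial sums on the host. Finishing on the CPU is a simplification; for much larger inputs a second kernel pass over the partial sums may be preferable.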
If you wish to use CUDA Thrust to perform the conditional reduction, you can do the same with thrust::transform_reduce. Alternatively, you can use thrust::copy_if to copy all the elements of d_a that satisfy the predicate into a new vector d_b, and then apply thrust::reduce to d_b. I haven't checked which solution performs best. Perhaps the second solution will perform better on sparse arrays. Below is an example implementing both approaches.
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
#include <thrust/copy.h>

// --- Operator for the first approach: zero out elements that fail the test
struct conditional_operator {
    __host__ __device__ float operator()(const float a) const {
        return a*(a<10.f);
    }
};

// --- Operator for the second approach: predicate selecting elements to keep
struct is_smaller_than_10 {
    __host__ __device__ bool operator()(const float a) const {
        return (a<10.f);
    }
};

int main()
{
    int N = 20;

    // --- Host side allocation and vector initialization
    thrust::host_vector<float> h_a(N,1.f);
    h_a[0] = 20.f;
    h_a[1] = 20.f;

    // --- Device side allocation and vector initialization
    thrust::device_vector<float> d_a(h_a);

    // --- First approach: transform each element, then reduce
    float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
    printf("Result = %f\n",sum);

    // --- Second approach: count, copy the matching elements, then reduce
    int N_prime = thrust::count_if(d_a.begin(), d_a.end(), is_smaller_than_10());
    thrust::device_vector<float> d_b(N_prime);
    thrust::copy_if(d_a.begin(), d_a.end(), d_b.begin(), is_smaller_than_10());
    sum = thrust::reduce(d_b.begin(), d_b.end(), 0.f);
    printf("Result = %f\n",sum);

    // 18 elements equal to 1.f pass the test, so both approaches print 18.000000
    return 0;
}
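One caveat with multiplying by the predicate on floating-point data: if a rejected element is Inf or NaN, multiplying it by 0 yields NaN rather than 0 under IEEE 754 arithmetic, and that NaN then propagates into the sum. A hedged alternative, sketched below, selects the value with a ternary instead of multiplying. The names `select_below` and `conditional_sum_thrust` and the `limit` parameter are hypothetical and not part of the original answer.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// Hypothetical functor: passes a through if it is below the limit, else 0.
// A rejected Inf or NaN becomes exactly 0 because no multiplication occurs.
struct select_below {
    float limit;
    __host__ __device__ explicit select_below(float l) : limit(l) {}
    __host__ __device__ float operator()(const float a) const {
        return (a < limit) ? a : 0.f;
    }
};

// Assumed helper: copies the host data to the device and reduces it there.
float conditional_sum_thrust(const std::vector<float> &h_a, float limit) {
    thrust::device_vector<float> d_a(h_a.begin(), h_a.end());
    return thrust::transform_reduce(d_a.begin(), d_a.end(),
                                    select_below(limit),
                                    0.f, thrust::plus<float>());
}
```

Making the limit a member of the functor, rather than hard-coding 10.f, is only a design convenience so the same functor can be reused with different thresholds.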