Finding the minimum in an array (but skipping some elements) using reduction in CUDA
Problem description
I have a large array of floating-point numbers, and I want to find the minimum value of the array (ignoring -1s wherever present) as well as its index, using reduction in CUDA. I have written the following code to do this, which in my opinion should work:
__global__ void get_min_cost(float *d_Cost,int n,int *last_block_number,int *number_in_last_block,int *d_index){
    int tid = threadIdx.x;
    int myid = blockDim.x * blockIdx.x + threadIdx.x;
    int s;
    if(blockIdx.x == (*last_block_number)-1){
        s = (*number_in_last_block)/2;
    }else{
        s = 1024/2;
    }
    for(;s>0;s/=2){
        if(myid+s>=n)
            continue;
        if(tid<s){
            if(d_Cost[myid+s] == -1){
                continue;
            }else if(d_Cost[myid] == -1 && d_Cost[myid+s] != -1){
                d_Cost[myid] = d_Cost[myid+s];
                d_index[myid] = d_index[myid+s];
            }else{
                // both not -1
                if(d_Cost[myid]<=d_Cost[myid+s])
                    continue;
                else{
                    d_Cost[myid] = d_Cost[myid+s];
                    d_index[myid] = d_index[myid+s];
                }
            }
        }
        else
            continue;
        __syncthreads();
    }
    if(tid==0){
        d_Cost[blockIdx.x] = d_Cost[myid];
        d_index[blockIdx.x] = d_index[myid];
    }
    return;
}
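The selection rule inside the loop (skip a -1 operand, otherwise keep the smaller value, keeping the left one on ties) can be isolated as a pure function on (value, index) pairs, which makes it easy to test on the host. This is a minimal sketch; the Candidate struct and the combine name are hypothetical, and in a .cu file the function could be marked __host__ __device__ so a kernel could share it.

```cpp
// A (value, index) pair; value == -1 marks an entry to be ignored.
struct Candidate {
    float value;
    int index;
};

// Same decision as the if/else chain in get_min_cost:
//  - right is -1              -> keep left (this also covers "both are -1")
//  - left is -1, right is not -> take right
//  - both valid               -> keep left unless right is strictly smaller
inline Candidate combine(Candidate left, Candidate right) {
    if (right.value == -1.0f) return left;
    if (left.value == -1.0f) return right;
    return (left.value <= right.value) ? left : right;
}
```

Because the halving loop pairs position tid with tid+s, and those positions no longer hold elements in index order after the first step, ties are not guaranteed to resolve to the lowest index; also comparing index when the values are equal would make that explicit.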
The last_block_number argument is the id of the last block, and number_in_last_block is the number of elements in the last block (which is a power of 2). Thus, every block is launched with 1024 threads; the last block only uses number_in_last_block of them, while the others use all 1024.
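The block count and the size of the last block follow from n and the block size. The helper below is a hypothetical sketch (neither name comes from the question). It also exposes a power-of-two check, because starting from s = number_in_last_block/2 and halving only reaches every element when that count is a power of two: with 6 elements, s goes 3, 1, and the pair held at position 2 (elements 2 and 5) is never folded into position 0.

```cpp
#include <stdexcept>

// Hypothetical helper: how many blocks of threads_per_block cover n elements,
// and how many elements fall into the final (possibly partial) block.
struct BlockSplit {
    int blocks;
    int number_in_last_block;
};

inline bool is_power_of_two(int x) { return x > 0 && (x & (x - 1)) == 0; }

inline BlockSplit split_into_blocks(int n, int threads_per_block = 1024) {
    if (n <= 0 || threads_per_block <= 0)
        throw std::invalid_argument("n and threads_per_block must be positive");
    int blocks = (n + threads_per_block - 1) / threads_per_block;  // ceiling division
    int last = n - (blocks - 1) * threads_per_block;               // in 1..threads_per_block
    return {blocks, last};
}
```

Padding the input with -1 up to a power of two is one way to satisfy the assumption, since the kernel already skips -1 entries.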
After this function runs, I expect the minimum value of each block to be in d_Cost[blockIdx.x] and its index in d_index[blockIdx.x].
I call this function multiple times, each time updating the number of threads and blocks. The second time I call this function, the number of threads becomes equal to the number of blocks remaining from the first call, and so on.
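The multi-pass scheme can be simulated on the host to make the data flow explicit. In this sketch (plain C++, hypothetical names), every pass reduces each run of block consecutive candidates to one winner and writes the winners to a new buffer, which the next pass reads.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (value, index) pair; value == -1 marks an entry to be ignored.
using Cand = std::pair<float, int>;

// Same selection rule as the kernel: skip -1, otherwise keep the smaller value.
inline Cand pick(const Cand& a, const Cand& b) {
    if (b.first == -1.0f) return a;
    if (a.first == -1.0f) return b;
    return (a.first <= b.first) ? a : b;
}

// Host simulation of the multi-pass scheme. Each pass reduces every run of
// `block` consecutive candidates to one winner and appends it to a NEW buffer;
// passes repeat until one candidate is left. A returned value of -1 means
// that every input was -1 (or the input was empty).
Cand multipass_min(const std::vector<float>& values, std::size_t block = 1024) {
    if (block < 2) block = 2;                      // block size 1 would never shrink
    std::vector<Cand> in;
    in.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i)
        in.push_back({values[i], static_cast<int>(i)});
    if (in.empty()) return {-1.0f, -1};

    while (in.size() > 1) {
        std::vector<Cand> out;                     // separate output buffer per pass
        for (std::size_t start = 0; start < in.size(); start += block) {
            std::size_t end = std::min(start + block, in.size());
            Cand best = in[start];
            for (std::size_t i = start + 1; i < end; ++i)
                best = pick(best, in[i]);
            out.push_back(best);                   // one winner per "block"
        }
        in.swap(out);                              // next pass reads this pass's winners
    }
    return in[0];
}
```

Writing each pass to a separate buffer is deliberate. In the kernel above, block b stores its result into d_Cost[b] and d_index[b], which for b > 0 are elements inside block 0's range that block 0 may still be reading, and CUDA does not order the blocks of one launch. A related requirement is that __syncthreads() be reached by all threads of a block; the continue statements in the loop let some threads skip it.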
However, the above function isn't giving me the desired output. In fact, it gives a different output every time I run the program, i.e., it returns an incorrect value as the minimum during some intermediate iteration (though that incorrect value is quite close to the true minimum every time).
What am I doing wrong here?
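Since the wrong value is close to the true minimum and changes from run to run, comparing the GPU result with a sequential reference is a simple way to detect a bad run. The sketch below is plain C++ (it compiles unchanged as CUDA host code); the function name min_ignoring_sentinel and the convention of returning index -1 when every element is skipped are illustrative assumptions, not part of the question.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential reference: smallest value in `a` that is not equal to `sentinel`,
// together with its index. Returns {sentinel, -1} if every element is skipped.
// On ties the first (lowest) index wins.
std::pair<float, long> min_ignoring_sentinel(const std::vector<float>& a,
                                             float sentinel = -1.0f) {
    float best = sentinel;
    long best_idx = -1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == sentinel) continue;            // skip the -1 entries
        if (best_idx == -1 || a[i] < best) {       // strict < keeps the first minimum
            best = a[i];
            best_idx = static_cast<long>(i);
        }
    }
    return {best, best_idx};
}
```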
Recommended answer
As I mentioned in my comment above, I would recommend avoiding writing your own reductions and using CUDA Thrust whenever possible. This holds even when you need to customize those operations, since the customization can be done by properly overloading, e.g., relational operations.
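As a host-side illustration of that kind of customization, the -1 entries can be skipped without touching the data by giving the minimum search a comparison that orders -1 after every valid value. This sketch uses standard C++ rather than Thrust, and the comparator name is hypothetical; thrust::min_element also accepts a comparison functor, so the same functor, marked __host__ __device__, should carry over.

```cpp
#include <algorithm>
#include <vector>

// Strict weak ordering in which -1 compares greater than every other value,
// so a minimum search never selects a -1 unless all entries are -1.
struct less_ignoring_minus_one {
    bool operator()(float a, float b) const {
        if (a == -1.0f) return false;   // -1 is never "less" than anything
        if (b == -1.0f) return true;    // any valid value is less than -1
        return a < b;
    }
};

// Returns the index of the smallest non -1 entry, or -1 if there is none.
long argmin_ignoring_minus_one(const std::vector<float>& v) {
    auto it = std::min_element(v.begin(), v.end(), less_ignoring_minus_one{});
    if (it == v.end() || *it == -1.0f) return -1;
    return static_cast<long>(it - v.begin());
}
```

std::min_element returns the first smallest element, so among equal valid values the lowest index is reported.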
Below I'm providing a simple code that evaluates the minimum of an array along with its index. It is based on a classical example contained in the "An Introduction to Thrust" presentation. The only addition is skipping the -1's from the search, as you requested. This can reasonably be done by replacing all the -1's in the array with INT_MAX, i.e., the largest value an int can hold (defined in <climits>). The example uses an int array; for float data like yours, FLT_MAX from <cfloat> plays the same role.
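For the float array in the question, +infinity (or FLT_MAX) is the natural replacement value. Below is a host-side sketch of the same replace-then-search idea with the standard library, where std::replace and std::min_element stand in for thrust::replace and the reduction; the function name is hypothetical.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Replace-then-search: work on a copy, turn every -1 into +infinity so it can
// never be the minimum, then take the position of the smallest element.
// Returns -1 if the array is empty or contains only -1 entries.
long argmin_after_replace(std::vector<float> v) {   // by value: caller's data is untouched
    const float big = std::numeric_limits<float>::infinity();
    std::replace(v.begin(), v.end(), -1.0f, big);
    auto it = std::min_element(v.begin(), v.end());
    if (it == v.end() || *it == big) return -1;
    return static_cast<long>(it - v.begin());
}
```

If the data can legitimately contain +infinity, such entries become indistinguishable from skipped ones, so a different sentinel may be preferable in that case.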
#include <climits>    // INT_MAX
#include <cstdio>     // printf
#include <thrust/device_vector.h>
#include <thrust/replace.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// --- Struct returning the smallest of two tuples (lexicographic comparison:
//     smaller value first, then smaller index on ties)
struct smaller_tuple
{
    __host__ __device__ thrust::tuple<int,int> operator()(thrust::tuple<int,int> a, thrust::tuple<int,int> b)
    {
        if (a < b)
            return a;
        else
            return b;
    }
};

int main() {
    const int N = 20;
    const int large_value = INT_MAX;

    // --- Setting the data vector
    thrust::device_vector<int> d_vec(N,10);
    d_vec[3] = -1; d_vec[5] = -2;

    // --- Copying the data vector to a new vector where the -1's are changed to INT_MAX
    thrust::device_vector<int> d_vec_temp(d_vec);
    thrust::replace(d_vec_temp.begin(), d_vec_temp.end(), -1, large_value);

    // --- Creating the index sequence [0, 1, 2, ... )
    thrust::device_vector<int> indices(d_vec_temp.size());
    thrust::sequence(indices.begin(), indices.end());

    // --- Setting the initial value of the search
    thrust::tuple<int,int> init(d_vec_temp[0],0);

    thrust::tuple<int,int> smallest;
    smallest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.begin(), indices.begin())),
                              thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.end(), indices.end())),
                              init, smaller_tuple());

    printf("Smallest %i %i\n",thrust::get<0>(smallest),thrust::get<1>(smallest));
    return 0;
}
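For a quick check without a GPU, the Thrust program can be mirrored on the host. std::tuple's operator< is lexicographic, so, as with smaller_tuple, equal values are resolved in favor of the smaller index. A sketch with a hypothetical function name:

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <tuple>
#include <vector>

using ValIdx = std::tuple<int, int>;   // (value, index), as in the Thrust example

// Host mirror of the Thrust program: replace -1 by INT_MAX, pair each value
// with its index, and fold with a "smaller tuple" rule.
ValIdx smallest_with_index(std::vector<int> v) {
    std::replace(v.begin(), v.end(), -1, INT_MAX);          // like thrust::replace
    ValIdx best(v.empty() ? INT_MAX : v[0], 0);             // like the init tuple
    for (std::size_t i = 0; i < v.size(); ++i) {
        ValIdx cand(v[i], static_cast<int>(i));             // zip value with its index
        best = (cand < best) ? cand : best;                 // lexicographic, like smaller_tuple
    }
    return best;
}
```

For the data in the program (twenty 10s, with -1 at position 3 and -2 at position 5) this returns (-2, 5): only -1 is skipped, so -2 is treated as an ordinary value.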