Checking if a matrix contains NaNs or infinite values in CUDA

Problem description

What is an efficient way to check a large matrix for inf/nan elements in CUDA (C++)? The matrix is stored as float* in the GPU memory. I don't need the location of those elements, just a boolean yes/no answer if at least one bad entry is present.

The options are:

  • have one kernel check the whole array (easy to implement but probably slow)
  • have multiple kernels check e.g. the rows and combine the output with OR (are there any CUDA builtins for doing this efficiently?)
  • ..other ideas?

Thanks!

Solution

There are intrinsics for this, but the functions available for C99 should be fine:

isnan()

To test for inf, you can use:

isinf()
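If NaN and inf both count as bad entries, one predicate can cover the two cases. Below is a minimal sketch, assuming the C99-style `isfinite()` function is used in device code in place of separate `isnan()`/`isinf()` calls. The kernel name `finite_test` and the host helper `all_finite_on_device` are hypothetical, and the matrix is treated as a flat array because the positions of bad entries are not needed.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the original answer): clears *all_finite
// if any element is NaN or +/-infinity. The matrix is treated as a flat
// array of n floats.
__global__ void finite_test(const float *data, size_t n, bool *all_finite) {
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    if (!isfinite(data[i])) *all_finite = false;
}

// Hypothetical host helper: copies n floats to the GPU and runs the check.
// Error checking is omitted for brevity.
bool all_finite_on_device(const float *h_data, size_t n) {
  float *d_data;
  bool *d_ok, h_ok = true;
  cudaMalloc((void **)&d_data, n * sizeof(float));
  cudaMalloc((void **)&d_ok, sizeof(bool));
  cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_ok, &h_ok, sizeof(bool), cudaMemcpyHostToDevice);
  finite_test<<<16, 256>>>(d_data, n, d_ok);
  cudaMemcpy(&h_ok, d_ok, sizeof(bool), cudaMemcpyDeviceToHost);
  cudaFree(d_data);
  cudaFree(d_ok);
  return h_ok;
}

int main() {
  float v[4] = {0.0f, 1.0f, 2.0f, 3.0f};
  printf("clean data, all finite: %d\n", all_finite_on_device(v, 4));
  v[2] = INFINITY;
  printf("with inf, all finite:   %d\n", all_finite_on_device(v, 4));
  v[2] = nanf("");
  printf("with NaN, all finite:   %d\n", all_finite_on_device(v, 4));
  return 0;
}
```
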

It's rarely faster to have multiple kernels do the same work as a single well-written kernel, so I'm not sure why you think having a single kernel would be slow. This algorithm is likely to be memory-bound, so you want to focus on read data access efficiency, i.e. coalescing. In CUDA, the easy way to go through a row-major matrix is to have each thread handle a column. This can be implemented efficiently with a for-loop and results in perfectly coalesced reads, because adjacent threads read adjacent addresses on every iteration.

Since you only care about a single result with no indices, we can have multiple threads writing to a (boolean) result without atomics, for further efficiency, since any threads that might be writing to the result would all be writing the same value.

Another optimization one might consider is an early-exit strategy. This does not improve the worst-case timing, however, and in fact makes it longer, so I would dispense with it unless average throughput is a big issue.
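For reference, an early-exit variant might look roughly like the sketch below. This is an illustration under assumptions, not part of the answer's tested code: the kernel name `isnan_test_early_exit` and the helper `no_nan_on_device` are hypothetical, and the flag is declared `volatile` so that each poll actually re-reads memory. The extra flag read before every load is the cost that lengthens the worst case.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical early-exit variant of the NaN test. *result must be true
// before launch. A stale read of the flag only delays the exit; it never
// changes the final answer.
__global__ void isnan_test_early_exit(const float *data, int width, int height,
                                      volatile bool *result) {
  int stride = gridDim.x * blockDim.x;
  for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < width; idx += stride)
    for (int i = 0; i < height; i++) {
      if (!*result) return;  // some thread already found a NaN: stop scanning
      if (isnan(data[(size_t)i * width + idx])) { *result = false; return; }
    }
}

// Hypothetical host helper: returns true if the row-major width x height
// host matrix contains no NaN. Error checking is omitted for brevity.
bool no_nan_on_device(const float *h_data, int width, int height) {
  size_t bytes = sizeof(float) * width * height;
  float *d_data;
  bool *d_result, h_result = true;
  cudaMalloc((void **)&d_data, bytes);
  cudaMalloc((void **)&d_result, sizeof(bool));
  cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice);
  isnan_test_early_exit<<<16, 256>>>(d_data, width, height, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  cudaFree(d_data);
  cudaFree(d_result);
  return h_result;
}

int main() {
  const int width = 512, height = 64;
  float *m = (float *)calloc((size_t)width * height, sizeof(float));  // all zeros
  printf("clean matrix, no NaN: %d\n", no_nan_on_device(m, width, height));
  m[5 * width + 7] = nanf("");  // plant one NaN at row 5, column 7
  printf("one NaN,      no NaN: %d\n", no_nan_on_device(m, width, height));
  free(m);
  return 0;
}
```
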

Here's a complete worked example (using test for nan as an example):

$ cat t383.cu
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define DSIZEW 10000
#define DSIZEH 2000
#define nTPB 256
#define BLKS 16

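__global__ void isnan_test(float *data, int width, int height, bool *result){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;

  // each thread walks down one column; adjacent threads read adjacent
  // addresses, so the reads coalesce
  while (idx < width){
    for (int i = 0; i < height; i++)
      // racing writes are harmless: every writer stores the same value
      if (isnan(data[(i*width) + idx])) *result = false;
    idx += gridDim.x*blockDim.x;  // grid stride: advance by the total thread count
    }
}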

int main(){

  float *d_data, *h_data;
  bool  *d_result, h_result=true;  // true means "no NaN found"

  cudaMalloc((void **)&d_data, sizeof(float)*DSIZEW*DSIZEH);
  cudaMalloc((void **)&d_result, sizeof (bool));
  h_data=(float *)malloc(sizeof(float)*DSIZEW*DSIZEH);
  for (int i=0; i<DSIZEH*DSIZEW; i++)
    h_data[i] = rand()/(float)RAND_MAX;  // float division; integer division would give 0
  cudaMemcpy(d_data, h_data, sizeof(float)*DSIZEW*DSIZEH, cudaMemcpyHostToDevice);
  cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice);
  // first pass: the data holds no NaN, so the flag must stay true
  isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  if (!h_result) {printf("error in no-NAN check\n"); return 1;}
  float my_nan = nanf(""); // create a NAN value
  // second pass: overwrite the first element with NaN, so the flag must be cleared
  cudaMemcpy(d_data, &my_nan, sizeof(float), cudaMemcpyHostToDevice);
  isnan_test<<<BLKS,nTPB>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost);
  if (h_result) {printf("error in NAN check\n"); return 1;}
  printf("Success\n");
  return 0;
}


$ nvcc -arch=sm_20 -o t383 t383.cu
$ ./t383
Success
$

Note that I have dispensed with proper CUDA error checking for clarity/brevity, but it is always recommended.
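As a rough illustration of what such error checking could look like, here is a sketch that wraps every runtime call in a macro and checks the kernel launch separately. The macro name `CUDA_CHECK` and the helper `run_checked_nan_test` are hypothetical; the runtime calls themselves (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                     \
  do {                                                       \
    cudaError_t err_ = (call);                               \
    if (err_ != cudaSuccess) {                               \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
              cudaGetErrorString(err_), __FILE__, __LINE__); \
      exit(1);                                               \
    }                                                        \
  } while (0)

__global__ void isnan_test(const float *data, int width, int height, bool *result) {
  int stride = gridDim.x * blockDim.x;
  for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < width; idx += stride)
    for (int i = 0; i < height; i++)
      if (isnan(data[(size_t)i * width + idx])) *result = false;
}

// Hypothetical helper: runs the NaN test on a zero-filled device matrix
// with every step checked, and returns the flag (true means no NaN found).
bool run_checked_nan_test(int width, int height) {
  size_t bytes = sizeof(float) * width * height;
  float *d_data;
  bool *d_result, h_result = true;
  CUDA_CHECK(cudaMalloc((void **)&d_data, bytes));
  CUDA_CHECK(cudaMalloc((void **)&d_result, sizeof(bool)));
  CUDA_CHECK(cudaMemset(d_data, 0, bytes));  // all-zero bits are 0.0f, not NaN
  CUDA_CHECK(cudaMemcpy(d_result, &h_result, sizeof(bool), cudaMemcpyHostToDevice));
  isnan_test<<<16, 256>>>(d_data, width, height, d_result);
  CUDA_CHECK(cudaGetLastError());       // catches launch configuration errors
  CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution
  CUDA_CHECK(cudaMemcpy(&h_result, d_result, sizeof(bool), cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_data));
  CUDA_CHECK(cudaFree(d_result));
  return h_result;
}

int main() {
  printf("no NaN found: %d\n", run_checked_nan_test(1024, 64));
  return 0;
}
```

Kernel launches return no status themselves, which is why the launch is followed by `cudaGetLastError()` and a synchronizing call.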

For further optimization, you can play with the blocks-per-grid parameter (BLKS) and the threads-per-block parameter (nTPB). To some degree, however, the optimal values will depend on which GPU you are running on.
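One way to explore those two parameters is to time the kernel over a small grid of launch configurations with CUDA events. The sketch below assumes the same matrix dimensions as the example above; the helper `time_launch` and the candidate values in `block_counts` and `thread_counts` are hypothetical choices for illustration, and the timings it prints will differ from GPU to GPU.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define DSIZEW 10000
#define DSIZEH 2000

__global__ void isnan_test(const float *data, int width, int height, bool *result) {
  int stride = gridDim.x * blockDim.x;
  for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < width; idx += stride)
    for (int i = 0; i < height; i++)
      if (isnan(data[(size_t)i * width + idx])) *result = false;
}

// Hypothetical helper: returns the elapsed time in milliseconds of one
// kernel launch with the given configuration. Error checking omitted.
float time_launch(const float *d_data, bool *d_result, int blocks, int threads) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  isnan_test<<<blocks, threads>>>(d_data, DSIZEW, DSIZEH, d_result);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // wait until the kernel and the stop event finish
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  size_t bytes = sizeof(float) * DSIZEW * DSIZEH;
  float *d_data;
  bool *d_result;
  cudaMalloc((void **)&d_data, bytes);
  cudaMalloc((void **)&d_result, sizeof(bool));
  cudaMemset(d_data, 0, bytes);            // zero-filled matrix, no NaN
  cudaMemset(d_result, 1, sizeof(bool));   // flag starts as true

  time_launch(d_data, d_result, 16, 256);  // warm-up launch, result discarded

  const int block_counts[] = {16, 32, 64, 128};
  const int thread_counts[] = {128, 256, 512};
  for (int b = 0; b < 4; b++)
    for (int t = 0; t < 3; t++)
      printf("BLKS=%4d nTPB=%4d : %.3f ms\n", block_counts[b], thread_counts[t],
             time_launch(d_data, d_result, block_counts[b], thread_counts[t]));

  cudaFree(d_data);
  cudaFree(d_result);
  return 0;
}
```
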
