Do While doesn't work inside CUDA kernel

Views: 144 Posted: 2016/8/24 14:47:23 Tags: c cuda gpgpu


Problem description

OK, I'm pretty new to CUDA, and I'm kind of lost, really lost.

I'm trying to calculate pi using the Monte Carlo method, and at the end I just get one add instead of 50.

I don't want to call the kernel from a do-while loop on the host, since that's too slow. My issue is that my code doesn't loop: it executes only once in the kernel.

Also, I'd like all the threads to access the same niter and pi, so that when some thread hits the counter limit all the others stop.

#define SEED 35791246

__shared__ int niter;
__shared__ double pi;

__global__ void calcularPi(){

    double x;
    double y;
    int count;
    double z;

    count = 0;
    niter = 0;

    //keep looping
    do{

        niter = niter + 1;

        //Generate random number
        curandState state;
        curand_init(SEED,(int)niter, 0, &state);
        x = curand(&state);
        y = curand(&state);

        z = x*x+y*y;
        if (z<=1) count++;
     pi =(double)count/niter*4;

    }while(niter < 50);

}

int main(void){

    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();

    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();

    typeof(pi) piFinal;
    cudaMemcpyFromSymbol(&piFinal, "pi", sizeof(piFinal),0, cudaMemcpyDeviceToHost);

    typeof(niter) niterFinal;
    cudaMemcpyFromSymbol(&niterFinal, "niter", sizeof(niterFinal),0, cudaMemcpyDeviceToHost);

    //Ends timer
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", piFinal);
    printf("Adds: %d \n", niterFinal);
    printf("Total time: %f \n", tempoTotal);

}

Solution

There are a variety of issues with your code.

I suggest using proper CUDA error checking (http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) and running your code with cuda-memcheck to spot any runtime errors. I've omitted proper error checking in my code below for brevity of presentation, but I've run it with cuda-memcheck to confirm that there are no runtime errors.
Your usage of curand() is probably not correct: it returns unsigned integers spread over the full 32-bit range, so x*x+y*y is almost never <= 1. For this code to work correctly, you want a floating-point quantity between 0 and 1. The correct call for that is curand_uniform().
Since you want all threads to work on the same values, you must prevent those threads from stepping on each other. One way to do that is to use atomic updates of the variables in question.
It should not be necessary to re-run curand_init on each iteration. Once per thread should be sufficient.
We don't use cudaMemcpyToSymbol/cudaMemcpyFromSymbol operations on __shared__ variables. For convenience, and to preserve something that resembles your original code, I've elected to convert those to __device__ variables.
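The error checking recommended in the first point might look roughly like the sketch below. The macro name CUDA_CHECK, the kernel bump and the variable counter are hypothetical names chosen for illustration; they do not appear in the original code. The sketch wraps each runtime API call, and checks a kernel launch in two steps: cudaGetLastError() for launch failures, then cudaDeviceSynchronize() for errors raised while the kernel runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name CUDA_CHECK is an assumption):
// evaluates a CUDA runtime call and aborts with a message if it failed.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// __device__ (not __shared__) so the host can copy it by symbol.
__device__ int counter;

// Illustrative kernel: every thread adds 1 to the shared counter atomically.
__global__ void bump() {
    atomicAdd(&counter, 1);
}

int main(void) {
    int h = 0;
    CUDA_CHECK(cudaMemcpyToSymbol(counter, &h, sizeof(int)));

    bump<<<1, 32>>>();
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors during execution

    CUDA_CHECK(cudaMemcpyFromSymbol(&h, counter, sizeof(int)));
    printf("counter = %d\n", h);
    return 0;
}
```

If counter were declared __shared__ instead, the symbol copies are the calls where a check of this kind would be expected to report an error rather than let the program continue silently.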

Here's a modified version of your code that has most of the above issues fixed:

$ cat t978.cu
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>

#define ITER_MAX 5000
#define SEED 35791246

__device__ int niter;
__device__ int count;

__global__ void calcularPi(){

    double x;
    double y;
    double z;
    int lcount;
    curandState state;
    curand_init(SEED,threadIdx.x, 0, &state);
    //keep looping
    do{

        lcount = atomicAdd(&niter, 1);

        //Generate random number
        x = curand_uniform(&state);
        y = curand_uniform(&state);

        z = x*x+y*y;
        if (z<=1) atomicAdd(&count, 1);

    }while(lcount < ITER_MAX);

}

int main(void){

    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();
    int count_final = 0;
    int niter_final = 0;
    cudaMemcpyToSymbol(niter, &niter_final, sizeof(int));
    cudaMemcpyToSymbol(count, &count_final, sizeof(int));
    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(&count_final, count, sizeof(int));
    cudaMemcpyFromSymbol(&niter_final, niter, sizeof(int));

    //Ends timer
    double pi = count_final/(double)niter_final*4;
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;
    printf("Pi: %g \n", pi);
    printf("Adds: %d \n", niter_final);
    printf("Total time: %f \n", tempoTotal);

}
$ nvcc -o t978 t978.cu -lcurand
$ cuda-memcheck ./t978
========= CUDA-MEMCHECK
Pi: 3.12083
Adds: 5032
Total time: 0.558463
========= ERROR SUMMARY: 0 errors
$

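I've raised the iteration limit to a larger number, but you can set ITER_MAX to 50 if you want. Note that the reported Adds (5032) exceeds ITER_MAX by 32: each of the 32 threads performs one final atomicAdd that returns a value of at least ITER_MAX before it leaves the loop.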

Note that there are many criticisms that could be levelled against this code. My aim here, since it's clearly a learning exercise, is to point out the minimum number of changes needed to get functional code using the algorithm you've outlined. As just one example, you might want to change your kernel launch configuration (<<<1,32>>>) to other, larger numbers in order to utilize the GPU more fully.
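As a rough illustration of the larger launch configuration mentioned above, here is one possible sketch. It is not the algorithm from the question: instead of a shared stop counter, each thread draws a fixed number of samples, counts its hits locally, and issues a single atomicAdd at the end. The names estimatePi, hits, piFromCounts and samplesPerThread, and the values 64, 256 and 1000, are assumptions made for this example. Error checking is again omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define SEED 35791246ULL

// Total number of sample points that fell inside the quarter circle.
__device__ unsigned long long hits;

// Each thread draws samplesPerThread points and counts hits in a register,
// then adds its local count to the global total with one atomicAdd.
__global__ void estimatePi(int samplesPerThread) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(SEED, tid, 0, &state);  // once per thread, distinct subsequence

    unsigned int local = 0;
    for (int i = 0; i < samplesPerThread; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) ++local;
    }
    atomicAdd(&hits, (unsigned long long)local);
}

// Host helper: turns the hit count into an estimate of pi.
static double piFromCounts(unsigned long long inside, unsigned long long total) {
    return 4.0 * (double)inside / (double)total;
}

int main(void) {
    const int blocks = 64, threads = 256, samplesPerThread = 1000;  // illustrative

    unsigned long long h = 0;
    cudaMemcpyToSymbol(hits, &h, sizeof(h));

    estimatePi<<<blocks, threads>>>(samplesPerThread);
    cudaDeviceSynchronize();

    cudaMemcpyFromSymbol(&h, hits, sizeof(h));
    unsigned long long total =
        (unsigned long long)blocks * threads * samplesPerThread;
    printf("Pi: %g\n", piFromCounts(h, total));
    return 0;
}
```

Counting locally keeps the number of atomic operations at one per thread, and the fixed per-thread workload means the total sample count is known on the host without reading a second counter back.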
