Calculating PI with Fortran & CUDA


Problem Description

I am trying to make a simple program in PGI's Fortran compiler. This simple program will use the graphics card to calculate pi using the "dart board" algorithm. After battling with this program for quite some time now, I have finally got it to behave for the most part. However, I am currently stuck on passing back the results properly. I must say, this is a rather tricky program to debug, since I can no longer shove any print statements into the subroutine. This program currently returns all zeros. I am not really sure what is going on, but I have two ideas, neither of which I am sure how to fix:

  1. The CUDA kernel is not running somehow?
  2. I am not converting the values properly? pi_parts = pi_parts_d

Well, this is the status of my current program. All variables ending in _d stand for the CUDA-prepared device memory, while all the other variables (with the exception of the CUDA kernel) are typical Fortran CPU-prepared variables. There are some print statements I have commented out that I have already tried from CPU Fortran land; those commands were to check whether I really was generating the random numbers properly. As for the CUDA method, I have currently commented out the calculations and replaced z with a static value of 1, just to see something happen.

module calcPi
contains
    attributes(global) subroutine pi_darts(x, y, results, N)
        use cudafor
        implicit none
        integer :: id
        integer, value :: N
        real, dimension(N) :: x, y, results
        real :: z

        id = (blockIdx%x-1)*blockDim%x + threadIdx%x

        if (id .lt. N) then
            ! SQRT NOT NEEDED, SQRT(1) === 1
            ! Anything above and below 1 would stay the same even with the applied
            ! sqrt function. Therefore using the sqrt function wastes GPU time.
            z = 1.0
            !z = x(id)*x(id)+y(id)*y(id)
            !if (z .lt. 1.0) then
            !   z = 1.0
            !else
            !   z = 0.0
            !endif
            results(id) = z
        endif
    end subroutine pi_darts
end module calcPi

program final_project
    use calcPi
    use cudafor
    implicit none
    integer, parameter :: N = 400
    integer :: i
    real, dimension(N) :: x, y, pi_parts
    real, dimension(N), device :: x_d, y_d, pi_parts_d
    type(dim3) :: grid, tBlock

    ! Initialize the random number generator's seed
    call random_seed()

    ! Make sure we initialize the parts with 0
    pi_parts = 0

    ! Prepare the random numbers (These cannot be generated from inside the
    ! cuda kernel)
    call random_number(x)
    call random_number(y)

    !write(*,*) x, y

    ! Convert the random numbers into graphics card memory land!
    x_d = x
    y_d = y
    pi_parts_d = pi_parts

    ! For the cuda kernel
    tBlock = dim3(256,1,1)
    grid = dim3((N/tBlock%x)+1,1,1)

    ! Start the cuda kernel
    call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)

    ! Transform the results into CPU Memory
    pi_parts = pi_parts_d
    write(*,*) pi_parts

    write(*,*) 'PI: ', 4.0*sum(pi_parts)/N
end program final_project

EDIT TO CODE: Changed various lines to reflect the fixes mentioned by Robert Crovella. Current status: cuda-memcheck catches an error on my machine: Program hit error 8 on CUDA API call to cudaLaunch.

If there is any method I can use to test this program please let me know. I am throwing darts and seeing where they land for my current style of debugging with CUDA. Not the most ideal, but it will have to do until I find another way.

May the Fortran Gods have mercy on my soul at this dark hour.

Solution

When I compile and run your program I get a segfault. This is due to the last parameter you are passing to the kernel (N_d):

call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N_d)

Since N is a scalar quantity, the kernel is expecting to use it directly, rather than as a pointer. So when you pass a pointer to device data (N_d), the process of setting up the kernel generates a seg fault (in host code!) as it attempts to access the value N, which should be passed directly as:

call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)

When I make that change to the code you have posted, I then get actual printed output instead of a seg fault: an array of ones and zeroes (256 ones followed by 144 zeroes, for a total of N=400 values), followed by the calculated PI value, which happens to be 2.56 in this case (4*256/400), since you have made the kernel basically a dummy kernel.
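For reference, the mechanism at work is the value attribute already present on N's declaration in the posted kernel: a scalar dummy argument declared with value is copied by value at kernel launch, while array arguments must refer to device memory. A minimal sketch with a hypothetical kernel (scale_by_n is made up purely for illustration):

! Hypothetical kernel, for illustration only
attributes(global) subroutine scale_by_n(a, N)
    use cudafor
    implicit none
    integer, value :: N        ! scalar: copied by value at launch, no N_d needed
    real, dimension(N) :: a    ! array: the host must pass a device array here
    integer :: id

    id = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (id .le. N) a(id) = a(id) * N
end subroutine scale_by_n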

This line of code is also probably not doing what you want:

grid = dim3(N/tBlock%x,1,1)

With N = 400 and tBlock%x = 256 (from previous code lines), the result of the calculation is 1 (ie. grid ends up at (1,1,1) which amounts to one threadblock). But you really want to launch 2 threadblocks, so as to cover the entire range of your data set (N = 400 elements). There's a number of ways to fix this, but for simplicity let's just always add 1 to the calculation:

grid = dim3((N/tBlock%x)+1,1,1)
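As an aside (this variant is not in the original answer): the usual way to round up without launching a spare block when N is an exact multiple of the block size is the ceiling-division idiom:

! Integer ceiling division: (400+255)/256 = 2, and (512+255)/256 is
! still 2, so no fully-idle extra block is launched for exact multiples.
grid = dim3((N + tBlock%x - 1) / tBlock%x, 1, 1)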

Under these circumstances, when we launch grids that are larger (in terms of total threads) than our data set size (512 threads but only 400 data elements in this example) it's customary to put a thread check near the beginning of our kernel (in this case, after the initialization of id), to prevent out-of-bounds accesses, like so:

if (id .lt. N) then

(and a corresponding endif at the very end of the kernel code) This way, only the threads that correspond to actual valid data are allowed to do any work.

With the above changes, your code should be essentially functional, and you should be able to revert your kernel code to the proper statements and start to get an estimate of PI.
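For concreteness, the restored computation would look much like the logic that is commented out in the question. One caveat that is my observation, not the answer's: with Fortran's 1-based arrays, a bound check of id .le. N is needed for results(N) to be written at all; the id .lt. N form leaves results(N) at its host-initialized zero and biases the estimate slightly low. A sketch:

if (id .le. N) then
    ! sqrt is unnecessary: for z >= 0, z < 1 exactly when sqrt(z) < 1
    z = x(id)*x(id) + y(id)*y(id)
    if (z .lt. 1.0) then
        z = 1.0        ! dart landed inside the unit quarter-circle
    else
        z = 0.0        ! dart landed outside, so it does not count
    endif
    results(id) = z
endif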

Note that you can check the CUDA API for error return codes, and you can also run your code with cuda-memcheck to get an idea of whether the kernel is making out-of-bounds accesses. Neither of these would have helped with this particular seg fault, however.
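Still, for future debugging, the launch in the main program can be wrapped with explicit checks. A sketch, where cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString and the cudaSuccess constant all come from the cudafor module:

integer :: istat

call pi_darts<<<grid, tBlock>>>(x_d, y_d, pi_parts_d, N)

! Errors in the launch configuration itself (bad grid, bad arguments, ...)
istat = cudaGetLastError()
if (istat .ne. cudaSuccess) write(*,*) 'launch: ', cudaGetErrorString(istat)

! Errors raised while the kernel actually executes on the device
istat = cudaDeviceSynchronize()
if (istat .ne. cudaSuccess) write(*,*) 'sync:   ', cudaGetErrorString(istat)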
