Returning a pointer to a device allocated matrix from C to Fortran


Question


First off, I'm kind of new to Fortran/C/CUDA. Secondly, I'm working on a Fortran/C program that performs matrix-vector multiplication on the GPU using cuBLAS. I need to multiply multiple (up to 1000) vectors by the same matrix before I need to update the matrix contents. However, the current version I have has to reallocate the matrix every time a new vector is sent to the GPU, which is incredibly wasteful and slow since the matrix hasn't changed.


I want to be able to multiply the matrix with each vector without having to reallocate the matrix for every vector. An idea I had involved calling a separate C function that would allocate the matrix on the GPU and return a pointer to the Fortran main program, which would then call another C function that performs the matrix-vector multiplication.


Using ISO_C_BINDING, I returned a pointer to a floating point number into the variable:

type(C_PTR) :: ptr


and when I try to pass this into the matrix vector C function:

In Fortran:

call cudaFunction(ptr,vector, N)

In C:

extern "C" void cudaFunction_(float *mat, float *vector, int *N)


everything compiles and runs, but the cublasSgemv call fails to execute. Any ideas why this would be happening? I've seen a few posts that are kind of related, but they never try to send the returned pointer back to C, and that's where (I believe) I'm having the issue.

Thanks in advance!

Answer


I would suggest that you not reinvent the wheel, but instead use the cuBLAS Fortran bindings that are provided for this purpose.


The "thunking" wrapper is not what you want. It performs implicit copy operations as needed, any time you make a cuBLAS call from Fortran.


You want the "non-thunking" wrapper, so you have explicit control over the copying. You can use the Fortran equivalents of Get/SetMatrix and Get/SetVector to copy data back and forth.


There is a sample code (example B.2) in the cuBLAS documentation showing how to use the non-thunking wrapper.


Even if you do want to re-invent the wheel, the wrapper will show you how to make the necessary syntax work to move between C and Fortran.


On a standard Linux CUDA install, the wrappers are in /usr/local/cuda/src. The non-thunking wrapper is /usr/local/cuda/src/fortran.c.


Here's a fully worked example:

cublasf.f:

      program cublas_fortran_example
      implicit none
      integer i, j

c     helper functions
      integer cublas_init
      integer cublas_shutdown
      integer cublas_alloc
      integer cublas_free
      integer cublas_set_vector
      integer cublas_get_vector
c     selected blas functions
      double precision cublas_ddot
      external cublas_daxpy
      external cublas_dscal
      external cublas_dcopy
      double precision cublas_dnrm2
c     cublas variables
      integer cublas_status
      real*8 x(30), y(30)
      double precision alpha, beta
      double precision nrm
      integer*8 d_x, d_y, d_alpha, d_beta, d_nrm
      integer*8 dsize1, dlength1, dlength2
      double precision dresult



      write(*,*) "testing cublas fortran example"

c     initialize cublas library
c     CUBLAS_STATUS_SUCCESS=0
      cublas_status = cublas_init()
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS Library initialization failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
c     initialize data
      do j=1,30
        x(j) = 1.0
        y(j) = 2.0
      enddo
      dsize1 = 8
      dlength1 = 30
      dlength2 = 1
      alpha = 2.0
      beta = 3.0
c     allocate device storage
      cublas_status = cublas_alloc(dlength1, dsize1, d_x)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength1, dsize1, d_y)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_alpha)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_beta)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_nrm)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif

c     copy data from host to device

      cublas_status = cublas_set_vector(dlength1, dsize1, x, dlength2,
     >     d_x, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to device failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      cublas_status = cublas_set_vector(dlength1, dsize1, y, dlength2,
     >     d_y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to device failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif

      dresult = cublas_ddot(dlength1, d_x, dlength2, d_y, dlength2)
      write(*,*) "dot product result=",dresult

      dresult = cublas_dnrm2(dlength1, d_x, dlength2)
      write(*,*) "nrm2 of x result=",dresult

      dresult = cublas_dnrm2(dlength1, d_y, dlength2)
      write(*,*) "nrm2 of y result=",dresult

      call cublas_daxpy(dlength1, alpha, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >     y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "daxpy y(1)  =", y(1)
      write(*,*) "daxpy y(30) =", y(30)


      call cublas_dscal(dlength1, beta, d_x, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_x, dlength2,
     >     x, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "dscal x(1)  =", x(1)
      write(*,*) "dscal x(30) =", x(30)


      call cublas_dcopy(dlength1, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >     y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "dcopy y(1)  =", y(1)
      write(*,*) "dcopy y(30) =", y(30)

c     deallocate GPU memory and exit
      cublas_status = cublas_free(d_x)
      cublas_status = cublas_free(d_y)
      cublas_status = cublas_free(d_alpha)
      cublas_status = cublas_free(d_beta)
      cublas_status = cublas_free(d_nrm)
      cublas_status = cublas_shutdown()
      stop
      end

Compile/run:

$ gfortran -c -o cublasf.o cublasf.f
$ gcc -c -DCUBLAS_GFORTRAN -I/usr/local/cuda/include -I/usr/local/cuda/src -o fortran.o /usr/local/cuda/src/fortran.c
$ gfortran -L/usr/local/cuda/lib64 -lcublas -o cublasf cublasf.o fortran.o
$ ./cublasf
 testing cublas fortran example
 dot product result=   60.0000000000000
 nrm2 of x result=   5.47722557505166
 nrm2 of y result=   10.9544511501033
 daxpy y(1)  =   4.00000000000000
 daxpy y(30) =   4.00000000000000
 dscal x(1)  =   3.00000000000000
 dscal x(30) =   3.00000000000000
 dcopy y(1)  =   3.00000000000000
 dcopy y(30) =   3.00000000000000
$ 


CUDA 5.0, RHEL 5.5
