Julia: Parallel CUSPARSE calculations on multiple GPUs

Question

I have n separate GPUs, each storing its own data. I would like to have each of them perform a set of calculations simultaneously. The CUDArt documentation here describes the use of streams to asynchronously call custom C kernels in order to achieve parallelization (see also this other example here). With custom kernels, this can be accomplished through the use of the stream argument in CUDArt's implementation of the launch() function. As far as I can tell, however, the CUSPARSE (or CUBLAS) functions don't have a similar option for stream specification.

Is this possible with CUSPARSE, or do I just need to dive down to C if I want to use multiple GPUs?

Bounty update

Ok, so I finally have a relatively decent solution working. But I'm sure it could be improved in a million ways; it's quite hacky right now. In particular, I'd love suggestions for solutions along the lines of what I tried and wrote about in this SO question (which I never got to work properly). Thus, I'd be delighted to award the bounty to anyone with further ideas here.

Answer

Ok, so I think I've finally come upon something that works at least relatively well. I'd still be absolutely delighted to award the bounty to anyone who has further improvements, in particular improvements based on the design that I attempted (but failed) to implement, as described in this SO question.

The key breakthrough that I discovered for getting things like CUSPARSE and CUBLAS to parallelize over multiple GPUs is that you need to create a separate handle for each GPU. E.g., from the documentation on the cuBLAS API:

The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the handle is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.

This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads and, in each of those host threads, it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then, cuBLAS library function calls made with different handles will automatically dispatch the computation to different devices.

(emphasis added)

See here and here for some additional helpful docs.
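
Distilled from the full code further below, the handle-per-device pattern from the quoted docs looks like this in Julia. This is just a minimal sketch of the setup step, and it assumes the modified CUSPARSE.jl (exporting cusparseCreate and cusparseHandle_t) described next:

using CUDArt, CUSPARSE  ## modified CUSPARSE.jl, as described below

devlist = devices(dev->true)  ## all available GPUs
Handles = Array(Array{Ptr{Void},1}, length(devlist))

for (idx, dev) in enumerate(devlist)
    device(dev)                         ## make `dev` the active device
    Handles[idx] = cusparseHandle_t[0]  ## storage for the new handle
    cusparseCreate(Handles[idx])        ## this handle is now tied to device `dev`
end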

Now, in order to actually move forward on this, I had to do a bunch of rather messy hacking. In the future, I'm hoping to get in touch with the folks who developed the CUSPARSE and CUBLAS packages to see about incorporating this into their packages. For the time being though, this is what I did:

First, the CUSPARSE and CUBLAS packages come with functions to create handles. But I had to modify the packages a bit to export those functions (along with other needed functions and object types) so that I could actually access them myself.

Specifically, I added to CUSPARSE.jl the following:

export libcusparse, SparseChar

to libcusparse_types.jl the following:

export cusparseHandle_t, cusparseOperation_t, cusparseMatDescr_t, cusparseStatus_t

to libcusparse.jl the following:

export cusparseCreate

and to sparse.jl the following:

export getDescr, cusparseop

Through all of these, I was able to get functional access to the cusparseCreate() function, which can be used to create new handles (I couldn't just use CUSPARSE.cusparseCreate() because that function depended on a bunch of other functions and data types). From there, I defined a new version of the matrix multiplication operation that I wanted, which takes an additional argument, the Handle, to feed into the ccall() to libcusparse. Below is the full code:

using CUDArt, CUSPARSE  ## note: modified version of CUSPARSE, as indicated above.

N = 10^3;
M = 10^6;
p = 0.1;

devlist = devices(dev->true);
nGPU = length(devlist)

dev_X = Array(CudaSparseMatrixCSR, nGPU)   ## one sparse matrix per device
dev_b = Array(CudaArray, nGPU)             ## one dense input vector per device
dev_c = Array(CudaArray, nGPU)             ## one dense output vector per device
Handles = Array(Array{Ptr{Void},1}, nGPU)  ## one CUSPARSE handle per device


for (idx, dev) in enumerate(devlist)
    println("sending data to device $dev")
    device(dev) ## switch to given device
    dev_X[idx] = CudaSparseMatrixCSR(sprand(N,M,p))
    dev_b[idx] = CudaArray(rand(M))
    dev_c[idx] = CudaArray(zeros(N))
    Handles[idx] = cusparseHandle_t[0]  ## 1-element array to receive the handle pointer
    cusparseCreate(Handles[idx])        ## create a handle bound to device `dev`
end


function Pmv!(
    Handle::Array{Ptr{Void},1},
    transa::SparseChar,
    alpha::Float64,
    A::CudaSparseMatrixCSR{Float64},
    X::CudaVector{Float64},
    beta::Float64,
    Y::CudaVector{Float64},
    index::SparseChar)
    Mat     = A
    cutransa = cusparseop(transa)
    m,n = Mat.dims
    cudesc = getDescr(A,index)
    device(device(A))  ## necessary to switch to the device associated with the handle and data for the ccall 
    ccall(
        (:cusparseDcsrmv, libcusparse),
        cusparseStatus_t,
        (cusparseHandle_t, cusparseOperation_t, Cint,
         Cint, Cint, Ptr{Float64}, Ptr{cusparseMatDescr_t},
         Ptr{Float64}, Ptr{Cint}, Ptr{Cint}, Ptr{Float64},
         Ptr{Float64}, Ptr{Float64}),
        Handle[1],
        cutransa, m, n, Mat.nnz, [alpha], &cudesc, Mat.nzVal,
        Mat.rowPtr, Mat.colVal, X, [beta], Y
    )
end

function test(Handles, dev_X, dev_b, dev_c, idx)
    Pmv!(Handles[idx], 'N',  1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
    device(idx-1)
    return to_host(dev_c[idx])
end


function test2(Handles, dev_X, dev_b, dev_c)

    @sync begin
        for (idx, dev) in enumerate(devlist)
            @async begin
                Pmv!(Handles[idx], 'N',  1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
            end
        end
    end
    Results = Array(Array{Float64}, nGPU)
    for (idx, dev) in enumerate(devlist)
        device(dev)
        Results[idx] = to_host(dev_c[idx]) ## to_host doesn't require setting correct device first.  But, it is  quicker if you do this.
    end

    return Results
end

## Function times given after initial run for compilation
@time a = test(Handles, dev_X, dev_b, dev_c, 1); ## 0.010849 seconds (12 allocations: 8.297 KB)
@time b = test2(Handles, dev_X, dev_b, dev_c);   ## 0.011503 seconds (68 allocations: 19.641 KB)

# julia> a == b[1]
# true
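
One loose end: the cuBLAS documentation quoted above notes that each handle should eventually be destroyed to release the resources of its library context. The modified CUSPARSE.jl above doesn't export a wrapper for that, but the underlying C function cusparseDestroy() can be reached with a direct ccall() in the same style as Pmv!. A minimal cleanup sketch (an assumption on my part, not part of the tested code above; note that the handle value itself is Handles[idx][1], just as Pmv! passes Handle[1]):

## release each CUSPARSE library context once finished (analogue of cublasDestroy() in the quote)
for (idx, dev) in enumerate(devlist)
    device(dev)  ## switch to the device that owns this handle
    ccall((:cusparseDestroy, libcusparse), cusparseStatus_t,
          (cusparseHandle_t,), Handles[idx][1])
end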
