Is it possible to call cuBLAS or cuBLASLt functions from CUDA 10.1 kernels?


Question


Concerning CUDA 10.1


I'm doing some calculations on geometric meshes with a large amount of independent calculations done per face of the mesh. I run a CUDA kernel which does the calculation for each face.


The calculations involve some matrix multiplication, so I'd like to use cuBLAS or cuBLASLt to speed things up. Since I need to do many matrix multiplications (at least a couple per face) I'd like to do it directly in the kernel. Is this possible?


It doesn't seem like cuBLAS or cuBLASLt allows you to call their functions from kernel (__global__) code. I get the following error from Visual Studio:


"calling a __host__ function from a __device__ function is not allowed"


There are some old answers (e.g. "Could a CUDA kernel call a cublas function?") that imply this is possible, though.


Basically, I'd like a kernel like this:

__global__
void calcPerFace(...)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < faceCount; i += stride)
    {
        // Calculate some matrices for each face in the mesh
        ...
        // Multiply those matrices
        cublasLtMatmul(...) // <- not allowed by cuBLASLt
        // Continue calculation
        ...
    }
}


Is it possible to call cublasLtMatmul or perhaps cublasSgemm from a kernel like this in CUDA 10.1?

Answer

No, it is not possible.


Starting with CUDA 10.0, CUDA no longer supports the ability to call CUBLAS routines from device code.


A deprecation notice was given prior to CUDA 10.0, and the formal announcement exists in the CUDA 10.0 release notes:


The cuBLAS library, to support the ability to call the same cuBLAS APIs from within the device routines (cublas_device), is dropped starting with CUDA 10.0.


Likewise, CUDA sample codes that depended on this capability, such as simpleDevLibCUBLAS, are no longer part of the CUDA toolkit distribution, starting with CUDA 10.0.


This applies to CUBLAS only, and does not mean that the general capability of CUDA dynamic parallelism has been removed.
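To make the distinction concrete, here is a minimal dynamic-parallelism sketch: a parent kernel may still launch child kernels from device code (with relocatable device code enabled, `-rdc=true`); it is only the device-side cuBLAS entry points that were removed. The kernel names here are illustrative, not from the question.

```cuda
// Hypothetical child kernel doing one small matrix multiply.
__global__ void childMultiply(const float* A, const float* B, float* C, int n);

// Dynamic parallelism: a device-side kernel launch is still supported
// in CUDA 10.x; calling cuBLAS from device code is not.
__global__ void parent(const float* A, const float* B, float* C, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        childMultiply<<<1, n>>>(A, B, C, n);  // legal device-side launch
}
```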


I won't be able to respond to questions that ask "why?" or are variants of "why?" I won't be able to respond to questions that ask about future events or topics. There are no technical reasons that this functionality was not workable or could not be supported. The reasons for the change had to do with development and resource priorities. I won't be able to go deeper than that. If you would like to see a change in behavior for CUDA, whether that be in functionality, performance, or documentation, you are encouraged to express your desire by filing a bug at http://developer.nvidia.com. The specific bug filing instructions are linked here.


For CUDA device code that performs some preparatory work, then calls CUBLAS, then performs some other work, the general suggestion would be to break this into a kernel that performs the preparatory work, then launch the desired CUBLAS routines from the host, then perform the remaining work in a subsequent kernel. This does not imply that data would have to be moved back and forth between device and host. When multiple CUBLAS calls would have been performed (e.g. per device thread) then it may be beneficial to investigate the various kinds of CUBLAS batched functionality that are available. It's not possible to give a single recipe to refactor every kind of code. These suggestions may not address every case.
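As a sketch of that refactor, the per-face loop from the question could become a preparatory kernel, one host-side batched GEMM, and a follow-up kernel. All names, sizes, and buffer layouts below are illustrative assumptions (the question does not give them); `cublasSgemmStridedBatched` is a real cuBLAS API that multiplies `faceCount` small matrices in one call, with the operands for face `i` stored at fixed strides.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical kernels: fill per-face operand matrices, then consume the products.
__global__ void prepMatrices(const float* faces, float* A, float* B, int faceCount);
__global__ void finishCalc(const float* C, float* out, int faceCount);

void calcAllFaces(const float* dFaces, float* dA, float* dB, float* dC,
                  float* dOut, int faceCount)
{
    const int M = 4, N = 4, K = 4;          // per-face matrix sizes (example values)
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t handle;
    cublasCreate(&handle);

    int threads = 256;
    int blocks  = (faceCount + threads - 1) / threads;

    // 1) preparatory per-face work in a kernel
    prepMatrices<<<blocks, threads>>>(dFaces, dA, dB, faceCount);

    // 2) one host-launched batched GEMM replaces faceCount in-kernel calls;
    //    face i's operands live at offsets i*M*K, i*K*N, i*M*N
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              M, N, K, &alpha,
                              dA, M, (long long)M * K,
                              dB, K, (long long)K * N, &beta,
                              dC, M, (long long)M * N,
                              faceCount);

    // 3) remaining per-face work in a subsequent kernel; no data
    //    needs to move back to the host in between
    finishCalc<<<blocks, threads>>>(dC, dOut, faceCount);

    cublasDestroy(handle);
}
```

All three steps run on the device and can be issued on the same stream, so the data stays resident in GPU memory throughout, matching the answer's point that no host round-trip is implied.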

