Access CUDA global device variable from host


Question

I was wondering if there is an official source for why the following works:

#include <iostream>

struct Array{
    int el[10000];
};

__device__ Array devAr;

void test(Array& ar = devAr){
    for(int i=0; i<10000; i++)
        ar.el[i] = i;
    std::cout << ar.el[0] + ar.el[9999] << std::endl;
}

int main(){
    test();
}

You get a warning "a __device__ variable "devAr" cannot be directly read in a host function" if you try to access devAr directly, but when you go through the reference there is no such warning (for good reason). Yet in both cases it is possible to access the variable from the host, so there appears to be a host instance of that variable.

What I need to know: can I take this for granted?

Another test case showing the values of the pointers:

#include <iostream>
#include <cstdio>

__device__ int devAr[2];

__global__ void foo(){
    printf("Device: %p\n", &devAr);
    devAr[0] = 1337;
}

int main()
{
    devAr[0] = 4;
    std::cout << devAr[0] << std::endl;
    void* ad;
    cudaGetSymbolAddress(&ad, devAr);
    std::cout << ad << " " << &devAr << std::endl;
    foo<<<1,1>>>();
    cudaDeviceSynchronize();
    int arHost[2];
    cudaMemcpyFromSymbol(arHost, devAr, sizeof(arHost), 0);
    std::cout << "values: " << arHost[0] << std::endl;
}

Output:

4
0x500bc0000 0x66153c
Device: 0x500bc0000
values: 1337

Answer

What you are doing is invalid and you should listen to the warning:


a __device__ variable "devAr" cannot be directly read in a host function

First, let me simplify your code to the minimum necessary to show the issue:

#include <iostream>

__device__ int devAr[1];

int main()
{
    devAr[0] = 4;
    std::cout << devAr[0] << std::endl;
}

Now, what is happening here:


  1. __device__ int devAr[1]; allocates a fixed-size array in device memory and stores the pointer to this device memory inside the devAr variable (hence the warning).
  2. The devAr address points to a valid piece of device memory; such an address can even be used in host code, because host and device memory use addresses of the same format. However, in host code devAr points to some random, uninitialized piece of host memory.
  3. Based on the above, one can say that devAr[0] = 4; just writes 4 into some random, uninitialized location in host memory.

Try running the following code; perhaps it will help you understand what is happening under the hood:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>

using namespace std;

__device__ int devAr[1];

__global__ void foo()
{
    printf("dev: %d \n", devAr[0]);
    devAr[0] = 5;
    printf("dev: %d \n", devAr[0]);
}

int main()
{
    cout << "host: " << devAr[0] << endl;
    devAr[0] = 4;
    cout << "host: " << devAr[0] << endl;

    foo<<<1, 1>>>();
    cudaDeviceSynchronize();
    cout << "host: " << devAr[0] << endl;
}

The output will be:

host: 0
host: 4
dev: 0
dev: 5
host: 4
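For contrast, the supported way to move data between the host and the __device__ symbol is the runtime's symbol API. A minimal sketch (error checking omitted), assuming a working CUDA setup:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int devAr[1];

int main()
{
    // Write to the device-side symbol from the host via the symbol API,
    // instead of assigning through the host-side shadow of devAr.
    int val = 4;
    cudaMemcpyToSymbol(devAr, &val, sizeof(val));

    // Read it back the same way.
    int back = 0;
    cudaMemcpyFromSymbol(&back, devAr, sizeof(back));
    printf("host sees: %d\n", back);
}
```

With this, the host reads back the value it wrote, because both operations target the device-side object rather than the host shadow.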

Update:

After clarifying in the comments below what you are asking, I started digging into the issue and found a couple of related SO threads. Most of the quotations come from comments under the answers there:


  • cudaMemcpy() vs cudaMemcpyFromSymbol():


    any statically defined device symbol (__device__, __constant__, even textures) results in the toolchain emitting two symbols, one in the device module, the other in the host object. The CUDA runtime sets up and maintains a dynamic mapping between these two symbols. The symbol API calls are the way of retrieving this mapping for __constant__ and __device__ symbols. The texture APIs retrieve the mapping for the texture symbols, etc.
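The mapping described in that quote can be exercised directly: cudaGetSymbolAddress resolves the host-side symbol to the device-side address, which the generic copy APIs then accept. A sketch (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int devAr[2];

int main()
{
    // Resolve the host-side symbol to the address of the device-side object.
    void *devPtr = nullptr;
    cudaGetSymbolAddress(&devPtr, devAr);

    // devPtr is a real device address, so plain cudaMemcpy accepts it.
    int vals[2] = {10, 20};
    cudaMemcpy(devPtr, vals, sizeof(vals), cudaMemcpyHostToDevice);

    // &devAr taken in host code is NOT this address - it names the host shadow.
    printf("device address: %p, host shadow: %p\n", devPtr, (void*)&devAr);
}
```

The two printed addresses differ, just as in the questioner's second test case.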


  • Using global memory vs constant memory in CUDA:


    * PNT is a __device__ variable, not a host variable containing the address of a device variable. (Confusing, I know.) Therefore if you try to access it on the host as with (void**)&PNT, you are trying to read a device variable from the host, which is not permitted. From the host code point of view it's just a symbol, so you need to use cudaGetSymbolAddress() to store the device address in a host variable that you can then pass to cudaMemcpyToSymbol(), as @talonmies shows.


  • CUDA constant memory error:


    Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel - if you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.


  • cudaMemcpyToSymbol vs cudaMemcpy: why is it still around (cudaMemcpyToSymbol):


    A copy to that address via the CUDA API would fail with an invalid argument error, because it isn't an address in GPU memory space that the API had previously allocated. And yes, this applies to generic __device__ pointers and statically declared device symbols as well.
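That failure mode is easy to reproduce. In the sketch below, passing the address of the host-side shadow as a device destination is rejected by the runtime; the exact error string is whatever the installed CUDA runtime reports (typically "invalid argument"), not something the quote guarantees:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int devAr[1];

int main()
{
    int val = 4;
    // &devAr in host code names the host shadow, not device memory, so the
    // runtime rejects it as a destination for a host-to-device copy.
    cudaError_t err = cudaMemcpy(&devAr, &val, sizeof(val),
                                 cudaMemcpyHostToDevice);
    printf("cudaMemcpy to host shadow: %s\n", cudaGetErrorString(err));
}
```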


  • cudaMemcpyFromSymbol on a __device__ variable:


    The root of the problem is that you are not allowed to take the address of a device variable in ordinary host code: ... although this seems to compile correctly, the actual address passed is garbage. To take the address of a device variable in host code, we can use cudaGetSymbolAddress.


Based on this evidence, let me try to update my original 3-step explanation from above:


    1. __device__ int devAr[1]; allocates a fixed-size array in device memory and stores "hooks into a runtime device symbol lookup" in the host version of the devAr variable (see linked resources 1 and 3).
    2. The devAr address is just garbage from the host's point of view and should only be used with the symbol API calls, such as cudaGetSymbolAddress (all of the linked resources appear to support this theory), because it maps to the device version of the devAr variable.

    I was not able to come up with anything "more concrete", such as a link to the CUDA documentation, but I hope this is now clear enough. All in all, it seems you do have a guarantee for the behavior described above (i.e. there is a host and a device version of the devAr variable), but to me it rather appears to be an implementation detail: you should not rely on it, and you should not use the host version of the devAr variable for purposes other than the symbol API calls.
