How to perform deep copying of struct with CUDA?


Problem description

Programming with CUDA, I am facing a problem trying to copy some data from host to GPU.

I have 3 nested structs like these:

typedef struct {
    char data[128];
    short length;
} Cell;

typedef struct {
    Cell* elements;
    int height;
    int width;
} Matrix;

typedef struct {
    Matrix* tables;
    int count;
} Container;

So a Container "includes" some Matrix elements, which in turn include some Cell elements.

Let's suppose I dynamically allocate the host memory in this way:

Container c;
c.count = 20;
c.tables = (Matrix*)malloc(20 * sizeof(Matrix));    /* cast needed when compiled as C++/.cu */

for (int i = 0; i < 20; i++) {
    Matrix m;
    m.elements = (Cell*)malloc(100 * sizeof(Cell));  /* height and width would also need setting */
    c.tables[i] = m;
}

That is, a Container of 20 Matrices of 100 Cells each.


  • How can I now copy this data to the device memory using cudaMemcpy()?

  • Is there any good way to perform a deep copy of a struct of structs from host to device?

Thank you for your time.

Andrea

Recommended answer

The short answer is "just don't". There are four reasons why I say that:


  1. There is no deep copy functionality in the API
  2. The resulting code you will have to write to set up and copy the structure you have described to the GPU will be ridiculously complex (about 4000 API calls at a minimum, and probably an intermediate kernel, for your 20 Matrix of 100 Cells example); see the hedged sketch after this list
  3. The GPU code using three levels of pointer indirection will have massively increased memory access latency and will break what little cache coherency is available on the GPU
  4. If you want to copy the data back to the host afterwards, you have the same problem in reverse
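
To give a sense of what point 2 means in practice, here is a hedged sketch (my illustration, not code from this answer) of a manual host-to-device deep copy of the Container above. It assumes the Cell/Matrix/Container typedefs from the question, assumes width and height have been filled in on the host, and the helper name copy_container_to_device is invented for the example. Even this simplified variant, which patches the pointers on the host and copies each Matrix's Cell array in a single call rather than cell by cell, still needs a cudaMalloc/cudaMemcpy pair per Matrix plus further calls for the Matrix array itself, and does nothing about error checking, freeing, or copying back:

#include <cuda_runtime.h>
#include <stdlib.h>

/* Hypothetical helper: deep-copies a host Container to the device. */
Container copy_container_to_device(const Container *h_c)
{
    /* Temporary host array of Matrix structs whose elements pointers
       will be *device* pointers. */
    Matrix *tmp = (Matrix*)malloc(h_c->count * sizeof(Matrix));

    for (int i = 0; i < h_c->count; i++) {
        const Matrix *h_m = &h_c->tables[i];
        size_t bytes = (size_t)h_m->width * h_m->height * sizeof(Cell);

        tmp[i] = *h_m;                                /* copies width/height      */
        cudaMalloc((void**)&tmp[i].elements, bytes);  /* device Cell array        */
        cudaMemcpy(tmp[i].elements, h_m->elements, bytes,
                   cudaMemcpyHostToDevice);           /* one copy per Matrix      */
    }

    /* Copy the patched Matrix array itself to the device. */
    Container d_c;
    d_c.count = h_c->count;
    cudaMalloc((void**)&d_c.tables, h_c->count * sizeof(Matrix));
    cudaMemcpy(d_c.tables, tmp, h_c->count * sizeof(Matrix),
               cudaMemcpyHostToDevice);

    free(tmp);
    return d_c;   /* d_c.tables and every elements pointer are device-only */
}

The returned Container can be passed to a kernel by value, but every pointer inside it is only dereferenceable on the device, which is exactly the triple indirection that point 3 warns about.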

Consider using linear memory and indexing instead. It is portable between host and GPU, and the allocation and copy overhead is about 1% of the pointer-based alternative.
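
As a rough sketch of that alternative (again my illustration, with made-up kernel and macro names, reusing the Cell typedef from the question): all the Cells live in one contiguous array, the cell that used to sit at tables[t].elements[i] is addressed as cells[t * CELLS_PER_TABLE + i], and a single allocation plus a single copy in each direction replaces the whole pointer-patching exercise:

#include <cuda_runtime.h>
#include <stdlib.h>

#define NUM_TABLES      20
#define CELLS_PER_TABLE 100   /* e.g. height * width of one Matrix */

__global__ void process(Cell *cells, int cells_per_table)
{
    int table = blockIdx.x;
    int cell  = threadIdx.x;
    /* Flat index replaces container.tables[table].elements[cell] */
    Cell *c = &cells[table * cells_per_table + cell];
    c->length = 0;   /* ...do the real per-cell work here... */
}

int main(void)
{
    size_t bytes = NUM_TABLES * CELLS_PER_TABLE * sizeof(Cell);
    Cell *h_cells = (Cell*)calloc(NUM_TABLES * CELLS_PER_TABLE, sizeof(Cell));

    Cell *d_cells;
    cudaMalloc((void**)&d_cells, bytes);                          /* one allocation */
    cudaMemcpy(d_cells, h_cells, bytes, cudaMemcpyHostToDevice);  /* one copy in    */

    process<<<NUM_TABLES, CELLS_PER_TABLE>>>(d_cells, CELLS_PER_TABLE);

    cudaMemcpy(h_cells, d_cells, bytes, cudaMemcpyDeviceToHost);  /* one copy back  */
    cudaFree(d_cells);
    free(h_cells);
    return 0;
}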

If you really want to do this, leave a comment and I will try and dig up some old code examples which show what a complete folly nested pointers are on the GPU.
