How can I pass a struct to a kernel in JCuda

Question

I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda.

I would imagine the usual implementation would be to create a case class (scala terminology) that extends some native api, which can then be turned into a struct that can be safely passed into the kernel. Unfortunately I haven't found anything on google, hence the question.

Answer

(The author of JCuda here (not "JCUDA", please))

As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial.

For the reasons why it is rarely beneficial to use structs at all in GPU programming, you will have to refer to the results that you'll find when you search for the difference between

"Array Of Structures" versus "Structure Of Arrays".

Usually, the latter is preferred for GPU computations, due to improved memory coalescing, but this is beyond what I can summarize in depth in this answer. Here, I will only outline why using structs in GPU computing is a bit difficult in general, and particularly difficult in JCuda/Java.
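
To give a rough idea of what the "Structure Of Arrays" approach looks like on the JCuda side: each field is kept in its own plain array and copied to the device separately, and the kernel receives one pointer per field. This is only a minimal sketch under that assumption (the variable names and the implied kernel signature are made up for illustration):

import jcuda.Pointer;
import jcuda.Sizeof;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

// "Structure Of Arrays": one plain array per field,
// instead of one array of Vertex objects
int n = 1000;
float x[] = new float[n];
float y[] = new float[n];
float z[] = new float[n];
// ... fill x, y and z on the host ...

// Copy each field array to the device individually; the kernel
// would then take separate parameters (float *x, float *y, float *z)
Pointer deviceX = new Pointer();
cudaMalloc(deviceX, n * Sizeof.FLOAT);
cudaMemcpy(deviceX, Pointer.to(x), n * Sizeof.FLOAT, cudaMemcpyHostToDevice);
// (analogously for y and z)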

In plain C, structs are (theoretically!) very simple, regarding the memory layout. Imagine a structure like

struct Vertex {
    short a;
    float x;
    float y;
    float z;
    short b;
};

Now you can create an array of these structs:

Vertex* vertices = (Vertex*)malloc(n*sizeof(Vertex));

These structs are guaranteed to be laid out as one contiguous memory block:

            |   vertices[0]      ||   vertices[1]      |
            |                    ||                    |
vertices -> [ a|  x |  y |  z | b][ a|  x |  y |  z | b]....

Since the CUDA kernel and the C code are compiled with the same compiler, there is not much room for misunderstandings. The host side says "Here is some memory, interpret this as Vertex objects", and the kernel will receive the same memory and work with it.

Still, even in plain C, there is in practice some potential for unexpected problems. Compilers will often introduce padding into these structs, to achieve certain alignments. The example structure might thus in fact have a layout like this:

struct Vertex {
    short a;         // 2 bytes
    char PADDING_0;  // Padding byte
    char PADDING_1;  // Padding byte
    float x;         // 4 bytes
    float y;         // 4 bytes
    float z;         // 4 bytes
    short b;         // 2 bytes
    char PADDING_2;  // Padding byte
    char PADDING_3;  // Padding byte
};

Something like this may be done in order to make sure that the structures are aligned to 32-bit (4-byte) word boundaries. Moreover, there are certain pragmas and compiler directives that may influence this alignment. CUDA additionally prefers certain memory alignments, and therefore such directives are used heavily in the CUDA headers.

In short: When you define a struct in C and then print sizeof(YourStruct) (or the actual layout of the struct) to the console, you will have a hard time predicting what it will actually print. Expect some surprises.

In JCuda/Java, the world is different. There simply are no structs. When you create a Java class like

class Vertex {
    short a;
    float x;
    float y;
    float z;
    short b;
}

and then create an array of these:

Vertex vertices[] = new Vertex[2];
vertices[0] = new Vertex();
vertices[1] = new Vertex();

then these Vertex objects may be arbitrarily scattered in memory. You don't even know how large one Vertex object is, and you will hardly be able to find out. Thus, trying to create an array of structures in JCuda and pass it to a CUDA kernel simply does not make sense.

However, as mentioned above: It is still possible, in some form. If you know the memory layout that your structures will have in the CUDA kernel, then you can create a memory block that is "compatible" with this structure layout, and fill it from Java side. For something like the struct Vertex mentioned above, this could roughly (involving some pseudocode) look like this:

// 1 short + 3 floats + 1 short, no padding (assuming this
// matches the layout that the CUDA compiler produces)
int sizeOfVertex = 2 + 4 + 4 + 4 + 2;

// Allocate host data for 2 vertices, in native byte order
ByteBuffer data = ByteBuffer.allocateDirect(sizeOfVertex * 2)
    .order(ByteOrder.nativeOrder());

// Set vertices[0].a, vertices[0].x and vertices[0].y
data.position(0).asShortBuffer().put(0, a0);
data.position(2).asFloatBuffer().put(0, x0);
data.position(2).asFloatBuffer().put(1, y0);

// Set vertices[1].a, vertices[1].x and vertices[1].y
data.position(sizeOfVertex+0).asShortBuffer().put(0, a1);
data.position(sizeOfVertex+2).asFloatBuffer().put(0, x1);
data.position(sizeOfVertex+2).asFloatBuffer().put(1, y1);

// Copy the Vertex data to the (previously cudaMalloc'ed) device memory
cudaMemcpy(deviceData, Pointer.to(data), sizeOfVertex * 2,
    cudaMemcpyHostToDevice);

It basically boils down to keeping the memory in a ByteBuffer, and manually accessing the memory regions that correspond to the desired fields of the desired structs.

However, a warning: You have to consider the possibility that this will not be perfectly portable among different CUDA-C compiler versions or platforms. When you compile your kernel (which contains the struct definition) once on a 32-bit Linux machine and once on a 64-bit Windows machine, the structure layout might be different (and your Java code would have to be aware of this).

(Note: One could define interfaces to simplify these accesses. For JOCL, I tried to create utility classes that feel a bit more like C structs and automate the copying process to some extent. But in any case, it will be inconvenient (and will not achieve really good performance) compared to plain C.)
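
For illustration, such a convenience layer could look roughly like the sketch below. This is purely hypothetical (the class name, the hard-coded offsets and the assumption of a padding-free 16-byte layout are made up here, and this is not part of JCuda or of the JOCL utilities mentioned above); it merely wraps the manual offset arithmetic from the example:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper that hides the offset arithmetic for the
// (assumed, padding-free) 16-byte Vertex layout used above
class VertexBuffer {
    private static final int SIZE = 16;      // assumed sizeof(Vertex) on the device
    private static final int OFFSET_A = 0;   // short a
    private static final int OFFSET_X = 2;   // float x
    private static final int OFFSET_Y = 6;   // float y
    private static final int OFFSET_Z = 10;  // float z
    private static final int OFFSET_B = 14;  // short b

    private final ByteBuffer data;

    VertexBuffer(int numVertices) {
        data = ByteBuffer.allocateDirect(numVertices * SIZE)
            .order(ByteOrder.nativeOrder());
    }

    void setA(int index, short a) { data.putShort(index * SIZE + OFFSET_A, a); }
    void setX(int index, float x) { data.putFloat(index * SIZE + OFFSET_X, x); }
    void setY(int index, float y) { data.putFloat(index * SIZE + OFFSET_Y, y); }
    void setZ(int index, float z) { data.putFloat(index * SIZE + OFFSET_Z, z); }
    void setB(int index, short b) { data.putShort(index * SIZE + OFFSET_B, b); }

    // The buffer can then be wrapped with Pointer.to(...) for the copy
    ByteBuffer getData() { return data; }
}

Whether such a wrapper is worthwhile is debatable; as noted above, the offsets would still have to match whatever layout the CUDA compiler actually chooses for the struct.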
