CUDA memory alignment

Question

In my code I use structures to make passing arguments to functions easier (in general I use structures of arrays, not arrays of structures). When I am in cuda-gdb and examine the point in a kernel where I assign values to a simple structure like

struct pt {
    int i;
    int j;
    int k;
};

even though I am not doing anything complicated and it's obvious that the members should hold the values assigned to them, I get...

Asked for position 0 of stack, stack only has 0 elements on it.

So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to

struct __align__(16) pt {
    int i;
    int j;
    int k;
};

but then, when the compiler tries to compile the host-code files that use the same definitions, it gives the following errors:

error: expected unqualified-id before numeric constant
error: expected ‘)’ before numeric constant
error: expected constructor, destructor, or type conversion before ‘;’ token

So, am I supposed to have two different definitions for host and device structures?

Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.

For example, how should the following two be aligned? Or how should a structure with 6 floats be aligned? Or 4 integers? Again, I'm not using arrays of those, but I still define lots of variables with these structures within the kernels or __device__ functions.

struct {
    int a;
    int b;
    int c;
    int d;
    float* el;
};

struct {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};

Thank you in advance for any advice or hints.

Solution

There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.

First, the reason your host compiler gives you errors is that it doesn't know what __align__(n) means, so it reports a syntax error. What you need is to put something like the following in a header for your project.

#if defined(__CUDACC__) // NVCC
  #define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
  #define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
  #define MY_ALIGN(n) __declspec(align(n))
#else
  #error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif

So, am I supposed to have two different definitions for host and device structures?

No, just use MY_ALIGN(n), like this:

struct MY_ALIGN(16) pt { int i, j, k; };
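
As a minimal sketch of how such a shared header might look, the block below repeats the macro, applies it to pt, and adds two compile-time checks. The static_assert lines are an illustration added here, not part of the original answer; they assume a C++11 compiler and a 4-byte int.

```cpp
// Sketch of a header shared by host (.cpp) and device (.cu) translation units.
#if defined(__CUDACC__)   // NVCC
  #define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__)   // GCC (Clang also defines __GNUC__)
  #define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER)   // MSVC
  #define MY_ALIGN(n) __declspec(align(n))
#else
  #error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif

// One definition, usable from both host and device code.
struct MY_ALIGN(16) pt { int i, j, k; };

// Assumption: int is 4 bytes, so pt holds 12 bytes of data.
static_assert(alignof(pt) == 16, "pt should start on a 16-byte boundary");
static_assert(sizeof(pt) == 16, "12 bytes of data are padded up to 16");
```

If either check fails on a given compiler, the host and device layouts would disagree, which is exactly the situation the single macro is meant to prevent.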

For example, how should the following two be aligned?

First, __align__(n) (or any of the host compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding will be inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires that each thread read words of 1, 2, 4, 8, or 16 bytes, each aligned to its own size. So...

struct MY_ALIGN(16) {
  int a;
  int b;
  int c;
  int d;
  float* el;    
};

In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes. 16-byte alignment will waste 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine, the struct takes 24 bytes because of the 8-byte pointer, so it will waste only 8 bytes per struct. We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff is that the hardware will have to use three 8-byte loads instead of two 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align to less than 4 bytes for this struct.
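
The arithmetic above can be written down as a small sketch. The helper names padded_size, wasted_bytes, and loads_per_struct are hypothetical, introduced only for illustration, and the load count assumes each load moves exactly as many bytes as the chosen alignment.

```cpp
#include <cstddef>

// Round a raw struct size up to the next multiple of the alignment.
constexpr std::size_t padded_size(std::size_t raw, std::size_t align) {
    return (raw + align - 1) / align * align;
}

// Bytes of padding added to each struct.
constexpr std::size_t wasted_bytes(std::size_t raw, std::size_t align) {
    return padded_size(raw, align) - raw;
}

// Loads needed per struct, assuming each load moves exactly `align` bytes.
constexpr std::size_t loads_per_struct(std::size_t raw, std::size_t align) {
    return padded_size(raw, align) / align;
}

// Four ints plus one pointer: 20 bytes with 4-byte pointers, 24 with 8-byte pointers.
static_assert(wasted_bytes(20, 16) == 12, "32-bit, 16-byte alignment");
static_assert(wasted_bytes(24, 16) == 8, "64-bit, 16-byte alignment");
static_assert(wasted_bytes(24, 8) == 0, "64-bit, 8-byte alignment");
static_assert(loads_per_struct(24, 16) == 2, "two 16-byte loads");
static_assert(loads_per_struct(24, 8) == 3, "three 8-byte loads");
```

All checks run at compile time, so the snippet compiles only if the numbers in the paragraph are consistent with the rounding rule.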

struct MY_ALIGN(16) {
  int a;
  int b;
  int c;
  int d;
  float* i;
  float* j;
  float* k;
};

In this case, with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines (28 bytes padded to 32), or 8 on 64-bit machines (40 bytes padded to 48). It would require two 16-byte loads (or three on a 64-bit machine). We could eliminate the waste entirely with 4-byte alignment (8-byte on 64-bit machines), but this would result in excessive loads: seven 4-byte loads, or five 8-byte loads on a 64-bit machine. Again, tradeoffs.
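
One way to check these numbers on the host is to let the compiler report the sizes. This is only a sketch: it uses the standard C++11 alignas specifier as a stand-in for MY_ALIGN, and the struct names Aligned16 and NaturalLayout are hypothetical. The expected values are derived from sizeof(float*), so it works with either 4-byte or 8-byte pointers.

```cpp
#include <cstddef>

// alignas(16) plays the role of MY_ALIGN(16) here.
struct alignas(16) Aligned16 {
    int a, b, c, d;
    float *i, *j, *k;
};

// Same members with no explicit alignment: the compiler uses the
// largest member alignment (4 or 8 bytes, depending on pointer size).
struct NaturalLayout {
    int a, b, c, d;
    float *i, *j, *k;
};

// Bytes of actual data: 28 with 4-byte pointers, 40 with 8-byte pointers.
constexpr std::size_t kDataBytes = 4 * sizeof(int) + 3 * sizeof(float*);

constexpr std::size_t round_up(std::size_t n, std::size_t a) {
    return (n + a - 1) / a * a;
}

// 16-byte alignment pads the struct to 32 or 48 bytes.
static_assert(sizeof(Aligned16) == round_up(kDataBytes, 16),
              "size is the data rounded up to a multiple of 16");
// Natural alignment adds no padding for this member order.
static_assert(sizeof(NaturalLayout) == kDataBytes,
              "no padding with natural alignment");
```

The difference sizeof(Aligned16) - kDataBytes is the per-struct waste discussed above.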

or, how should a structure with 6 floats be aligned?

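Again, tradeoffs. Six floats take 24 bytes, so with 16-byte alignment you waste 8 bytes per struct but need only two 16-byte loads, while with 8-byte alignment you waste nothing but need three 8-byte loads.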

or 4 integers?

No tradeoff here. MY_ALIGN(16).
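
Both of these cases can be confirmed with a short compile-time sketch. It again uses alignas as a stand-in for MY_ALIGN, assumes 4-byte int and float, and the struct names are hypothetical.

```cpp
// Six floats: 24 bytes of data.
struct alignas(16) SixFloats16 { float v[6]; };
struct alignas(8)  SixFloats8  { float v[6]; };

// Four ints: 16 bytes of data.
struct alignas(16) FourInts { int v[4]; };

static_assert(sizeof(SixFloats16) == 32, "8 bytes wasted, two 16-byte loads");
static_assert(sizeof(SixFloats8) == 24, "no waste, three 8-byte loads");
static_assert(sizeof(FourInts) == 16, "no waste, one 16-byte load");
```

The four-int struct is the only one where the data size already equals the widest load, which is why there is nothing to trade off.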

Again, I'm not using arrays of those, but I still define lots of variables with these structures within the kernels or __device__ functions.

Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is worth worrying about; it's another good reason to favor structures of arrays over arrays of structures.
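
To make the last point concrete, here is a rough sketch comparing the storage needed by the two layouts for the pt example. The names PtAoS, PtSoA, aos_bytes, and soa_bytes are hypothetical, alignas stands in for MY_ALIGN, and a 4-byte int is assumed.

```cpp
#include <cstddef>

// Array of structures: every element is padded from 12 to 16 bytes.
struct alignas(16) PtAoS { int i, j, k; };

// Structure of arrays: three separate int arrays, no per-element padding.
struct PtSoA { int *i, *j, *k; };

// Storage for n points in each layout.
constexpr std::size_t aos_bytes(std::size_t n) { return n * sizeof(PtAoS); }
constexpr std::size_t soa_bytes(std::size_t n) { return 3 * n * sizeof(int); }

static_assert(aos_bytes(1000) == 16000, "4 bytes of padding per point");
static_assert(soa_bytes(1000) == 12000, "only the data itself");
```

In the structure-of-arrays layout each thread reads plain 4-byte ints, so the alignment question mostly disappears.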
