Are there advantages to using the CUDA vector types?

Question

CUDA provides built-in vector data types like uint2, uint4 and so on. Are there any advantages to using these data types?

Let's assume that I have a tuple which consists of two values, A and B. One way to store them in memory is to allocate two arrays. The first array stores all the A values and the second array stores all the B values at indexes that correspond to the A values. Another way is to allocate one array of type uint2. Which one should I use? Which way is recommended? Do the members of uint3, i.e. x, y, z, reside side by side in memory?
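
For concreteness, a minimal sketch of the two layouts under discussion (the names d_A, d_B, d_AB, the helper function and the element count n are illustrative, not from the question):

#include <cuda_runtime.h>

// Structure of arrays (SoA): two separate device buffers, one per component.
void allocate_layouts(size_t n)
{
    unsigned int *d_A = nullptr, *d_B = nullptr;
    cudaMalloc(&d_A, n * sizeof(unsigned int));
    cudaMalloc(&d_B, n * sizeof(unsigned int));

    // Array of structures (AoS): one buffer of the built-in uint2 type,
    // so each element stores its A and B values next to each other.
    uint2 *d_AB = nullptr;
    cudaMalloc(&d_AB, n * sizeof(uint2));

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_AB);
}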

Answer

I'm mainly familiar with Compute Capability 2.0 (Fermi). For this architecture, I don't think that there is any performance advantage to using the vectorized types, except maybe for 8- and 16-bit types.

The declaration of char4:

struct __device_builtin__ __align__(4) char4
{
    signed char x, y, z, w;
};

The type is aligned to 4 bytes. I don't know what __device_builtin__ does. Maybe it triggers some magic in the compiler...
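
As an aside (a sketch of mine, not code from the answer), the 4-byte alignment is what lets the compiler move a whole char4 with a single 32-bit memory access instead of four separate 8-bit ones:

__global__ void copy_char4(const char4 *in, char4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // with __align__(4), this can be a single 32-bit load/store
}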

Things look a bit strange for the declarations of float1, float2, float3 and float4:

struct __device_builtin__ float1
{
    float x;
};

__cuda_builtin_vector_align8(float2, float x; float y;);

struct __device_builtin__ float3
{
    float x, y, z;
};

struct __device_builtin__ __builtin_align__(16) float4
{
    float x, y, z, w;
};

float2 gets some form of special treatment. float3 is a struct without any alignment and float4 gets aligned to 16 bytes. I'm not sure what to make of that.
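
If it helps, these properties can be checked at compile time. A small sketch; the expected values are my reading of the declarations above, not something stated in the answer:

#include <cuda_runtime.h>

static_assert(sizeof(float2) == 8  && alignof(float2) == 8,  "float2 is 8-byte aligned");
static_assert(sizeof(float3) == 12 && alignof(float3) == 4,  "float3 has only the natural float alignment");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 is 16-byte aligned");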

Global memory transactions are 128 bytes, aligned to 128 bytes. Transactions are always performed for a full warp at a time. When a warp reaches a function that performs a memory transaction, say a 32-bit load from global memory, the chip will at that time perform as many transactions as are necessary for servicing all the 32 threads in the warp. So, if all the accessed 32-bit values are within a single 128-byte line, only one transaction is necessary. If the values come from different 128-byte lines, multiple 128-byte transactions are performed. For each transaction, the warp is put on hold for around 600 cycles while the data is fetched from memory (unless it's in the L1 or L2 caches).
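
As an illustration (a sketch, not code from the answer), the following kernel makes consecutive threads of a warp read consecutive 32-bit values, so the warp's 32 loads fit in a single 128-byte line when the input pointer is suitably aligned:

__global__ void copy_u32(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive addresses
    if (i < n)
        out[i] = in[i];                              // 32 threads * 4 bytes = one 128-byte transaction
}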

So, I think the key to finding out which approach gives the best performance is to consider which approach causes the fewest 128-byte memory transactions.

Assuming that the built-in vector types are just structs, some of which have special alignment, using the vector types causes the values to be stored in an interleaved way in memory (array of structs). So, if the warp is loading all the x values at that point, the other values (y, z, w) will be pulled into L1 because of the 128-byte transactions. When the warp later tries to access those, it's possible that they are no longer in L1, and so, new global memory transactions must be issued. Also, if the compiler is able to issue wider instructions to read more values in at the same time, for future use, it will be using registers for storing those between the point of the load and the point of use, perhaps increasing the register usage of the kernel.

On the other hand, if the values are packed into a struct of arrays, the load can be serviced with as few transactions as possible. So, when reading from the x array, only x values are loaded in the 128-byte transactions. This could cause fewer transactions, less reliance on the caches and a more even distribution between compute and memory operations.
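
A rough sketch of the contrast (kernel names and the plain copy operation are illustrative, not from the answer): with the uint2-style AoS layout, loading the x components also drags the unused y components into the 128-byte lines, whereas with the SoA layout every byte of those lines is an x value:

__global__ void copy_x_aos(const uint2 *ab, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = ab[i].x;   // y is fetched into the cache lines but never used
}

__global__ void copy_x_soa(const unsigned int *a, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i];      // only x values occupy the 128-byte transactions
}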

