How to structure data for optimal speed in a CUDA app


Problem Description



I am attempting to write a simple particle system that leverages CUDA to do the updating of the particle positions. Right now I am defining a particle as an object with a position defined by three float values, and a velocity also defined by three float values. When updating the particles, I am adding a constant value to the Y component of the velocity to simulate gravity, then adding the velocity to the current position to come up with the new position. In terms of memory management, is it better to maintain two separate arrays of floats to store the data, or to structure it in an object-oriented way? Something like this:

struct Vector
{
    float x, y, z;
};

struct Particle
{
    Vector position;
    Vector velocity;
};
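For comparison, here is a minimal sketch of the array-based alternative described above (this is not in the original post, and the names are illustrative). The second form, with one array per component, is the layout the answer below indexes as position.x[...]:

// (a) "Two separate arrays of floats", x, y, z packed per particle:
float *positions;    // x0, y0, z0, x1, y1, z1, ...
float *velocities;   // x0, y0, z0, x1, y1, z1, ...

// (b) One array per component (a structure of arrays):
struct VectorArrays
{
    float *x, *y, *z;
};

VectorArrays position;   // access as position.x[i], position.y[i], position.z[i]
VectorArrays velocity;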

It seems like the size of the data is the same with either method (4 bytes per float, 3 floats per Vector, 2 Vectors per Particle, for 24 bytes per particle in total). It also seems like the OO approach would allow more efficient data transfer between the CPU and GPU, because I could use a single memory copy statement instead of 2 (and in the long run more, as there are a few other bits of information about particles that will become relevant, like Age, Lifetime, Weight/Mass, Temperature, etc.). And then there's also just the simple readability of the code and ease of dealing with it that also makes me inclined toward the OO approach. But the examples I have seen don't utilize structured data, so it makes me wonder if there's a reason.
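To make the transfer point concrete, here is a hedged sketch of the host-to-device copies under the two layouts (particleCount and the h_/d_ pointer names are assumptions, and allocation with cudaMalloc is omitted):

// AoS struct layout: the whole particle state moves in one call.
cudaMemcpy(d_particles, h_particles,
           particleCount * sizeof(Particle), cudaMemcpyHostToDevice);

// Two packed float arrays: one call per array.
cudaMemcpy(d_positions, h_positions,
           3 * particleCount * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_velocities, h_velocities,
           3 * particleCount * sizeof(float), cudaMemcpyHostToDevice);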

So the question is which is better: individual arrays of data or structured objects?

Solution

It's common in data parallel programming to talk about "Struct of Arrays" (SOA) versus "Array of Structs" (AOS): the separate float arrays you describe are an SOA layout, while the array of Particle structs is an AOS layout. Many parallel programming paradigms, in particular SIMD-style paradigms, prefer SOA.

In GPU programming, the reason SOA is typically preferred is to optimise accesses to global memory. You can view the recorded presentation on Advanced CUDA C from GTC last year for a detailed description of how the GPU accesses memory.

The main point is that memory transactions have a minimum size of 32 bytes and you want to maximise the efficiency of each transaction.

With AOS:

position[base + tid].x = position[base + tid].x + velocity[base + tid].x * dt;
//  ^ write to every third address                    ^ read from every third address
//                           ^ read from every third address

With SOA:

position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
//  ^ write to consecutive addresses                  ^ read from consecutive addresses
//                           ^ read from consecutive addresses

In the second case, reading from consecutive addresses means that you have 100% efficiency versus 33% in the first case. Note that on older GPUs (compute capability 1.0 and 1.1) the situation is much worse (13% efficiency).
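As a rough back-of-the-envelope check of those figures, assuming 4-byte floats and a 32-thread warp (the exact numbers depend on alignment and the hardware generation):

// SOA: 32 consecutive x values = 128 contiguous bytes
//      -> four fully used 32-byte transactions -> ~100% efficiency
// AOS: the same 32 x values sit sizeof(Vector) = 12 bytes apart,
//      so 32 * 12 = 384 bytes are fetched for 128 useful bytes -> ~33%
// Compute capability 1.0/1.1: the strided pattern is not coalesced at all,
//      so each read becomes its own transaction of at least 32 bytes:
//      32 * 32 = 1024 bytes fetched for 128 useful bytes -> ~13%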

There is one other possibility - if you had two or four floats in the struct then you could read the AOS with 100% efficiency:

float4 lpos;
float4 lvel;
lpos = position[base + tid];
lvel = velocity[base + tid];
lpos.x += lvel.x * dt;
//...
position[base + tid] = lpos;
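Putting the pieces together, a complete kernel built around the float4 idea might look like the sketch below (the kernel name, the bounds check, and the gravity constant are illustrative assumptions rather than part of the original answer; base + tid from the snippets above corresponds to the global thread index here):

// position and velocity stored as one float4 per particle; the unused .w
// component could later hold another per-particle attribute such as mass.
__global__ void updateParticles(float4 *position, float4 *velocity,
                                float dt, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // "base + tid"
    if (tid >= n) return;

    // Each 16-byte float4 load/store is fully coalesced across the warp.
    float4 lpos = position[tid];
    float4 lvel = velocity[tid];

    lvel.y += -9.8f * dt;       // constant "gravity" on the Y velocity
    lpos.x += lvel.x * dt;
    lpos.y += lvel.y * dt;
    lpos.z += lvel.z * dt;

    position[tid] = lpos;
    velocity[tid] = lvel;
}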

Again, check out the Advanced CUDA C presentation for the details.
