kernel using AoS is faster than using SoA


Problem Description


I have two versions of a kernel that performs the same task (filling a linked cell list). The difference between the two kernels is the datatype used to store the particle positions: the first uses a float array (4 floats per particle, due to 128-bit reads/writes), and the second uses an array of vec3f structures (a structure holding 3 floats).

Doing some tests using nvprof, I found that the second kernel (which uses vec3f) ran faster than the first one:

 Time(%)      Time   Calls       Avg       Min       Max  Name
   42.88    37.26s       2    18.63s   23.97us    37.26s  adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
   11.00     3.93s       2     1.97s   25.00us     3.93s  adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)

The tests fill a linked cell list using 256 and 512,000 particles.

My question is, what is happening here? I assumed the float array would give better memory access due to coalescing, versus the vec3f structure array with its unaligned memory. Did I misunderstand something?

These are the kernels; the first one:

__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos, 
                                                 int nAtoms, 
                                                 vec3f origin, 
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};

    cell.x =  floor ((_pos.x - origin.x)/h.x);
    cell.y =  floor ((_pos.y - origin.y)/h.y);
    cell.z =  floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}

And this is the second kernel:

__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = pos[idx];

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}

This is the vec3f structure:

typedef struct _vec3f { float x, y, z; } vec3f;

Solution

This is not an example of AoS vs. SoA. Let's look at the important lines of code and the data structures implicit in them.

Your first, "SoA" or "slow" case:

vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};
                                      ^                    ^                    ^
                                      |                    |                    |
                               These values are stored in *adjacent* memory locations

So an individual thread successively accesses pos[idx*4] plus the 2 locations right after it. This is how a structure gets stored! What you're calling a Structure of Arrays is in fact an Array of Structures, given the way it is laid out in memory. To have a valid "SoA" case, your code would need to look something like this:

vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]};
                                 ^
                                 |
               Adjacent threads will read adjacent values for pos1, pos2, and pos3
                    leading to *coalesced* access.

Your "AoS" or "fast" case doesn't really have a different storage format.
