kernel using AoS is faster than using SoA
Problem description
I have two versions of a kernel that perform the same task: filling a linked cell list. The difference between the two kernels is the datatype used to store the particle positions. The first uses a plain float array (4 floats per particle, padded for 128-bit reads/writes); the second uses an array of vec3f structures (a structure holding 3 floats).
Running some tests with nvprof, I found that the second kernel (the one using vec3f) ran faster than the first:
Time(%) Time Calls Avg Min Max Name
42.88 37.26s 2 18.63s 23.97us 37.26s adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
11.00 3.93s 2 1.97s 25.00us 3.93s adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)
The tests fill a linked cell list using 256 and 512000 particles.
My question is: what happened here? I assumed the float array would give better memory access due to coalescing, versus the vec3f structure array, which has unaligned memory. Did I misunderstand something?
These are the kernels. The first one:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    /* Each particle occupies 4 consecutive floats; only x, y, z are used. */
    vec3i cell;
    vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};

    /* Find the grid cell containing this particle. */
    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);
    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    /* Append this particle to the cell's linked list and bump its count. */
    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}
And this is the second kernel:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    /* Each particle is one 12-byte vec3f struct. */
    vec3i cell;
    vec3f _pos = pos[idx];

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);
    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}
This is the vec3f structure:
typedef struct _vec3f {float x, y, z;} vec3f;
This is not an example of AoS vs. SoA. Let's look at the important lines of code and the data structures implicit in them.
Your first, "SoA" or "slow" case:
vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};
                                      ^                    ^                    ^
                                      |                    |                    |
                         These values are stored in *adjacent* memory locations
So an individual thread is accessing, successively, pos[idx*4] plus the 2 locations right after it. This is how a structure gets stored! What you're calling a structure of arrays is in fact an array of structures, judging by the way it is stored in memory. To have a valid "SoA" case, your code would need to look something like this:
vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]};
                                 ^
                                 |
    Adjacent threads will read adjacent values of pos1, pos2, and pos3,
    leading to *coalesced* access.
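To make that concrete, here is a minimal sketch of what a true SoA variant of the kernel could look like. The separate coordinate arrays pos1, pos2, pos3 and the _soa name are hypothetical, not part of the original code; the rest of the body is unchanged:

/* Hypothetical SoA variant: one array per coordinate, so thread idx and
   thread idx+1 read adjacent elements of each array (coalesced loads). */
__global__ void adentu_grid_cuda_filling_kernel_soa (int *head,
                                                     int *linked,
                                                     int *cellNAtoms,
                                                     const float *pos1,  /* all x coordinates */
                                                     const float *pos2,  /* all y coordinates */
                                                     const float *pos3,  /* all z coordinates */
                                                     int nAtoms,
                                                     vec3f origin,
                                                     vec3f h,
                                                     vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    /* Coalesced: consecutive threads read consecutive floats
       from each of the three coordinate arrays. */
    vec3f _pos = (vec3f){pos1[idx], pos2[idx], pos3[idx]};

    vec3i cell;
    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);
    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}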
Your "AoS" or "fast" doesn't really have a different storage format.