Aligned and unaligned memory access with AVX/AVX2 intrinsics


Question


According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.

vaddps ymm0,ymm0,YMMWORD PTR [rax]

the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as

vmovaps ymm0,YMMWORD PTR [rax]

the load address has to be aligned (to multiples of 32), otherwise an exception is raised.

What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1L << 26)
#define OFFSET 1

int main() {
  float *data;
  assert(!posix_memalign((void**)&data, 32, SIZE*sizeof(float)));
  for (unsigned i = 0; i < SIZE; i++) data[i] = drand48();
  float res[8]  __attribute__ ((aligned(32)));
  __m256 sum = _mm256_setzero_ps(), elem;
  for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
    elem = _mm256_load_ps(d);
    // sum = _mm256_add_ps(elem, elem);
    sum = _mm256_add_ps(sum, elem);
  }
  _mm256_store_ps(res, sum);
  for (int i = 0; i < 8; i++) printf("%g ", res[i]); printf("\n");
  return 0;
}

(Yes, I know the code is faulty, since I use an aligned load on unaligned addresses, but bear with me...)
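As an aside, an explicit alignment check makes this kind of bug fail deterministically instead of depending on the compiler's instruction selection. The helper below is a hypothetical addition for illustration, not part of the original program:

#include <stdint.h>   /* uintptr_t */
#include <assert.h>

/* Hypothetical helper: abort if p is not 32-byte aligned, which is the
   requirement of _mm256_load_ps / vmovaps. */
static inline void assert_avx_aligned(const void *p) {
  assert(((uintptr_t)p & 31) == 0);
}

Calling assert_avx_aligned(d) before the _mm256_load_ps in the loop would abort at OFFSET 1 even in builds where the load gets folded into vaddps and would otherwise execute without a fault.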

I compile the code with

g++ -Wall -O3 -march=native -o memtest memtest.C

on a CPU with AVX. If I check the code generated by g++ by using

objdump -S -M intel-mnemonic memtest | more

I see that the compiler does not generate an aligned load instruction, but loads the data directly in the vector addition instruction:

vaddps ymm0,ymm0,YMMWORD PTR [rax]

The code executes without any problem, even though the memory addresses are not aligned (OFFSET is 1). This is clear since vaddps tolerates unaligned addresses.

If I uncomment the line with the second addition intrinsic, the compiler cannot fuse the load and the addition since vaddps can only have a single memory source operand, and generates:

vmovaps ymm0,YMMWORD PTR [rax]
vaddps ymm1,ymm0,ymm0
vaddps ymm0,ymm1,ymm0

And now the program seg-faults, since a dedicated aligned load instruction is used, but the memory address is not aligned. (The program doesn't seg-fault if I use _mm256_loadu_ps, or if I set OFFSET to 0, by the way.)
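For reference, the robust variant mentioned in that parenthesis only swaps the load intrinsic; everything else in the program above stays the same:

for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
  elem = _mm256_loadu_ps(d);   /* emits vmovups, which has no alignment requirement */
  sum = _mm256_add_ps(sum, elem);
}

On AVX-capable CPUs an unaligned load from an address that happens to be aligned runs at essentially full speed, so _mm256_loadu_ps is a safe default whenever alignment cannot be guaranteed.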

This leaves the programmer at the mercy of the compiler and makes the behavior partly unpredictable, in my humble opinion.

My question is: Is there a way to force the C compiler to either generate a direct load in a processing instruction (such as vaddps) or to generate a dedicated load instruction (such as vmovaps)?

Solution

There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.
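A minimal sketch of what "use assembly" can mean in practice, using GCC extended inline asm (the helper names add_folded and add_split are hypothetical, and the asm strings use GCC's default AT&T syntax, so the operand order is reversed relative to the Intel-syntax listings above). Because the optimizer never rewrites the inside of an asm statement, the folded/split choice survives regardless of how the loaded value is used:

#include <x86intrin.h>

/* Force a folded load: the "m" constraint hands vaddps a 32-byte memory
   operand directly, so no separate load instruction is emitted. The cast
   only describes a 32-byte memory region to the asm statement. */
static inline __m256 add_folded(__m256 sum, const float *p) {
  __asm__("vaddps %1, %0, %0" : "+x"(sum) : "m"(*(const __m256 *)p));
  return sum;
}

/* Force a dedicated (unaligned) load, then add register-to-register. */
static inline __m256 add_split(__m256 sum, const float *p) {
  __m256 elem;
  __asm__("vmovups %1, %0" : "=x"(elem) : "m"(*(const __m256 *)p));
  return _mm256_add_ps(sum, elem);
}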

In previous versions of GCC I was able to control the folding to some degree by using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). I mean, for example, in the function AddDot4x4_vec_block_8wide here, the loads are folded:

vmulps  ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps  ymm8, ymm9, ymm8

However, in a previous version of GCC the loads were not folded:

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

The correct solution, obviously, is to use aligned loads only when you know the data is aligned, and to use assembly if you really want to explicitly control the folding.
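In the spirit of that advice, a common pattern (sketched here, not taken from the answer itself) is to peel scalar iterations until the pointer reaches a 32-byte boundary, after which the aligned load is safe by construction. The function name sum_floats is hypothetical:

#include <x86intrin.h>
#include <stdint.h>   /* uintptr_t */
#include <stddef.h>   /* size_t */

/* Sum n floats starting at a possibly unaligned p: peel scalar
   iterations until p is 32-byte aligned, then use aligned vector loads. */
static float sum_floats(const float *p, size_t n) {
  const float *end = p + n;
  float ssum = 0.0f;

  /* Scalar prologue: advance p to the next 32-byte boundary. */
  while (((uintptr_t)p & 31) && p < end) ssum += *p++;

  /* Aligned vector loop: _mm256_load_ps (vmovaps) cannot fault here. */
  __m256 vsum = _mm256_setzero_ps();
  for (; p + 8 <= end; p += 8)
    vsum = _mm256_add_ps(vsum, _mm256_load_ps(p));

  /* Horizontal reduction of the 8 vector lanes. */
  float lanes[8] __attribute__ ((aligned(32)));
  _mm256_store_ps(lanes, vsum);
  for (int i = 0; i < 8; i++) ssum += lanes[i];

  /* Scalar epilogue for the remaining (fewer than 8) elements. */
  while (p < end) ssum += *p++;
  return ssum;
}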
