Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?


Question

I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:

#include <inttypes.h>
#include <stdlib.h>

#include <sys/mman.h>

int main()
{
  uint32_t sum = 0;
  uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  uint16_t *p = (uint16_t *)(buffer + 1);   // deliberately misaligned by one byte
  int i;

  for (i=0;i<14;++i) {
    //printf("%d\n", i);
    sum += p[i];
  }

  return sum;
}

This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.

If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.

Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?

Solution

gcc4.8 makes a prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.

I don't think gcc ever intended to support misaligned pointers on x86, it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".


Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).


gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double is padded out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.

But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.

(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for the malloc case.)
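
For illustration, a minimal sketch of handing gcc that alignment information yourself, using the real __builtin_assume_aligned builtin; the wrapper name map_readonly_pages is hypothetical and error handling for MAP_FAILED is omitted:

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: promise gcc that the mapping is page-aligned, so the
   auto-vectorizer doesn't have to fall back to its 2-byte-alignment guess. */
static inline void *map_readonly_pages(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Returns p unchanged, but tells the compiler (with no runtime check)
       that the result is at least 4096-byte aligned. */
    return __builtin_assume_aligned(p, 4096);
}

With the buffer's alignment known, gcc could see that buffer + 1 is misaligned and would presumably fall back to movdqu, much as it does in the malloc case below.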


"on a CPU that is able to access unaligned"

SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned, so you have the unusual situation where no array element starts at a 16-byte boundary: p = buffer + 1 is odd, and adding a multiple of 2 keeps it odd, so the loop can never reach a 16-byte-aligned address.

SSE2 is baseline for x86-64, so gcc uses it.


Ubuntu 14.04 LTS uses gcc4.8.2 (off topic: it is old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)

14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.

I put your code on Godbolt, and this is the relevant part of main:

    call    mmap    #
    lea     rdi, [rax+1]      # p,
    mov     rdx, rax  # buffer,
    mov     rax, rdi  # D.2507, p
    and     eax, 15   # D.2507,
    shr     rax        ##### rax>>=1 discards the low bit, assuming it's zero
    neg     rax       # D.2507
    mov     esi, eax  # prolog_loop_niters.7, D.2507
    and     esi, 7    # prolog_loop_niters.7,
    je      .L2
    # .L2 leads directly to a MOVDQA xmm2, [rdx+1]

It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop, i.e. gcc doesn't have a code path to handle the case where p is odd.


But the code-gen for malloc looks like this:

    call    malloc  #
    movzx   edx, WORD PTR [rax+17]        # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
    movzx   ecx, WORD PTR [rax+27]        # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
    movdqu  xmm2, XMMWORD PTR [rax+1]   # tmp91, MEM[(uint16_t *)buffer_5 + 1B]

Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done SIMD, and the remaining 6 with scalar. This is a missed-optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.
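
For illustration, a rough hand-written sketch of that fuller vectorization with SSE2 intrinsics (the helper name sum14_u16 is made up; this is not gcc's output): one movdqu load covers 8 elements, a movq load covers 4 more, both are widened to uint32_t by unpacking with zero before adding, and the last two elements are done scalar.

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <string.h>

// Sum the 14 uint16_t elements starting at buffer + 1.
static uint32_t sum14_u16(const uint8_t *buffer)
{
    const uint8_t *p = buffer + 1;                 // misaligned start, as in the question
    const __m128i zero = _mm_setzero_si128();

    __m128i v8 = _mm_loadu_si128((const __m128i *)p);         // elements 0..7  (movdqu)
    __m128i v4 = _mm_loadl_epi64((const __m128i *)(p + 16));  // elements 8..11 (movq), upper half zeroed

    __m128i acc = _mm_add_epi32(_mm_unpacklo_epi16(v8, zero),   // elements 0..3 as uint32_t
                                _mm_unpackhi_epi16(v8, zero));  // elements 4..7 as uint32_t
    acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(v4, zero));     // elements 8..11 as uint32_t

    // horizontal sum of the four dword lanes
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(acc);

    // elements 12 and 13: scalar, with memcpy for the unaligned accesses
    uint16_t tmp;
    memcpy(&tmp, p + 24, sizeof tmp); sum += tmp;
    memcpy(&tmp, p + 26, sizeof tmp); sum += tmp;
    return sum;
}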

(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)


Safe code with unaligned pointers:

If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries, so it will use unaligned loads.

memcpy is how you express an unaligned load / store in ISO C / C++.

#include <string.h>

int sum(int *p) {
    int sum=0;
    for (int i=0 ; i<10001 ; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];     // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}

With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop: (Godbolt compiler explorer)

.L4:    # gcc7.2 normal dereference
    add     eax, 1
    paddd   xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     ecx, eax
    ja      .L4

But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:

.L2:   # gcc7.2 memcpy for an unaligned pointer
    movdqu  xmm2, XMMWORD PTR [rdi]
    add     rdi, 16
    cmp     rax, rdi      # end_pointer != pointer
    paddd   xmm0, xmm2
    jne     .L2           # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(

    # hsum into EAX, then the final odd scalar element:
    add     eax, DWORD PTR [rdi+40000]   # this is how memcpy compiles for normal scalar code, too.

In the OP's case, simply arranging for pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (or for vectorized code the way gcc does it). It doesn't cost much extra memory or space, and the data layout in memory isn't fixed, so you are free to choose an aligned offset.
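
Applied to the program in the question, that can be as simple as starting the uint16_t data at an even offset. A minimal sketch (the choice of buffer + 2 is just illustrative, and mmap error checking is still omitted):

#include <inttypes.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    uint32_t sum = 0;
    uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    // 2-byte aligned: no UB, and the 2-byte-alignment assumption in gcc's
    // vectorization prologue now actually holds
    uint16_t *p = (uint16_t *)(buffer + 2);

    for (int i = 0; i < 14; ++i)
        sum += p[i];

    return sum;
}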

But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type, i.e. it becomes just a load or store, with no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.

Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.

uint64_t tmp = 0; followed by a memcpy over the low 3 bytes compiles to an actual copy to memory and a reload, so that's not a good way to express zero-extension of odd-sized types, for example.
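
That is the kind of pattern meant here; a small sketch with a hypothetical helper name (load_u24_le), assuming a little-endian target like x86-64:

#include <stdint.h>
#include <string.h>

// Read a 3-byte little-endian value, zero-extended into a uint64_t.
// Correct, but gcc tends to compile the partial memcpy as a store to tmp's
// stack slot plus a reload, rather than a single zero-extending load.
static uint64_t load_u24_le(const unsigned char *p)
{
    uint64_t tmp = 0;
    memcpy(&tmp, p, 3);   // fills only the low 3 bytes of tmp
    return tmp;
}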
