Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?


Question

I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:

#include <inttypes.h>
#include <stdlib.h>

#include <sys/mman.h>

int main()
{
  uint32_t sum = 0;
  uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  uint16_t *p = (uint16_t *)(buffer + 1);   // deliberately misaligned by one byte
  int i;

  for (i=0;i<14;++i) {
    //printf("%d\n", i);
    sum += p[i];
  }

  return sum;
}

This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.

If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.

Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?

Solution

gcc4.8 makes a prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.

I don't think gcc ever intended to support misaligned pointers on x86, it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".


Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).


gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double is padded out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.

But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.

(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for the malloc case.)
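
For illustration, a minimal sketch of handing gcc that alignment information yourself, using the real __builtin_assume_aligned builtin; the wrapper name map_readonly_pages is hypothetical and error handling for MAP_FAILED is omitted:

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: promise gcc that the mapping is page-aligned, so the
   auto-vectorizer doesn't have to fall back to its 2-byte-alignment guess. */
static inline void *map_readonly_pages(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Returns p unchanged, but tells the compiler (with no runtime check)
       that the result is at least 4096-byte aligned. */
    return __builtin_assume_aligned(p, 4096);
}

With the buffer's alignment known, gcc could see that buffer + 1 is misaligned and would presumably fall back to movdqu, much as it does in the malloc case below.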


"on a CPU that is able to access unaligned"

SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned, so you have the unusual situation where no array element starts at a 16-byte boundary: p = buffer + 1 is odd, and adding a multiple of 2 keeps it odd, so the loop can never reach a 16-byte-aligned address.

SSE2 is baseline for x86-64, so gcc uses it.


Ubuntu 14.04 LTS uses gcc4.8.2 (off topic: it is old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)

14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.

I put your code on Godbolt, and this is the relevant part of main:

    call    mmap    #
    lea     rdi, [rax+1]      # p,
    mov     rdx, rax  # buffer,
    mov     rax, rdi  # D.2507, p
    and     eax, 15   # D.2507,
    shr     rax        ##### rax>>=1 discards the low bit, assuming it's zero
    neg     rax       # D.2507
    mov     esi, eax  # prolog_loop_niters.7, D.2507
    and     esi, 7    # prolog_loop_niters.7,
    je      .L2
    # .L2 leads directly to a MOVDQA xmm2, [rdx+1]

It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop, i.e. gcc doesn't have a code path to handle the case where p is odd.


But the code-gen for malloc looks like this:

    call    malloc  #
    movzx   edx, WORD PTR [rax+17]        # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
    movzx   ecx, WORD PTR [rax+27]        # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
    movdqu  xmm2, XMMWORD PTR [rax+1]   # tmp91, MEM[(uint16_t *)buffer_5 + 1B]

Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done SIMD, and the remaining 6 with scalar. This is a missed-optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.
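
For illustration, a rough hand-written sketch of that fuller vectorization with SSE2 intrinsics (the helper name sum14_u16 is made up; this is not gcc's output): one movdqu load covers 8 elements, a movq load covers 4 more, both are widened to uint32_t by unpacking with zero before adding, and the last two elements are done scalar.

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <string.h>

// Sum the 14 uint16_t elements starting at buffer + 1.
static uint32_t sum14_u16(const uint8_t *buffer)
{
    const uint8_t *p = buffer + 1;                 // misaligned start, as in the question
    const __m128i zero = _mm_setzero_si128();

    __m128i v8 = _mm_loadu_si128((const __m128i *)p);         // elements 0..7  (movdqu)
    __m128i v4 = _mm_loadl_epi64((const __m128i *)(p + 16));  // elements 8..11 (movq), upper half zeroed

    __m128i acc = _mm_add_epi32(_mm_unpacklo_epi16(v8, zero),   // elements 0..3 as uint32_t
                                _mm_unpackhi_epi16(v8, zero));  // elements 4..7 as uint32_t
    acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(v4, zero));     // elements 8..11 as uint32_t

    // horizontal sum of the four dword lanes
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(acc);

    // elements 12 and 13: scalar, with memcpy for the unaligned accesses
    uint16_t tmp;
    memcpy(&tmp, p + 24, sizeof tmp); sum += tmp;
    memcpy(&tmp, p + 26, sizeof tmp); sum += tmp;
    return sum;
}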

(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)


Safe code with unaligned pointers:

If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries, so it will use unaligned loads.

memcpy is how you express an unaligned load / store in ISO C / C++.

#include <string.h>

int sum(int *p) {
    int sum=0;
    for (int i=0 ; i<10001 ; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];     // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}

With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop: (Godbolt compiler explorer)

.L4:    # gcc7.2 normal dereference
    add     eax, 1
    paddd   xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     ecx, eax
    ja      .L4

But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:

.L2:   # gcc7.2 memcpy for an unaligned pointer
    movdqu  xmm2, XMMWORD PTR [rdi]
    add     rdi, 16
    cmp     rax, rdi      # end_pointer != pointer
    paddd   xmm0, xmm2
    jne     .L2           # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(

    # hsum into EAX, then the final odd scalar element:
    add     eax, DWORD PTR [rdi+40000]   # this is how memcpy compiles for normal scalar code, too.

In the OP's case, simply arranging for pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (or for vectorized code the way gcc does it). It doesn't cost much extra memory or space, and the data layout in memory isn't fixed, so you are free to choose an aligned offset.
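
Applied to the program in the question, that can be as simple as starting the uint16_t data at an even offset. A minimal sketch (the choice of buffer + 2 is just illustrative, and mmap error checking is still omitted):

#include <inttypes.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    uint32_t sum = 0;
    uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    // 2-byte aligned: no UB, and the 2-byte-alignment assumption in gcc's
    // vectorization prologue now actually holds
    uint16_t *p = (uint16_t *)(buffer + 2);

    for (int i = 0; i < 14; ++i)
        sum += p[i];

    return sum;
}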

But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type, i.e. it becomes just a load or store, with no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.

Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.

uint64_t tmp = 0; followed by a memcpy over the low 3 bytes compiles to an actual copy to memory and a reload, so that's not a good way to express zero-extension of odd-sized types, for example.
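
That is the kind of pattern meant here; a small sketch with a hypothetical helper name (load_u24_le), assuming a little-endian target like x86-64:

#include <stdint.h>
#include <string.h>

// Read a 3-byte little-endian value, zero-extended into a uint64_t.
// Correct, but gcc tends to compile the partial memcpy as a store to tmp's
// stack slot plus a reload, rather than a single zero-extending load.
static uint64_t load_u24_le(const unsigned char *p)
{
    uint64_t tmp = 0;
    memcpy(&tmp, p, 3);   // fills only the low 3 bytes of tmp
    return tmp;
}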
