Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?


Question

I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:

#include <inttypes.h>
#include <stdlib.h>

#include <sys/mman.h>

int main()
{
  uint32_t sum = 0;
  uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  uint16_t *p = (uint16_t *) (buffer + 1);
  int i;

  for (i=0;i<14;++i) {
    //printf("%d\n", i);
    sum += p[i];
  }

  return sum;
}

This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.

If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.

Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?

解决方案

Related: Pascal Cuoq's blog post shows a case where GCC assumes aligned pointers (that two int* don't partially overlap): GCC always assumes aligned pointer accesses. He also links to a 2016 blog post (A bug story: data alignment on x86) that has the exact same bug as this question: auto-vectorization with a misaligned pointer -> segfault.


gcc4.8 makes a loop prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.

I don't think gcc ever intended to support misaligned pointers on x86, it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".


Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).


gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.

But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.

(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except it wouldn't work because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)


on a CPU that is able to access unaligned

SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned so you have the unusual situation where no array element starts at a 16-byte boundary.

SSE2 is baseline for x86-64, so gcc uses it.


Ubuntu 14.04LTS uses gcc4.8.2 (Off topic: which is old and obsolete, worse code-gen in many cases than gcc5.4 or gcc6.4 especially when auto-vectorizing. It doesn't even recognize -march=haswell.)

14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.

I put your code on Godbolt, and this is the relevant part of main:

    call    mmap    #
    lea     rdi, [rax+1]      # p,
    mov     rdx, rax  # buffer,
    mov     rax, rdi  # D.2507, p
    and     eax, 15   # D.2507,
    shr     rax        ##### rax>>=1 discards the low bit, assuming it's zero
    neg     rax       # D.2507
    mov     esi, eax  # prolog_loop_niters.7, D.2507
    and     esi, 7    # prolog_loop_niters.7,
    je      .L2
    # .L2 leads directly to a MOVDQA xmm2, [rdx+1]

It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop. i.e. gcc doesn't have a code path to handle the case where p is odd.


But the code-gen for malloc looks like this:

    call    malloc  #
    movzx   edx, WORD PTR [rax+17]        # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
    movzx   ecx, WORD PTR [rax+27]        # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
    movdqu  xmm2, XMMWORD PTR [rax+1]   # tmp91, MEM[(uint16_t *)buffer_5 + 1B]

Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done SIMD, and the remaining 6 with scalar. This is a missed-optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.

(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)


Safe code with unaligned pointers:

If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries and will use unaligned loads.

memcpy is how you express an unaligned load / store in ISO C / C++.

#include <string.h>

int sum(int *p) {
    int sum=0;
    for (int i=0 ; i<10001 ; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];     // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}

With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop: (Godbolt compiler explorer)

.L4:    # gcc7.2 normal dereference
    add     eax, 1
    paddd   xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     ecx, eax
    ja      .L4

But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:

.L2:   # gcc7.2 memcpy for an unaligned pointer
    movdqu  xmm2, XMMWORD PTR [rdi]
    add     rdi, 16
    cmp     rax, rdi      # end_pointer != pointer
    paddd   xmm0, xmm2
    jne     .L2           # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(

    # hsum into EAX, then the final odd scalar element:
    add     eax, DWORD PTR [rdi+40000]   # this is how memcpy compiles for normal scalar code, too.

In the OP's case, simply arranging for pointers to be aligned is the better choice. It avoids cache-line splits for scalar code (and for code vectorized the way gcc does it), and it doesn't cost much extra memory or space, as long as the data layout in memory isn't fixed.

But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type: i.e. to just a load or store, with no function call and no bouncing through an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.

Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.

uint64_t tmp=0; and then memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types, for example.


GNU C __attribute__((aligned(1))) and may_alias

Instead of memcpy (which won't inline on some ISAs when GCC doesn't know the pointer is aligned, i.e. exactly this use-case), you can also use a typedef with a GCC attribute to make an under-aligned version of a type.

typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;

typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;

Related: Why does glibc's strlen need to be so complicated to run quickly? shows how to make a word-at-a-time bithack C strlen safe with this.

Note that it seems ICC doesn't respect __attribute__((may_alias)), but gcc/clang do. I was recently playing around with that trying to write a portable and safe 4-byte SIMD load like _mm_loadu_si32 (which GCC is missing). https://godbolt.org/z/ydMLCK has various combinations of safe everywhere but inefficient code-gen on some compilers, or unsafe on ICC but good everywhere.

aligned(1) may be less bad than memcpy on ISAs like MIPS where unaligned loads can't be done in one instruction.

You use it like any other pointer.

unaligned_aliasing_int *p = something;
int tmp = *p++;
int tmp2 = *p++;

And of course you can index it as normal like p[i].

