What optimizations does __builtin_unreachable facilitate?


Question

Judging from the GCC documentation:

If control flow reaches the point of the __builtin_unreachable, the program is undefined.

I thought __builtin_unreachable could be used as a hint to the optimizer in all sorts of creative ways, so I did a little experiment:

#include <utility>  // std::swap

void stdswap(int& x, int& y)
{
    std::swap(x, y);
}

void brswap(int& x, int& y)
{
    if(&x == &y)
        __builtin_unreachable();
    x ^= y;
    y ^= x;
    x ^= y;
}

void rswap(int& __restrict x, int& __restrict y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

This gets compiled to the following (g++ -O2):

stdswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        mov     edx, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], edx
        mov     DWORD PTR [rsi], eax
        ret
brswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], eax
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rsi], eax
        xor     DWORD PTR [rdi], eax
        ret
rswap(int&, int&):
        mov     eax, DWORD PTR [rsi]
        mov     edx, DWORD PTR [rdi]
        mov     DWORD PTR [rdi], eax
        mov     DWORD PTR [rsi], edx
        ret

I assume that stdswap and rswap are optimal from the optimizer's perspective. Why doesn't brswap get compiled to the same thing? Can I get it to compile to the same thing with __builtin_unreachable?

Recommended answer

The purpose of __builtin_unreachable is to help the compiler remove dead code (code that the programmer knows will never be executed) and to linearize the code by letting the compiler know that a path is "cold". Consider the following:

#include <cstdio>  // std::puts

void exit_if_true(bool x);  // defined elsewhere; does not return when x is true

int foo1(bool x)
{
    if (x) {
        exit_if_true(true);
        //__builtin_unreachable(); // we do not enable it here
    } else {
        std::puts("reachable");
    }

    return 0;
}
int foo2(bool x)
{
    if (x) {
        exit_if_true(true);
        __builtin_unreachable();  // now compiler knows exit_if_true
                                  // will not return as we are passing true to it
    } else {
        std::puts("reachable");
    }

    return 0;
}

Generated code:

foo1(bool):
        sub     rsp, 8
        test    dil, dil
        je      .L2              ; that jump is going to change
        mov     edi, 1
        call    exit_if_true(bool)
        xor     eax, eax         ; that tail is going to be removed
        add     rsp, 8
        ret
.L2:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
foo2(bool):
        sub     rsp, 8
        test    dil, dil
        jne     .L9              ; changed jump
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
.L9:
        mov     edi, 1
        call    exit_if_true(bool)

Note the differences:

  • xor eax, eax and ret were removed, as the compiler now knows that they are dead code.
  • The compiler swapped the order of the branches: the branch with the puts call now comes first, so that the conditional jump can be faster (forward branches that are not taken are faster both when predicted and when there is no prediction information).

The assumption here is that a branch that ends with a noreturn function call or __builtin_unreachable will either be executed only once or lead to a longjmp call or an exception throw, both of which are rare and do not need to be prioritized during optimization.
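
The same kind of hint can often be given without __builtin_unreachable by declaring the callee noreturn. The following is a minimal sketch, not taken from the original answer: fail_fast, foo3 and demo_foo3 are hypothetical names, and whether the failing branch is actually laid out as the cold path depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper (an assumption for illustration): it never returns,
// and the [[noreturn]] attribute states that to the compiler.
[[noreturn]] void fail_fast(const char* msg)
{
    std::fputs(msg, stderr);
    std::abort();
}

int foo3(bool x)
{
    if (x) {
        fail_fast("unexpected input\n");
        // No __builtin_unreachable() here: the attribute on fail_fast
        // already marks everything after the call as unreachable.
    } else {
        std::puts("reachable");
    }
    return 0;
}

// Usage check: the non-failing path returns normally.
void demo_foo3()
{
    assert(foo3(false) == 0);
}
```

With the attribute on the declaration, every call site gets the hint, so there is no need to repeat __builtin_unreachable after each call.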

You are trying to use it for a different purpose: giving the compiler information about aliasing (and you can try doing the same for alignment). Unfortunately, GCC doesn't understand such address checks.

As you have noticed, adding __restrict__ helps. So __restrict__ works for aliasing, while __builtin_unreachable does not.
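
To see why aliasing information matters for the swap in the question: when both references name the same object, the XOR sequence zeroes it, whereas a load/store swap leaves it unchanged. The compiler may therefore only turn the XOR sequence into plain moves if it knows that x and y are distinct. A small sketch (xor_swap and demo_alias are illustrative names, not from the original post):

```cpp
#include <cassert>
#include <utility>

// Same body as brswap/rswap from the question, without any hint.
void xor_swap(int& x, int& y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

void demo_alias()
{
    int a = 1, b = 2;
    xor_swap(a, b);      // distinct objects: behaves like a swap
    assert(a == 2 && b == 1);

    int c = 5;
    xor_swap(c, c);      // same object: the first c ^= c clears it
    assert(c == 0);

    int d = 5;
    std::swap(d, d);     // self-swap of an int leaves it unchanged
    assert(d == 5);
}
```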

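Consider the following example, which compares an alignment hint written with __builtin_unreachable against one written with __builtin_assume_aligned: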

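#include <cstdint>  // std::uintptr_t

void copy1(int *__restrict__ dst, const int *__restrict__ src)
{
    // Attempt to state "both pointers are 16-byte aligned":
    // the misaligned case is declared unreachable.
    if (reinterpret_cast<std::uintptr_t>(dst) % 16 != 0) __builtin_unreachable();
    if (reinterpret_cast<std::uintptr_t>(src) % 16 != 0) __builtin_unreachable();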

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

void copy2(int *__restrict__ dst, const int *__restrict__ src)
{
    dst = static_cast<int *>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int *>(__builtin_assume_aligned(src, 16));

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

Generated code:

copy1(int*, int const*):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movups  XMMWORD PTR [rdi], xmm0
        ret
copy2(int*, int const*):
        movdqa  xmm0, XMMWORD PTR [rsi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret

You might expect the compiler to deduce from the checks in copy1 that both pointers are 16-byte aligned, but it doesn't. So copy1 uses unaligned loads and stores (movdqu/movups), while copy2 gets the faster aligned instructions (movdqa/movaps), which require the address to be aligned.
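
Note that __builtin_assume_aligned is a promise, not a check: calling copy2 with a pointer that is not 16-byte aligned is undefined behavior. Below is a sketch of a caller that keeps the promise by using alignas. The names copy_aligned and demo_copy are illustrative; copy_aligned repeats the idea of copy2 so that the block compiles on its own, assuming GCC or Clang.

```cpp
#include <cassert>

// Same idea as copy2: the caller guarantees 16-byte alignment.
void copy_aligned(int* __restrict__ dst, const int* __restrict__ src)
{
    dst = static_cast<int*>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int*>(__builtin_assume_aligned(src, 16));
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
}

void demo_copy()
{
    // alignas(16) makes the alignment promise true for these buffers.
    alignas(16) int src[4] = {1, 2, 3, 4};
    alignas(16) int dst[4] = {};
    copy_aligned(dst, src);
    for (int i = 0; i < 4; ++i)
        assert(dst[i] == src[i]);
}
```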
