gcc vs clang: inlining a function with -fPIC


Question


Consider this code:

// foo.cxx
int last;

int next() {
  return ++last;
}

int index(int scale) {
  return next() << scale;
}

When compiling with gcc 7.2:

$ g++ -std=c++11 -O3 -fPIC

This emits:

next():
    movq    last@GOTPCREL(%rip), %rdx
    movl    (%rdx), %eax
    addl    $1, %eax
    movl    %eax, (%rdx)
    ret
index(int):
    pushq   %rbx
    movl    %edi, %ebx
    call    next()@PLT    ## next() not inlined, call through PLT
    movl    %ebx, %ecx
    sall    %cl, %eax
    popq    %rbx
    ret

However, when compiling the same code with the same flags using clang 3.9 instead:

next():                               # @next()
    movq    last@GOTPCREL(%rip), %rcx
    movl    (%rcx), %eax
    incl    %eax
    movl    %eax, (%rcx)
    retq

index(int):                              # @index(int)
    movq    last@GOTPCREL(%rip), %rcx
    movl    (%rcx), %eax
    incl    %eax              ## next() was inlined!
    movl    %eax, (%rcx)
    movl    %edi, %ecx
    shll    %cl, %eax
    retq

gcc calls next() via the PLT; clang inlines it. Both still look up last through the GOT. When compiling on Linux, is clang right to make that optimization and gcc missing out on easy inlining, is clang wrong to make it, or is this purely a QoI (quality of implementation) issue?

Answer

I don't think the standard goes into that much detail. Roughly, it only says that a symbol with external linkage names the same entity in every translation unit. By that measure, clang's version is correct.

From that point on, to the best of my knowledge, we're outside the standard. Compilers differ in what they consider useful -fPIC output.

Note that g++ -c -std=c++11 -O3 -fPIE outputs:

0000000000000000 <_Z4nextv>:
   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 <_Z4nextv+0x6>
   6:   83 c0 01                add    $0x1,%eax
   9:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # f <_Z4nextv+0xf>
   f:   c3                      retq   

0000000000000010 <_Z5indexi>:
  10:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 16 <_Z5indexi+0x6>
  16:   89 f9                   mov    %edi,%ecx
  18:   83 c0 01                add    $0x1,%eax
  1b:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 21 <_Z5indexi+0x11>
  21:   d3 e0                   shl    %cl,%eax
  23:   c3                      retq

So GCC does know how to optimize this; it just chooses not to under -fPIC. But why? I can see only one explanation: to make it possible to override the symbol at dynamic link time and have the override take effect consistently, for callers inside the library as well as outside it. This technique is known as symbol interposition.

In a shared library, next is globally visible, so when index calls next, gcc has to consider the possibility that next will be interposed. That is why it goes through the PLT. With -fPIE, however, the code is destined for an executable, whose own definitions cannot be interposed, so gcc enables the optimization.

So is clang wrong? No. But gcc seems to provide better support for symbol interposition, which is handy for instrumenting code. The cost is some overhead if one builds an executable with -fPIC instead of -fPIE.
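To make interposition concrete, here is a minimal sketch of a replacement next() meant to be loaded ahead of the original. It assumes foo.cxx has been built into a shared library that a separate program links against; the file name interpose.cxx, the counter interposed_calls, and the names libinterpose.so and ./app are all hypothetical.

```cpp
// interpose.cxx (hypothetical file name): a replacement definition of next().
// Built into its own shared object and loaded ahead of the library that
// defines the original next(), this definition wins dynamic symbol resolution.
//
// Assumed build and run steps on Linux (libinterpose.so and ./app are
// placeholder names, not taken from the question):
//   g++ -std=c++11 -O3 -fPIC -shared interpose.cxx -o libinterpose.so
//   LD_PRELOAD=./libinterpose.so ./app

int interposed_calls = 0;  // hypothetical counter, useful for instrumentation

// Same name and signature as the library's next(), so it mangles to the
// same symbol (_Z4nextv) and can take its place.
int next() {
  ++interposed_calls;
  return 1000;  // arbitrary value, easy to recognize in the program's output
}
```

Under these assumptions, a library built with gcc -fPIC should pick up the replacement inside index() as well, because that call goes through the PLT. A library in which next() was inlined into index() keeps using the original there, while callers outside the library see the replacement; that is the inconsistency gcc's default avoids.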


Additional notes:

In a blog entry, one of the gcc developers mentions, near the end of the post:

While comparing some benchmarks to clang, I noticed that clang actually ignore ELF interposition rules. While it is bug, I decided to add -fno-semantic-interposition flag to GCC to get similar behaviour. If interposition is not desirable, ELF's official answer is to use hidden visibility and if the symbol needs to be exported define an alias. This is not always practical thing to do by hand.

Following that lead landed me on the x86-64 ABI spec. Section 3.5.5 does mandate that calls to globally visible functions go through the PLT (it goes as far as defining the exact instruction sequence to use, depending on the memory model).

So, although it does not violate the C++ standard, ignoring semantic interposition seems to violate the ABI.


Last word: didn't know where to put this, but it might be of interest to you. I'll spare you the dumps, but my tests with objdump and compiler options showed that:

On the gcc side of things:

  • gcc -fPIC: accesses to last go through the GOT, calls to next() go through the PLT.
  • gcc -fPIC -fno-semantic-interposition: last goes through the GOT, next() is inlined.
  • gcc -fPIE: last is IP-relative, next() is inlined.
  • In other words, with gcc, -fPIE implies -fno-semantic-interposition.

On the clang side of things:

  • clang -fPIC: last goes through the GOT, next() is inlined.
  • clang -fPIE: last goes through the GOT, next() is inlined.

And here is a modified version that compiles to IP-relative accesses with next() inlined, on both compilers:

// foo.cxx
int last_ __attribute__((visibility("hidden")));
extern int last __attribute__((alias("last_")));

int __attribute__((visibility("hidden"))) next_()
{
  return ++last_;
}
// This one is ugly, because alias needs the mangled name. Could extern "C" next_ instead.
extern int next() __attribute__((alias("_Z5next_v")));

int index(int scale) {
  return next_() << scale;
}

Basically, this states explicitly that although the symbols are exported globally, the library itself uses their hidden versions, which are immune to any kind of interposition. Both compilers then fully optimize the accesses, regardless of the options passed.
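The comment in the code above suggests extern "C" as a way to avoid spelling out the mangled name. A sketch of that variant follows, assuming GCC or clang targeting ELF (the attributes are GNU extensions, and alias is not available on every platform). Only the linkage of next_ and the alias target string differ from the version above.

```cpp
// foo.cxx, variant: next_ gets C linkage, so its symbol name is plain
// "next_" and the alias does not need the mangled name.
// Assumption: GCC or clang targeting ELF (for example Linux).

// Hidden definitions: not exported from the shared object, so they cannot
// be interposed and the compiler is free to inline or access them directly.
int last_ __attribute__((visibility("hidden")));

extern "C" __attribute__((visibility("hidden"))) int next_()
{
  return ++last_;
}

// Exported names, defined as aliases of the hidden symbols above.
extern int last __attribute__((alias("last_")));
extern int next() __attribute__((alias("next_")));

int index(int scale) {
  // Calls the hidden version directly, bypassing any interposed next().
  return next_() << scale;
}
```

The trade-off is that next_ no longer carries its parameter types in its symbol name, which is harmless here because it is hidden and never overloaded.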
