What is the performance penalty of C++11 thread_local variables in GCC 4.8?

Views: 143 · Posted: 2018/4/20 16:15:01 · Tags: c++ linux multithreading gcc c++11


Question

From the GCC 4.8 draft changelog:

G++ now implements the C++11 thread_local keyword; this differs from the GNU __thread keyword primarily in that it allows dynamic initialization and destruction semantics. Unfortunately, this support requires a run-time penalty for references to non-function-local thread_local variables even if they don't need dynamic initialization, so users may want to continue to use __thread for TLS variables with static initialization semantics.

What precisely is the nature and origin of this run-time penalty?

Obviously to support non-function-local thread_local variables there needs to be a thread initialization phase before the entry to every thread main (just as there is a static initialization phase for global variables), but are they referring to some run-time penalty beyond that?

Roughly speaking, what is the architecture of GCC's new implementation of thread_local?

Answer

(Disclaimer: I don't know much about the internals of GCC, so this is also an educated guess.)

The dynamic thread_local initialization was added in commit 462819c. One of the changes is:

* semantics.c (finish_id_expression): Replace use of thread_local
variable with a call to its wrapper.

So the run-time penalty is that every reference to such a thread_local variable becomes a function call. Let's check with a simple test case:

    // 3.cpp

    extern thread_local int tls;

    int main() {
        tls += 37;   // line 6
        tls &= 11;   // line 7
        tls ^= 3;    // line 8
        return 0;
    }

    // 4.cpp

    thread_local int tls = 42;

When compiled*, we see that every use of tls becomes a function call to _ZTW3tls, which lazily initializes the variable once:

    00000000004005b0 <main>:
    main():
      4005b0: 55                          push   rbp
      4005b1: 48 89 e5                    mov    rbp,rsp
      4005b4: e8 26 00 00 00              call   4005df <_ZTW3tls>    // line 6
      4005b9: 8b 10                       mov    edx,DWORD PTR [rax]
      4005bb: 83 c2 25                    add    edx,0x25
      4005be: 89 10                       mov    DWORD PTR [rax],edx
      4005c0: e8 1a 00 00 00              call   4005df <_ZTW3tls>    // line 7
      4005c5: 8b 10                       mov    edx,DWORD PTR [rax]
      4005c7: 83 e2 0b                    and    edx,0xb
      4005ca: 89 10                       mov    DWORD PTR [rax],edx
      4005cc: e8 0e 00 00 00              call   4005df <_ZTW3tls>    // line 8
      4005d1: 8b 10                       mov    edx,DWORD PTR [rax]
      4005d3: 83 f2 03                    xor    edx,0x3
      4005d6: 89 10                       mov    DWORD PTR [rax],edx
      4005d8: b8 00 00 00 00              mov    eax,0x0              // line 9
      4005dd: 5d                          pop    rbp
      4005de: c3                          ret

    00000000004005df <_ZTW3tls>:
    _ZTW3tls():
      4005df: 55                          push   rbp
      4005e0: 48 89 e5                    mov    rbp,rsp
      4005e3: b8 00 00 00 00              mov    eax,0x0
      4005e8: 48 85 c0                    test   rax,rax
      4005eb: 74 05                       je     4005f2 <_ZTW3tls+0x13>
      4005ed: e8 0e fa bf ff              call   0 <tls>              // initialize the TLS
      4005f2: 64 48 8b 14 25 00 00 00 00  mov    rdx,QWORD PTR fs:0x0
      4005fb: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc
      400602: 48 01 d0                    add    rax,rdx
      400605: 5d                          pop    rbp
      400606: c3                          ret

Compare it with the __thread version, which has no such wrapper:

    00000000004005b0 <main>:
    main():
      4005b0: 55                          push   rbp
      4005b1: 48 89 e5                    mov    rbp,rsp
      4005b4: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc    // line 6
      4005bb: 64 8b 00                    mov    eax,DWORD PTR fs:[rax]
      4005be: 8d 50 25                    lea    edx,[rax+0x25]
      4005c1: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc
      4005c8: 64 89 10                    mov    DWORD PTR fs:[rax],edx
      4005cb: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc    // line 7
      4005d2: 64 8b 00                    mov    eax,DWORD PTR fs:[rax]
      4005d5: 89 c2                       mov    edx,eax
      4005d7: 83 e2 0b                    and    edx,0xb
      4005da: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc
      4005e1: 64 89 10                    mov    DWORD PTR fs:[rax],edx
      4005e4: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc    // line 8
      4005eb: 64 8b 00                    mov    eax,DWORD PTR fs:[rax]
      4005ee: 89 c2                       mov    edx,eax
      4005f0: 83 f2 03                    xor    edx,0x3
      4005f3: 48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc
      4005fa: 64 89 10                    mov    DWORD PTR fs:[rax],edx
      4005fd: b8 00 00 00 00              mov    eax,0x0                   // line 9
      400602: 5d                          pop    rbp
      400603: c3                          ret

This wrapper is not needed in every use case of thread_local, though. This can be seen in decl2.c. The wrapper is generated only when:

The variable is not function-local, and at least one of the following holds:

It is extern (the example shown above), or

Its type has a non-trivial destructor (which is not allowed for __thread variables), or

It is initialized by a non-constant expression (which is also not allowed for __thread variables).

In all other use cases, it behaves the same as __thread. That means that, unless you have some extern __thread variables, you can replace every __thread with thread_local without any loss of performance.

*: I compiled with -O0 because the inliner would make the function boundary less visible. Even at -O3, those initialization checks remain.
