What is the performance penalty of C++11 thread_local variables in GCC 4.8?
From the GCC 4.8 draft changelog:
G++ now implements the C++11 thread_local keyword; this differs from the GNU __thread keyword primarily in that it allows dynamic initialization and destruction semantics. Unfortunately, this support requires a run-time penalty for references to non-function-local thread_local variables even if they don't need dynamic initialization, so users may want to continue to use __thread for TLS variables with static initialization semantics.
What precisely is the nature and origin of this run-time penalty?
Obviously, to support non-function-local thread_local variables there needs to be a thread initialization phase before the entry to every thread main (just as there is a static initialization phase for global variables), but are they referring to some run-time penalty beyond that?
Roughly speaking, what is the architecture of gcc's new implementation of thread_local?
(Disclaimer: I don't know much about the internals of GCC, so this is also an educated guess.)
Dynamic thread_local initialization was added in commit 462819c. One of the changes is:
* semantics.c (finish_id_expression): Replace use of thread_local variable with a call to its wrapper.
So the run-time penalty is that every reference to the thread_local variable becomes a function call. Let's check with a simple test case:
// 3.cpp
extern thread_local int tls;
int main() {
tls += 37; // line 6
tls &= 11; // line 7
tls ^= 3; // line 8
return 0;
}
// 4.cpp
thread_local int tls = 42;
When compiled*, we see that every use of tls becomes a function call to _ZTW3tls, which lazily initializes the variable once:
00000000004005b0 <main>:
main():
4005b0: 55 push rbp
4005b1: 48 89 e5 mov rbp,rsp
4005b4: e8 26 00 00 00 call 4005df <_ZTW3tls> // line 6
4005b9: 8b 10 mov edx,DWORD PTR [rax]
4005bb: 83 c2 25 add edx,0x25
4005be: 89 10 mov DWORD PTR [rax],edx
4005c0: e8 1a 00 00 00 call 4005df <_ZTW3tls> // line 7
4005c5: 8b 10 mov edx,DWORD PTR [rax]
4005c7: 83 e2 0b and edx,0xb
4005ca: 89 10 mov DWORD PTR [rax],edx
4005cc: e8 0e 00 00 00 call 4005df <_ZTW3tls> // line 8
4005d1: 8b 10 mov edx,DWORD PTR [rax]
4005d3: 83 f2 03 xor edx,0x3
4005d6: 89 10 mov DWORD PTR [rax],edx
4005d8: b8 00 00 00 00 mov eax,0x0 // line 9
4005dd: 5d pop rbp
4005de: c3 ret
00000000004005df <_ZTW3tls>:
_ZTW3tls():
4005df: 55 push rbp
4005e0: 48 89 e5 mov rbp,rsp
4005e3: b8 00 00 00 00 mov eax,0x0
4005e8: 48 85 c0 test rax,rax
4005eb: 74 05 je 4005f2 <_ZTW3tls+0x13>
4005ed: e8 0e fa bf ff call 0 <tls> // initialize the TLS
4005f2: 64 48 8b 14 25 00 00 00 00 mov rdx,QWORD PTR fs:0x0
4005fb: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
400602: 48 01 d0 add rax,rdx
400605: 5d pop rbp
400606: c3 ret
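To make the disassembly easier to follow, here is a hand-written C++ model of what the wrapper appears to do. It is only a sketch based on my reading of the listing above: I am assuming the test against zero is a check for an optional init function that is absent when the defining translation unit needs no dynamic initialization. The names tls_storage, tls_init_fn, tls_wrapper and run_test_case are hypothetical stand-ins, not symbols that GCC emits.

```cpp
#include <cassert>

// Hand-written model of the wrapper. All names are hypothetical stand-ins.
thread_local int tls_storage = 42;  // models the TLS slot for `tls`

// Models the optional per-variable init function (assumption: it is absent,
// i.e. null, when no dynamic initialization is required).
void (*tls_init_fn)() = nullptr;

// Models _ZTW3tls: run the init function if one exists, then return a
// reference to the calling thread's copy of the variable.
int& tls_wrapper() {
    if (tls_init_fn != nullptr) {
        tls_init_fn();
    }
    return tls_storage;
}

// Models main() from 3.cpp: every use of `tls` goes through the wrapper.
int run_test_case() {
    tls_wrapper() += 37;  // 42 + 37 = 79
    tls_wrapper() &= 11;  // 79 & 11 = 11
    tls_wrapper() ^= 3;   // 11 ^ 3  = 8
    return tls_wrapper();
}
```

The point of the model is the shape of the cost: one call plus one branch per reference, even though nothing is ever initialized dynamically in this test case.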
Compare it with the __thread version, which doesn't have this extra wrapper:
00000000004005b0 <main>:
main():
4005b0: 55 push rbp
4005b1: 48 89 e5 mov rbp,rsp
4005b4: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc // line 6
4005bb: 64 8b 00 mov eax,DWORD PTR fs:[rax]
4005be: 8d 50 25 lea edx,[rax+0x25]
4005c1: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
4005c8: 64 89 10 mov DWORD PTR fs:[rax],edx
4005cb: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc // line 7
4005d2: 64 8b 00 mov eax,DWORD PTR fs:[rax]
4005d5: 89 c2 mov edx,eax
4005d7: 83 e2 0b and edx,0xb
4005da: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
4005e1: 64 89 10 mov DWORD PTR fs:[rax],edx
4005e4: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc // line 8
4005eb: 64 8b 00 mov eax,DWORD PTR fs:[rax]
4005ee: 89 c2 mov edx,eax
4005f0: 83 f2 03 xor edx,0x3
4005f3: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
4005fa: 64 89 10 mov DWORD PTR fs:[rax],edx
4005fd: b8 00 00 00 00 mov eax,0x0 // line 9
400602: 5d pop rbp
400603: c3 ret
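This wrapper is not needed in every use case of thread_local, though. This can be seen in decl2.c. The wrapper is generated only when the variable is not function-local, and:
- it is extern (the example shown above), or
- the type has a non-trivial destructor (which is not allowed for __thread variables), or
- the variable is initialized by a non-constant expression (which is also not allowed for __thread variables).
In all other use cases, it behaves the same as __thread. That means, unless you have some extern __thread variables, you could replace all __thread by thread_local without any loss of performance.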
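As an illustration of those conditions, the declarations below are labelled by which case they would fall under. This is a sketch of the rule as stated above, not something I verified against the generated code for each line; next_id, plain, defined_elsewhere, label, id, bump and sum_all are made-up names.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: a non-constant expression used as an initializer.
int next_id() {
    static int n = 0;
    return ++n;
}

// Namespace scope, constant initializer, trivial type, defined here:
// none of the three conditions applies, so it should behave like __thread.
thread_local int plain = 42;

// extern declaration: uses in this file would go through the wrapper.
// (It is never used here, so no definition is needed to link.)
extern thread_local int defined_elsewhere;

// Non-trivial destructor: wrapper case, and not expressible with __thread.
thread_local std::string label = "worker";

// Non-constant initializer: wrapper case, and not expressible with __thread.
thread_local int id = next_id();

// Function-local: outside the wrapper rule altogether.
int bump() {
    thread_local int calls = 0;
    return ++calls;
}

// Touches one variable from each namespace-scope category defined above.
int sum_all() {
    return plain + static_cast<int>(label.size()) + id;
}
```

Only the first declaration could have been written with __thread in the first place, which is why replacing existing non-extern __thread variables with thread_local should not cost anything.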
*: I compiled with -O0 because the inliner will make the function boundary less visible. Even if we turn up to -O3 those initialization checks still remain.
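If a wrapped variable is used in a hot path, one way to pay for the wrapper only once is to bind a local reference and work through that. This is a sketch under the assumption that the variable falls into one of the wrapper cases (here, a non-constant initializer); initial_value, counter, sum_direct and sum_cached are hypothetical names, and I have not measured the difference.

```cpp
#include <cassert>

// Hypothetical helper: makes the initializer a non-constant expression.
int initial_value() { return 100; }

// Namespace-scope thread_local with a dynamic initializer (a wrapper case).
thread_local int counter = initial_value();

// Each mention of `counter` is a separate reference to the TLS variable.
int sum_direct(int n) {
    for (int i = 0; i < n; ++i) {
        counter += i;
    }
    return counter;
}

// Bind the reference once, then reuse it inside the loop.
int sum_cached(int n) {
    int& local = counter;  // single reference to the TLS variable
    for (int i = 0; i < n; ++i) {
        local += i;
    }
    return local;
}
```

The reference is only valid on the thread that obtained it, so it should not be stored anywhere that outlives the call or is shared with other threads.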