What is the performance penalty of C++11 thread_local variables in GCC 4.8?
From the GCC 4.8 draft changelog:
G++ now implements the C++11
thread_local
keyword; this differs from the GNU__thread
keyword primarily in that it allows dynamic initialization and destruction semantics. Unfortunately, this support requires a run-time penalty for references to non-function-localthread_local
variables even if they don't need dynamic initialization, so users may want to continue to use__thread
for TLS variables with static initialization semantics.
What is precisely the nature and origin of this run-time penalty?
Obviously to support non-function-local thread_local
variables there needs to be a thread initialization phase before the entry to every thread main (just as there is a static initialization phase for global variables), but are they referring to some run-time penalty beyond that?
Roughly speaking what is the architecture of gcc's new implementation of thread_local?
(Disclaimer: I don't know much about the internals of GCC, so this is also an educated guess.)
The dynamic thread_local
initialization is added in commit 462819c. One of the change is:
* semantics.c (finish_id_expression): Replace use of thread_local
variable with a call to its wrapper.
So the run-time penalty is that, every reference of the thread_local
variable will become a function call. Let's check with a simple test case:
// 3.cpp
extern thread_local int tls;
int main() {
tls += 37; // line 6
tls &= 11; // line 7
tls ^= 3; // line 8
return 0;
}
// 4.cpp
thread_local int tls = 42;
When compiled*, we see that every use of the tls
reference becomes a function call to _ZTW3tls
, which lazily initialize the the variable once:
00000000004005b0 <main>:
main():
4005b0: 55 push rbp
4005b1: 48 89 e5 mov rbp,rsp
4005b4: e8 26 00 00 00 call 4005df <_ZTW3tls> // line 6
4005b9: 8b 10 mov edx,DWORD PTR [rax]
4005bb: 83 c2 25 add edx,0x25
4005be: 89 10 mov DWORD PTR [rax],edx
4005c0: e8 1a 00 00 00 call 4005df <_ZTW3tls> // line 7
4005c5: 8b 10 mov edx,DWORD PTR [rax]
4005c7: 83 e2 0b and edx,0xb
4005ca: 89 10 mov DWORD PTR [rax],edx
4005cc: e8 0e 00 00 00 call 4005df <_ZTW3tls> // line 8
4005d1: 8b 10 mov edx,DWORD PTR [rax]
4005d3: 83 f2 03 xor edx,0x3
4005d6: 89 10 mov DWORD PTR [rax],edx
4005d8: b8 00 00 00 00 mov eax,0x0 // line 9
4005dd: 5d pop rbp
4005de: c3 ret
00000000004005df <_ZTW3tls>:
_ZTW3tls():
4005df: 55 push rbp
4005e0: 48 89 e5 mov rbp,rsp
4005e3: b8 00 00 00 00 mov eax,0x0
4005e8: 48 85 c0 test rax,rax
4005eb: 74 05 je 4005f2 <_ZTW3tls+0x13>
4005ed: e8 0e fa bf ff call 0 <tls> // initialize the TLS
4005f2: 64 48 8b 14 25 00 00 00 00 mov rdx,QWORD PTR fs:0x0
4005fb: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
400602: 48 01 d0 add rax,rdx
400605: 5d pop rbp
400606: c3 ret
Compare it with the __thread
version, which won't have this extra wrapper:
00000000004005b0 <main>:
main():
4005b0: 55 push rbp
4005b1: 48 89 e5 mov rbp,rsp
4005b4: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc // line 6
4005bb: 64 8b 00 mov eax,DWORD PTR fs:[rax]
4005be: 8d 50 25 lea edx,[rax+0x25]
4005c1: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
4005c8: 64 89 10 mov DWORD PTR fs:[rax],edx
4005cb: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc // line 7
4005d2: 64 8b 00 mov eax,DWORD PTR fs:[rax]
4005d5: 89 c2 mov edx,eax
4005d7: 83 e2 0b and edx,0xb
4005da: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
4005e1: 64 89 10 mov DWORD PTR fs:[rax],edx
4005e4: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc // line 8
4005eb: 64 8b 00 mov eax,DWORD PTR fs:[rax]
4005ee: 89 c2 mov edx,eax
4005f0: 83 f2 03 xor edx,0x3
4005f3: 48 c7 c0 fc ff ff ff mov rax,0xfffffffffffffffc
4005fa: 64 89 10 mov DWORD PTR fs:[rax],edx
4005fd: b8 00 00 00 00 mov eax,0x0 // line 9
400602: 5d pop rbp
400603: c3 ret
This wrapper is not needed for in every use case of thread_local
though. This can be revealed from decl2.c
.
The wrapper is generated only when:
-
It is not function-local, and,
- It is
extern
(the example shown above), or - The type has a non-trivial destructor (which is not allowed for
__thread
variables), or - The type variable is initialized by a non-constant-expression (which is also not allowed for
__thread
variables).
- It is
In all other use cases, it behaves the same as __thread
. That means, unless you have some extern __thread
variables, you could replace all __thread
by thread_local
without any loss of performance.
*: I compiled with -O0 because the inliner will make the function boundary less visible. Even if we turn up to -O3 those initialization checks still remain.
C++11 thread_local has the same runtime effect as the __thread specifier (__thread
is not part of the C standard; thread_local
is part of the C++ standard)
it depends where the TLS variable (declared with __thread
specifier) is declared.
- if TLS variable is declared in an executable then access is fast
- if TLS variable is declared within shared library code (compiled with
-fPIC
compiler option) and-ftls-model=initial-exec
compiler option is specified then access is fast; however the following limitation applies: the shared library can't be loaded via dlopen/dlsym (dynamic loading), the only way of using the library is to link with it during compilation (linker option-l<libraryname>
) - if TLS variable is declared within a shared library (
-fPIC
compiler option set) then access is very slow, as the general dynamic TLS model is assumed - here each access to a TLS variable results in a call to_tls_get_addr()
; this is the default case because you are not limited in the way that the shared library is used.
Sources: ELF Handling For Thread-Local Storage by Ulrich Drepper https://www.akkadia.org/drepper/tls.pdf this text also lists the code that is generated for the supported target platforms.
If the variable is defined in the current TU, the inliner will take care of the overhead. I expect that this will be true of most uses of thread_local.
For extern variables, if the programmer can be sure that no use of the variable in a non-defining TU needs to trigger dynamic initialization (either because the variable is statically initialized, or a use of the variable in the defining TU will be executed before any uses in another TU), they can avoid this overhead with the -fno-extern-tls-init option.