Why is thread local storage so slow?


Question

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. Thread-local storage appears to be the bottleneck: allocating from these regions is much slower (~50%) than in an otherwise identical single-threaded version of the code, even after I designed the code to do only one TLS lookup per allocation/deallocation. This is based on allocating and freeing memory a large number of times in a loop, and I'm trying to figure out whether it's an artifact of my benchmarking method. My understanding is that thread-local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?

Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it turns out to be slower than the best implementations.
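
To make the "one TLS lookup per allocation" pattern concrete, here is a minimal sketch in C++ rather than D. The names `Region`, `tls_region`, `region_alloc`, `region_mark` and `region_release` are hypothetical, as are the region size and the 16-byte alignment; this is an illustration of the idea, not the allocator from the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size region; size and layout are illustrative only.
struct Region {
    alignas(16) unsigned char buffer[64 * 1024];
    std::size_t top;  // offset of the next free byte
};

// One region per thread. Objects with thread storage duration are
// zero-initialized, so `top` starts at 0 in every thread.
thread_local Region tls_region;

// Bump allocation with a single TLS access: take the region's address once,
// then work only through the local pointer.
void* region_alloc(std::size_t size) {
    Region* r = &tls_region;  // the only TLS lookup in this function
    std::size_t aligned = (size + 15) & ~std::size_t{15};
    // Reject arithmetic overflow and requests that do not fit.
    if (aligned < size || aligned > sizeof(r->buffer) - r->top) return nullptr;
    void* p = r->buffer + r->top;
    r->top += aligned;
    return p;
}

// Mark: remember the current top of this thread's region.
std::size_t region_mark() { return tls_region.top; }

// Release: drop everything this thread allocated since the mark.
void region_release(std::size_t mark) {
    Region* r = &tls_region;
    assert(mark <= r->top);
    r->top = mark;
}
```

With this shape, any remaining slowdown relative to a plain global region would come from how the implementation resolves `&tls_region`, which is what the question is asking about.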

Recommended answer

The speed depends on the TLS implementation.

Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.

The pointer lookup needs help from the scheduler, though: on every task switch, the scheduler must update the pointer to the TLS data.
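
As a rough illustration of the fast case, C++ `thread_local` variables are usually served this way when the platform cooperates: the compiler emits an access relative to a per-thread base pointer that the OS keeps current. The sketch below only shows the observable behaviour (each thread sees its own copy); the helper names `bump_local` and `bump_on_new_thread` are made up for this example, and how cheap the access really is depends on the compiler, linker and OS.

```cpp
#include <cassert>
#include <thread>

thread_local int tls_counter = 0;  // each thread gets its own copy

// Increment the calling thread's copy n times and return its value.
int bump_local(int n) {
    assert(n >= 0);
    int* c = &tls_counter;  // resolve the thread-local address once
    for (int i = 0; i < n; ++i) ++*c;
    return *c;
}

// Run bump_local on a fresh thread and report what that thread saw.
int bump_on_new_thread(int n) {
    int result = -1;
    std::thread t([&] { result = bump_local(n); });
    t.join();
    return result;
}
```

The new thread starts from its own zero-initialized counter, so its result is independent of whatever the calling thread has already counted.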

Another fast way to implement TLS is via the memory management unit. Here TLS data is treated like any other data, except that TLS variables are allocated in a special segment. On a task switch, the scheduler maps the correct chunk of memory into the address space of the task.

If the scheduler does not support either of these methods, the compiler/library has to do the following on each access:

  • Get the current ThreadId
  • Take the semaphore
  • Look up the pointer to the TLS block by ThreadId (possibly using a map)
  • Release the semaphore
  • Return that pointer.

Obviously, doing all this for each TLS data access takes a while and may need up to three OS calls: getting the ThreadId, taking the semaphore, and releasing it.

The semaphore is required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and, as part of that, allocating a new TLS block and modifying the data structure).
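
The slow path described above can be sketched roughly as follows. This is an emulation in portable C++, not the code of any particular runtime: `TlsBlock`, `g_table`, `g_table_lock` and `slow_tls_lookup` are hypothetical names, a `std::mutex` stands in for the semaphore, and the block is created lazily on first access instead of at thread spawn to keep the example short.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical per-thread block; a real runtime would size it from the
// program's thread-local data.
struct TlsBlock {
    int value = 0;
};

std::mutex g_table_lock;  // stands in for the semaphore
std::unordered_map<std::thread::id, std::unique_ptr<TlsBlock>> g_table;

// Every access pays for: thread id query, lock, map lookup, unlock.
TlsBlock* slow_tls_lookup() {
    std::thread::id self = std::this_thread::get_id();  // get the thread id
    std::lock_guard<std::mutex> guard(g_table_lock);    // take the lock
    std::unique_ptr<TlsBlock>& slot = g_table[self];    // look up by thread id
    if (!slot) slot = std::make_unique<TlsBlock>();     // simplification: lazy creation
    assert(slot);
    return slot.get();  // the lock is released when `guard` goes out of scope
}
```

Compared with a single load through a per-thread base pointer, each call here can involve a lock acquisition and a hash lookup, which is consistent with the kind of slowdown reported in the question.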

Unfortunately, it's not uncommon to see the slow TLS implementation in practice.
