boost::thread data structure sizes on the ridiculous side?


Problem description

Compiler: clang++ x86-64 on linux.

It has been a while since I have written any intricate low-level system code, and I usually program against the system primitives (Windows and pthreads/POSIX). So the ins and outs have slipped from my memory. I am working with boost::asio and boost::thread at the moment.

In order to emulate synchronous RPC against an asynchronous function executor (a boost::io_service with multiple threads io_service::run'ing it, where requests are io_service::post'ed), I am using Boost synchronization primitives. For curiosity's sake I decided to sizeof the primitives. This is what I get to see.

#include <iostream>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>

struct notification_object
{
  bool ready;
  boost::mutex m;
  boost::condition_variable v;
};
...
std::cout << sizeof(bool) << std::endl;
std::cout << sizeof(boost::mutex) << std::endl;
std::cout << sizeof(boost::condition_variable) << std::endl;
std::cout << sizeof(notification_object) << std::endl;
...

Output:

1
40
88
136

Forty bytes for a mutex?? WTF! 88 for a condition_variable! Please keep in mind that I'm repulsed by this bloated size because I am thinking of an application that could create hundreds of notification_objects.

This level of overhead for portability seems ridiculous; can someone justify it? As far as I can remember, these primitives should be 4 or 8 bytes wide, depending on the memory model of the CPU.
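
For reference, this is roughly the pattern each notification_object is used for - block the posting thread until a worker thread has run the handler. The run_sync helper below is hypothetical (not part of the real code) and assumes a Boost version that still provides boost::asio::io_service:

#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <boost/function.hpp>

struct notification_object
{
  bool ready;
  boost::mutex m;
  boost::condition_variable v;
};

// Hypothetical helper: post 'work' to the io_service and block the caller
// until one of the io_service::run threads has executed it.
void run_sync(boost::asio::io_service& io, boost::function<void()> work)
{
  notification_object note;
  note.ready = false;

  io.post([&work, &note]() {
    work();                                     // the actual "RPC" body
    boost::lock_guard<boost::mutex> lk(note.m);
    note.ready = true;
    note.v.notify_one();                        // signal completion
  });

  boost::unique_lock<boost::mutex> lk(note.m);
  while (!note.ready)                           // wait until the handler ran
    note.v.wait(lk);
}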

Solution

When you look at "size overhead" for any type of synchronization primitive, keep in mind that these cannot be packed too closely. That is because, for example, two mutexes sharing a cacheline would end up causing cache thrashing (false sharing) if they're in use concurrently, even if the users acquiring these locks never "conflict". I.e. imagine two threads running two loops:

for (;;) {
    lock(lockA);
    unlock(lockA);
}

and

for (;;) {
    lock(lockB);
    unlock(lockB);
}

You will see twice the number of iterations when the loops run on two different threads, compared to one thread running one loop, if and only if the two locks are not within the same cacheline. If lockA and lockB are in the same cacheline, the number of iterations per thread will halve - because the cacheline holding those two locks will permanently bounce between the CPU cores executing these two threads.

Hence even though the actual data size of the primitive data type underlying a spinlock or mutex might only be a byte or a 32bit word, the effective data size of such an object is often larger.
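
A minimal, self-contained version of that two-loop experiment, with plain C++11 atomics standing in for the locks (the 64-byte line size and the iteration count are assumptions):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

// A trivial spinlock: the "actual data" is a single machine word.
struct spinlock
{
  std::atomic<int> flag;
  spinlock() : flag(0) {}
  void lock()   { while (flag.exchange(1, std::memory_order_acquire)) {} }
  void unlock() { flag.store(0, std::memory_order_release); }
};

// Two locks that almost certainly share a cacheline ...
struct packed_pair { spinlock a; spinlock b; };
// ... and two forced onto separate (assumed 64-byte) lines.
struct padded_pair { alignas(64) spinlock a; alignas(64) spinlock b; };

template <class Pair>
double time_pair(Pair& p)
{
  const long iterations = 20000000;
  auto spin = [iterations](spinlock& s) {
    for (long i = 0; i < iterations; ++i) { s.lock(); s.unlock(); }
  };
  auto t0 = std::chrono::steady_clock::now();
  std::thread t1(spin, std::ref(p.a));
  std::thread t2(spin, std::ref(p.b));
  t1.join();
  t2.join();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main()
{
  packed_pair packed;
  padded_pair padded;
  // Per the reasoning above, the padded pair should take roughly half as long.
  std::printf("same cacheline:      %.2f s\n", time_pair(packed));
  std::printf("separate cachelines: %.2f s\n", time_pair(padded));
}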

Keep that in mind before asserting "my mutexes are too large". In fact, on x86/x64, 40 bytes is too small to prevent false sharing, as cachelines there are currently at least 64 bytes.

Beyond that, if you're highly concerned about memory usage, consider that notification objects need not be unique - condition variables can be used to signal different events (via the predicate that boost::condition_variable knows about). It'd therefore be possible to use a single mutex/CV pair for a whole state machine instead of one such pair per state. The same goes for e.g. thread pool synchronization - having more locks than threads is not necessarily beneficial.
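
A sketch of that idea: one mutex/CV pair serving every transition of a hypothetical three-state job (the state names and the job type are made up for the example):

#include <boost/thread.hpp>

enum job_state { pending, running, finished };   // hypothetical states

struct job
{
  job_state state;
  boost::mutex m;                  // one mutex ...
  boost::condition_variable cv;    // ... and one CV for all transitions

  job() : state(pending) {}

  void advance(job_state s)
  {
    boost::lock_guard<boost::mutex> lk(m);
    state = s;
    cv.notify_all();               // wake all waiters; their predicates filter
  }

  void wait_until(job_state s)
  {
    boost::unique_lock<boost::mutex> lk(m);
    while (state != s)             // the predicate distinguishes the events
      cv.wait(lk);
  }
};

One 128-byte mutex/CV pair then replaces a separate notification_object per state, at the cost of occasionally waking waiters whose predicate is not yet satisfied.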

Edit: For a few more references on "false sharing" (and the negative performance impact caused by hosting multiple atomically-updated variables within the same cacheline), see (amongst others) related SO postings on the topic.

As said, when using multiple "synchronization objects" (whether they are atomically-updated variables, locks, semaphores, ...) in a multi-core, cache-per-core configuration, allow each of them a separate cacheline of space. You're trading memory usage for scalability here, but really, if you get into the region where your software needs several million locks (making that GBs of memory), you either have the funding for a few hundred GB of memory (and a hundred CPU cores), or you're doing something wrong in your software design.
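
If you do want to give each lock its own cacheline explicitly, an aligned wrapper is enough; the 64 below is an assumption (C++17's std::hardware_destructive_interference_size could replace the constant):

#include <boost/thread/mutex.hpp>

// sizeof is rounded up to a multiple of the alignment, so each wrapper
// occupies one full (assumed 64-byte) cacheline.
struct alignas(64) padded_mutex
{
  boost::mutex m;
};

static_assert(sizeof(padded_mutex) % 64 == 0,
              "padded_mutex spans whole cachelines");

padded_mutex bucket_locks[128];   // 8 KiB instead of ~5 KiB, but no lock-on-lock false sharing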

In most cases (a lock / an atomic for a specific instance of a class / struct), you get the "padding" for free as long as the object instance that contains the atomic variable is large enough.

