Why does the slim reader/writer exclusive lock outperform the shared one?


Question


I have tested the performance of the slim reader/writer lock under Windows 7 using the code from Windows via C/C++.

The result surprised me: the exclusive lock outperforms the shared one. Here are the code and the results.

#include <windows.h>
#include <process.h>

// Globals shared by both threads (the commented-out local declaration
// below would have given each thread its own lock instead).
SRWLOCK srwLock = SRWLOCK_INIT;
int volatile g_value;

unsigned int __stdcall slim_reader_writer_exclusive(void *arg)
{
    //SRWLOCK srwLock;
    //InitializeSRWLock(&srwLock);

    for (int i = 0; i < 1000000; ++i) {
        AcquireSRWLockExclusive(&srwLock);
        g_value = 0;
        ReleaseSRWLockExclusive(&srwLock);
    }
    _endthreadex(0);
    return 0;
}

unsigned int __stdcall slim_reader_writer_shared(void *arg)
{
    int b;
    for (int i = 0; i < 1000000; ++i) {
        AcquireSRWLockShared(&srwLock);
        //b = g_value;
        g_value = 0;
        ReleaseSRWLockShared(&srwLock);
    }
    _endthreadex(0);
    return 0;
}

g_value is a global int volatile variable.

Could you kindly explain why this could happen?

Solution

This is a pretty common result for small general-purpose locks (like SRWLocks, which are only one pointer in size).

Key Takeaway: If you have an extremely small guarded section of code, such that the overhead of the lock itself might be dominant, an exclusive lock is better to use than a shared lock.

Also, Raymond Chen's argument about the contention on g_value holds as well. If g_value were read instead of written in both cases, you might notice a benefit for the shared lock.

Details:

The SRW lock is implemented using a single pointer-sized atomic variable which can take on a number of different states, depending on the values of the low bits. The description of the way these bits are used is out of scope for this comment--the number of state transitions is pretty high--so, I'll mention only a few states that you may be encountering in your test.

Initial lock state: (0, ControlBits:0) -- An SRW lock starts with all bits set to 0.

Shared state: (ShareCount: n, ControlBits: 1) -- When there is no conflicting exclusive acquire and the lock is held shared, the share count is stored directly in the lock variable.

Exclusive state: (ShareCount: 0, ControlBits: 1) -- When there is no conflicting shared acquire or exclusive acquire, the lock has a low bit set and nothing else.

Example contended state: (WaitPtr:ptr, ControlBits: 3) -- When there is a conflict, the threads that are waiting for the lock form a queue using data allocated on the waiting threads' stacks. The lock variable stores a pointer to the tail of the queue instead of a share count.

In this scheme, trying to acquire an exclusive lock when you don't know the initial state is a single write to the lock word, setting the low bit and retrieving the old value (on x86 this can be done with a LOCK BTS instruction). If it succeeds (as it always will in the single-threaded case), you can proceed into the locked region with no further operations.

Trying to acquire a shared lock is a more involved operation: you need to first read the initial value of the lock variable to determine the old share count, increment the share count you read, and then conditionally write the updated value back with a LOCK CMPXCHG instruction. This is a noticeably longer chain of serially-dependent instructions, so it is slower. Also, CMPXCHG is a bit slower on many processors than unconditional atomic instructions like LOCK BTS.

It would be possible in theory to speed up the first shared acquire of the lock by assuming that the lock is in its initial state and performing the LOCK CMPXCHG first. This would speed up the initial shared acquire (all of them in your single-threaded case), but it would significantly slow down the cases where the lock is already held shared and a second shared acquire occurs.

A similar set of divergent operations occurs when the lock is being released, so the extra cost of managing the shared state is also paid on the ReleaseSRWLockShared side.
