Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

Question

I wrote some lock-free code that works fine with local reads, under most conditions.

Does local spinning on a memory read necessarily imply I have to ALWAYS insert a memory barrier before the spinning read?

(To validate this, I managed to produce a reader/writer combination which results in a reader never seeing the written value, under certain very specific conditions--dedicated CPU, process attached to CPU, optimizer turned all the way up, no other work done in the loop--so the arrows do point in that direction, but I'm not entirely sure about the cost of spinning through a memory barrier.)
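
(For concreteness, here is a hypothetical reconstruction of that kind of reader/writer pair -- the actual test code is not shown here, and the name "flag" is made up. With no volatile and no barrier, the optimizer is free to hoist the load out of the loop, which is one way to end up with a reader that never sees the write:)

int flag = 0;                /* shared; no volatile, no barriers */

void reader( void ) {
    while ( flag == 0 )      /* load may be hoisted out of the loop, */
        ;                    /* so the reader spins on a stale value */
}

void writer( void ) {
    flag = 1;                /* plain store; lingers in the store
                                buffer until the CPU drains it */
}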

What is the cost of spinning through a memory barrier if there is nothing to be flushed in the cache's store buffer? i.e., all the process is doing (in C) is

volatile int value;          /* shared flag, written by another thread */
int v;

while ( 1 ) {
    __sync_synchronize();    /* full memory barrier before every read */
    v = value;
    if ( v != 0 ) {
        ... something ...
    }
}

Am I correct to assume that it's free and it won't encumber the memory bus with any traffic?

Another way to put this is to ask: does a memory barrier do anything more than: flush the store buffer, apply the invalidations to it, and prevent the compiler from reordering reads/writes across its location?

Disassembling, __sync_synchronize() appears to translate into:

lock orl
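
(A minimal sketch of how to reproduce that disassembly, assuming GCC on x86; the file and function names are made up:)

/* barrier.c -- build and inspect with:
 *   gcc -O2 -c barrier.c && objdump -d barrier.o
 * Depending on GCC version and target, __sync_synchronize() shows up
 * as an "mfence" or as a LOCK'ed or-to-stack like the one above. */
void barrier( void ) {
    __sync_synchronize();
}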

From the Intel manual (similarly nebulous for the neophyte):

Volume 3A: System Programming Guide, Part 1 -- 8.1.2 Bus Locking

Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.

[...]

For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, "Effects of a LOCK Operation on
Internal Processor Caches").

My translation: "when you say LOCK, this would be expensive, but we're only doing it where necessary."

@BlankXavier:

I did test that if the writer does not explicitly push the write out of the store buffer, and it is the only process running on that CPU, the reader may never see the effect of the writer. I can reproduce it with a test program, but as I mentioned above, it happens only with a specific test, specific compilation options, and dedicated core assignments. My algorithm works fine; it was only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road.

I think by default simple writes are WB writes (Write Back), which means they don't get flushed out immediately, but reads will take their most recent value (I think they call that "store forwarding"). So I use a CAS instruction for the writer. I discovered all these different types of write implementations (UC, WC, WT, WB, WP) in the Intel manual, vol. 3A, chap. 11-10, and am still learning about them.
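
(A minimal sketch of such a writer, using the GCC __sync builtins; the name "publish" and the retry loop are illustrative. A __sync CAS compiles to a LOCK'ed cmpxchg, which forces the store out of the store buffer:)

volatile int value = 0;      /* the shared flag from the reader loop */

void publish( int new_val ) {
    int old;
    do {
        old = value;         /* snapshot the current value */
    } while ( !__sync_bool_compare_and_swap( &value, old, new_val ) );
}                            /* LOCK cmpxchg: globally visible on success */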

My uncertainty is on the reader's side: I understand from McKenney's paper that there is also an invalidation queue, a queue of incoming invalidations from the bus into the cache. I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, and using volatile only to ensure the optimizer leaves the read in once compiled) will check the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough -- i.e., it could read an old cache line that still appears valid while an invalidation for it is queued (that sounds a bit incoherent to me too, but then how do invalidation queues work?) -- then an atomic read would be necessary, and my question is: in that case, will this have any impact on the bus? (I think probably not.)

I'm still reading my way through the Intel manual, and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment; I think this is the best way to really get a feel for how this works.

Answer

"xchg reg,[mem]"指令将通过内核的LOCK引脚发出其锁定意图的信号.该信号绕过其他内核,并向下缓存到总线主控总线(PCI变体等),总线将完成其工作,最终LOCKA(确认)引脚将向CPU发出信号,通知xchg可能已完成.然后,LOCK信号被关闭.该序列可能需要很长时间(数百个CPU周期或更多)来完成.之后,其他核心的相应缓存行将失效,并且您将拥有一个已知状态,即该状态在核心之间已经同步.

The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i e one that has ben synchronized between the cores.

The xchg instruction is all that is necessary to implement an atomic lock. If the lock itself is successful, you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function, or what have you. Still, it is always up to the programmer to write code that uses this resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible, so that other code is hindered as little as possible from acquiring access to the resource.

Keep in mind that if the lock wasn't successful you need to try again by issuing a new xchg.
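
(A minimal sketch of such a retry loop -- a test-and-test-and-set spinlock built on the GCC __sync builtins, which compile to an xchg on x86. The function names are illustrative:)

void spin_lock( volatile int *l ) {
    while ( __sync_lock_test_and_set( l, 1 ) ) { /* xchg; returns old value  */
        while ( *l )                             /* spin on plain reads      */
            ;                                    /* until the lock looks free */
    }
}

void spin_unlock( volatile int *l ) {
    __sync_lock_release( l );                    /* store 0, release semantics */
}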

无锁"是一个吸引人的概念,但它需要消除共享资源.如果您的应用程序有两个或多个内核同时读取和写入公共内存地址,则无锁"是不可行的.

"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.
