Why does __sync_add_and_fetch work for a 64-bit variable on a 32-bit system?

Question

Consider the following condensed code:

/* Compile: gcc -pthread -m32 -ansi x.c */
#include <stdio.h>
#include <inttypes.h>
#include <pthread.h>

static volatile uint64_t v = 0;

void *func (void *x) {
    __sync_add_and_fetch (&v, 1);
    return x;
}

int main (void) {
    pthread_t t;
    pthread_create (&t, NULL, func, NULL);
    pthread_join (t, NULL);
    printf ("v = %"PRIu64"\n", v);
    return 0;
}

I have a uint64_t variable that I want to increment atomically, because it is a counter in a multi-threaded program. To achieve atomicity I use GCC's atomic builtins.

If I compile for an amd64 system (-m64), the generated assembly is easy to understand. By using lock addq, the processor guarantees that the increment is atomic.

 400660:       f0 48 83 05 d7 09 20    lock addq $0x1,0x2009d7(%rip)

But the same C code produces much more complicated assembly on an ia32 system (-m32):

804855a:       a1 28 a0 04 08          mov    0x804a028,%eax
804855f:       8b 15 2c a0 04 08       mov    0x804a02c,%edx
8048565:       89 c1                   mov    %eax,%ecx
8048567:       89 d3                   mov    %edx,%ebx
8048569:       83 c1 01                add    $0x1,%ecx
804856c:       83 d3 00                adc    $0x0,%ebx
804856f:       89 ce                   mov    %ecx,%esi
8048571:       89 d9                   mov    %ebx,%ecx
8048573:       89 f3                   mov    %esi,%ebx
8048575:       f0 0f c7 0d 28 a0 04    lock cmpxchg8b 0x804a028
804857c:       08 
804857d:       75 e6                   jne    8048565 <func+0x15>

Here is what I do not understand:

  • lock cmpxchg8b does guarantee that the new value is written only if the expected value still resides at the target address. The compare-and-swap is guaranteed to happen atomically.
  • But what guarantees that the reads of the variable at 0x804855a and 0x804855f are atomic?

A "dirty read" probably does not matter, but could someone please outline a short proof that there is no problem?

Further: why does the generated code jump back to 0x8048565 and not 0x804855a? I am positive that this is only correct if other writers, too, only increment the variable. Is this an implied requirement of the __sync_add_and_fetch function?

Recommended answer

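The initial read with 2 separate mov instructions is not atomic, but it's not in the loop. @interjay's answer explains why this is fine. In short, the value read there only serves as the first guess for the comparison: if it is torn or stale, lock cmpxchg8b fails the compare, loads the real current value into EDX:EAX, and the loop retries from that.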
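
To see why a torn initial read is harmless, here is a minimal single-threaded sketch. It deliberately starts the loop from a bogus expected value, as if the two mov loads had seen halves of different values. The helper name add_from_stale and its attempts out-parameter are assumptions made up for illustration; only __atomic_compare_exchange_n is a real GCC builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds 1 to *p, starting the CAS loop from a
 * caller-supplied (possibly torn or stale) expected value.
 * *attempts receives the number of compare-and-swap tries. */
static uint64_t add_from_stale(uint64_t *p, uint64_t stale, int *attempts)
{
    uint64_t expected = stale;

    assert(p != NULL && attempts != NULL);
    *attempts = 0;
    for (;;) {
        ++*attempts;
        /* Strong CAS (weak = 0). On failure the builtin stores the
         * current *p into expected, much like cmpxchg8b reloading
         * EDX:EAX, so no separate reload is needed. */
        if (__atomic_compare_exchange_n(p, &expected, expected + 1, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return expected + 1;
    }
}
```

Starting with a counter of 0x1FFFFFFFF and a stale guess of 0xFFFFFFFF, the first attempt fails and corrects the guess, and the second attempt succeeds and stores 0x200000000.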

Fun fact: the read done by cmpxchg8b would be atomic even without a lock prefix. (But this code does use a lock prefix to make the entire RMW operation atomic, rather than a separate atomic load and atomic store.)

It's guaranteed to be atomic because it is correctly aligned (and fits in one cache line) and because Intel wrote the spec this way; see the Intel Architecture manual Vol 1, 4.4.1:

A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.

Vol 3A 8.1.1:

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

• Reading or writing a quadword aligned on a 64-bit boundary

• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Thus, because the variable is aligned, it can be read in one memory bus cycle, and it fits into one cache line, which makes cmpxchg8b's read atomic.
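
The two conditions from the manual can be checked with simple address arithmetic. This is a minimal sketch: the helper names are made up for illustration, and the 64-byte cache line size is an assumption that is typical for current x86 CPUs, not something the architecture guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes (typical, not architecturally fixed). */
#define ASSUMED_CACHE_LINE 64u

/* Nonzero if addr is a multiple of 8, i.e. a naturally aligned quadword. */
static int is_qword_aligned(uintptr_t addr)
{
    return (addr % 8u) == 0;
}

/* Nonzero if the size-byte object at addr lies within one cache line,
 * i.e. its first and last byte share the same line index. */
static int fits_in_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= ASSUMED_CACHE_LINE);
    return (addr / ASSUMED_CACHE_LINE)
        == ((addr + size - 1) / ASSUMED_CACHE_LINE);
}
```

For the address 0x804a028 from the disassembly above, both checks pass, so the variable falls under the Pentium guarantee for aligned quadwords. An 8-byte object at 0x3c would fail both checks.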

If the data had been misaligned, the lock prefix would still make the operation atomic, but the performance cost would be very high, because a simple cache lock (delaying the response to MESI Invalidate requests for that one cache line) would no longer be sufficient.

The code jumps back to 0x8048565 (just after the two mov loads, so the loop contains only the copy and the add-1) because v has already been loaded; there is no need to load it again, as CMPXCHG8B sets EDX:EAX to the value in the destination if it fails:

CMPXCHG8B description from the Intel ISA manual Vol. 2A:

Compare EDX:EAX with m64. If equal, set ZF and load ECX:EBX into m64. Else, clear ZF and load m64 into EDX:EAX.
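
The quoted description can be written out as a plain C model. This is only a sketch of the instruction's visible effects: the struct regs type and the function name are made up for illustration, and unlike the real instruction with a lock prefix, nothing here is atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative register file holding only what cmpxchg8b touches. */
struct regs {
    uint32_t eax, ebx, ecx, edx;
    int zf;
};

/* Non-atomic model of "cmpxchg8b m64". */
static void model_cmpxchg8b(uint64_t *m64, struct regs *r)
{
    uint64_t expected, desired;

    assert(m64 != NULL && r != NULL);
    expected = ((uint64_t)r->edx << 32) | r->eax;   /* EDX:EAX */
    desired  = ((uint64_t)r->ecx << 32) | r->ebx;   /* ECX:EBX */

    if (*m64 == expected) {
        r->zf = 1;              /* success: store the new value */
        *m64 = desired;
    } else {
        r->zf = 0;              /* failure: report the current value */
        r->eax = (uint32_t)(*m64 & 0xFFFFFFFFu);
        r->edx = (uint32_t)(*m64 >> 32);
    }
}
```

After a failed call, EDX:EAX already holds the current memory value, which is why the compiled loop can branch back past the two mov loads.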

Thus the code only needs to increment the newly returned value and try again. This becomes easier to see in C code:

value = dest;                    // non-atomic but usually won't tear
while(!CAS8B(&dest,value,value + 1))
{
    value = dest;                // atomic; part of lock cmpxchg8b
}

The value = dest inside the loop is actually the same read that cmpxchg8b used for the compare part. There isn't a separate reload inside the loop.

In fact, C11 atomic_compare_exchange_weak / _strong has this behaviour built in: it updates the "expected" operand.
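
As a minimal sketch of that behaviour, an add-and-fetch can be built on the C11 call. The helper name add_and_fetch_c11 is hypothetical, and the code needs a C11 compiler (for example -std=c11) rather than the -ansi mode used in the question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomic add-and-fetch built on a C11 CAS loop. */
static uint64_t add_and_fetch_c11(_Atomic uint64_t *p, uint64_t n)
{
    uint64_t expected;

    assert(p != NULL);
    expected = atomic_load(p);
    /* On failure, including a spurious failure of the weak form,
     * expected is overwritten with the current value of *p. */
    while (!atomic_compare_exchange_weak(p, &expected, expected + n)) {
        /* nothing to reload: expected is already up to date */
    }
    return expected + n;
}
```

The loop body is empty because the failed call has already refreshed expected, which mirrors the compiled loop that never re-executes the initial loads.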

So does GCC's modern builtin __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder) - it takes the expected value by reference.

With GCC's older, obsolete __sync builtins, __sync_val_compare_and_swap returns the old value (whereas __sync_bool_compare_and_swap returns a boolean swapped / didn't-swap result).
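
With the value-returning form, the same retry loop can be written by hand. This is a sketch of the pattern only, not GCC's actual implementation of __sync_add_and_fetch; the helper name add_and_fetch_sync is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add-and-fetch via __sync_val_compare_and_swap. */
static uint64_t add_and_fetch_sync(volatile uint64_t *p, uint64_t n)
{
    uint64_t expected, seen;

    assert(p != NULL);
    expected = *p;      /* plain read; may tear on ia32, loop corrects it */
    for (;;) {
        /* Returns the value that was in *p before the operation. */
        seen = __sync_val_compare_and_swap(p, expected, expected + n);
        if (seen == expected)
            return expected + n;    /* swap happened */
        expected = seen;            /* retry from the value actually seen */
    }
}
```

Comparing the returned old value against expected plays the role of the ZF test, and assigning it back plays the role of cmpxchg8b reloading EDX:EAX.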
