C++11 on modern Intel: am I crazy or are non-atomic aligned 64-bit load/store actually atomic?

Problem Description

Can I base a mission-critical application on the results of this test, that 100 threads reading a pointer set a billion times by a main thread never see a tear?

Any other potential problems doing this besides tearing?

Here's a stand-alone demo that compiles with g++ -g tear.cxx -o tear -pthread.

#include <atomic>
#include <cstdint>   // intptr_t
#include <cstdio>    // printf
#include <thread>
#include <vector>

using namespace std;

void* pvTearTest;
atomic<int> iTears( 0 );

void TearTest( void ) {

  while (1) {
      void* pv = (void*) pvTearTest;

      intptr_t i = (intptr_t) pv;

      if ( ( i >> 32 ) != ( i & 0xFFFFFFFF ) ) {
          printf( "tear: pv = %p\n", pv );
          iTears++;
      }
      if ( ( i >> 32 ) == 999999999 )
          break;

  }
}



int main( int argc, char** argv ) {

  printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );

  vector<thread> athr;

  // Create lots of threads and have them do the test simultaneously.

  for ( int i = 0; i < 100; i++ )
      athr.emplace_back( TearTest );

  for ( int i = 0; i < 1000000000; i++ )
      pvTearTest = (void*) (intptr_t)
                   ( ( i % (1L<<32) ) * 0x100000001 );

  for ( auto& thr: athr )
      thr.join();

  if ( iTears )
      printf( "%d tears\n", iTears.load() );
  else
      printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}

The actual application is a malloc()'ed and sometimes realloc()'d array (size is power of two; realloc doubles storage) that many child threads will absolutely be hammering in a mission-critical but also high-performance-critical way.

From time to time a thread will need to add a new entry to the array, and will do so by setting the next array entry to point to something, then incrementing an atomic<int> iCount. Finally it will add data to some data structures that will cause other threads to attempt to dereference that cell.
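The publication scheme described above can be sketched as follows. This is a minimal, single-writer sketch with assumed names (`Table`, `apSlots`, `Append`, `Get` are illustrative, not from the original post); the key point is the release store on the count paired with an acquire load by readers:

```cpp
#include <atomic>
#include <cstddef>

// Single-writer sketch: write the new slot first, then publish it by
// bumping the count with a release store.  Readers pair that with an
// acquire load of the count, which makes the slot write visible to them.
struct Table {
    void*            apSlots[1024];  // fixed capacity, for the sketch only
    std::atomic<int> iCount{0};

    // Called by the one writer thread only.
    void Append( void* pv ) {
        int i = iCount.load( std::memory_order_relaxed );
        apSlots[i] = pv;                                   // plain write to the new slot
        iCount.store( i + 1, std::memory_order_release );  // publish: slot write ordered before
    }

    // Called by any reader thread.
    void* Get( int i ) {
        int n = iCount.load( std::memory_order_acquire );
        return ( i < n ) ? apSlots[i] : nullptr;
    }
};
```

With this pairing, a reader that observes the incremented count is guaranteed to also observe the slot contents, which addresses the "is the increment assured of happening before/after the non-atomic update" worry in the text.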

It all seems fine (except I'm not positive the increment of the count is guaranteed to happen before the following non-atomic updates)... except for one thing: realloc() will typically change the address of the array, and then free the old one, whose pointer is still visible to other threads.

OK, so instead of realloc(), I malloc() a new array, manually copy the contents, and set the pointer to the new array. I would free the old array, but I realize other threads may still be accessing it: they read the array base; I free the base; a third thread allocates it and writes something else there; the first thread then adds the indexed offset to the base and expects a valid pointer. I'm happy to leak those, though. (Given the doubling growth, all old arrays combined are about the same size as the current array, so the overhead is simply an extra 16 bytes per item, and it's memory that is soon never referenced again.)
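The grow-and-leak scheme can be sketched like this, assuming a single grower thread and an atomic base pointer (the names `Grow`, `g_ppBase`, `g_nCap` are illustrative assumptions, not from the original post):

```cpp
#include <atomic>
#include <cstdlib>
#include <cstring>

// Sketch of growing by doubling without freeing the old array.
// Only one thread ever calls Grow(); readers load g_ppBase with acquire.
static std::atomic<void**> g_ppBase{nullptr};
static size_t              g_nCap = 0;   // touched only by the grower thread

void Grow() {
    void** ppOld = g_ppBase.load( std::memory_order_relaxed );
    size_t nNew  = g_nCap ? g_nCap * 2 : 16;
    void** ppNew = (void**) calloc( nNew, sizeof(void*) );
    if ( ppOld )
        memcpy( ppNew, ppOld, g_nCap * sizeof(void*) );  // copy existing entries
    g_ppBase.store( ppNew, std::memory_order_release );  // publish the new base
    g_nCap = nNew;
    // ppOld is deliberately NOT freed: a lagging reader may still hold it,
    // and every old array remains valid forever (the accepted leak).
}
```

A reader that does `void** pp = g_ppBase.load(std::memory_order_acquire);` and then indexes into `pp` always sees a fully copied, valid array, whether it got the old or the new base.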

So, here's the crux of the question: once I allocate the bigger array, can I write its base address with a non-atomic write, in utter safety? Or, despite my billion-access test, do I actually have to make it atomic<> and thus slow all worker threads that read it?

(As this is surely environment dependent, we're talking 2012-or-later Intel, g++ 4 to 9, and Red Hat of 2012 or later.)

Here is a modified test program that matches my planned scenario much more closely, with only a small number of writes. I've also added a count of the reads. I see that when switching from void* to atomic<void*> I go from 2240 reads/sec to 660 reads/sec (with optimization disabled). The machine language for the read is shown after the source.

#include <atomic>
#include <chrono>
#include <cstdint>   // intptr_t, uint64_t
#include <cstdio>    // printf
#include <thread>
#include <vector>

using namespace std;

chrono::time_point<chrono::high_resolution_clock> tp1, tp2;

// void*: 1169.093u 0.027s 2:26.75 796.6% 0+0k 0+0io 0pf+0w
// atomic<void*>: 6656.864u 0.348s 13:56.18 796.1%        0+0k 0+0io 0pf+0w

// Different definitions of the target variable.
atomic<void*> pvTearTest;
//void* pvTearTest;

// Children sum the tears they find, and at end, total checks performed.
atomic<int> iTears( 0 );
atomic<uint64_t> iReads( 0 );

bool bEnd = false; // main thr sets true; children all finish.

void TearTest( void ) {

  uint64_t i;
  for ( i = 0; ! bEnd; i++ ) {

      intptr_t iTearTest = (intptr_t) (void*) pvTearTest;

      // Make sure top 4 and bottom 4 bytes are the same.  If not it's a tear.
      if ( ( iTearTest >> 32 ) != ( iTearTest & 0xFFFFFFFF ) ) {
          printf( "tear: pv = %ux\n", iTearTest );
          iTears++;
      }

      // Output periodically to prove we're seeing changing values.
      if ( ( (i+1) % 50000000 ) == 0 )
          printf( "got: pv = %lx\n", iTearTest );
  }

  iReads += i;
}



int main( int argc, char** argv ) {

  printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );

  vector<thread> athr;

  // Create lots of threads and have them do the test simultaneously.

  for ( int i = 0; i < 100; i++ )
      athr.emplace_back( TearTest );

  tp1 = chrono::high_resolution_clock::now();

#if 0
  // Change target as fast as possible for fixed number of updates.
  for ( int i = 0; i < 1000000000; i++ )
      pvTearTest = (void*) (intptr_t)
                   ( ( i % (1L<<32) ) * 0x100000001 );
#else
  // More like our actual app: change target only periodically, for fixed time.
  for ( int i = 0; i < 100; i++ ) {
      pvTearTest.store( (void*) (intptr_t) ( ( i % (1L<<32) ) * 0x100000001 ),
                        std::memory_order_release );

      this_thread::sleep_for(10ms);
  }
#endif

  bEnd = true;

  for ( auto& thr: athr )
      thr.join();

  tp2 = chrono::high_resolution_clock::now();

  chrono::duration<double> dur = tp2 - tp1;
  printf( "%ld reads in %.4f secs: %.2f reads/usec\n",
          iReads.load(), dur.count(), iReads.load() / dur.count() / 1000000 );

  if ( iTears )
      printf( "%d tears\n", iTears.load() );
  else
      printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}

Dump of assembler code for function TearTest():
   0x0000000000401256 <+0>:     push   %rbp
   0x0000000000401257 <+1>:     mov    %rsp,%rbp
   0x000000000040125a <+4>:     sub    $0x10,%rsp
   0x000000000040125e <+8>:     movq   $0x0,-0x8(%rbp)
   0x0000000000401266 <+16>:    movzbl 0x6e83(%rip),%eax        # 0x4080f0 <bEnd>
   0x000000000040126d <+23>:    test   %al,%al
   0x000000000040126f <+25>:    jne    0x40130c <TearTest()+182>
=> 0x0000000000401275 <+31>:    mov    $0x4080d8,%edi
   0x000000000040127a <+36>:    callq  0x40193a <std::atomic<void*>::operator void*() const>
   0x000000000040127f <+41>:    mov    %rax,-0x10(%rbp)
   0x0000000000401283 <+45>:    mov    -0x10(%rbp),%rax
   0x0000000000401287 <+49>:    sar    $0x20,%rax
   0x000000000040128b <+53>:    mov    -0x10(%rbp),%rdx
   0x000000000040128f <+57>:    mov    %edx,%edx
   0x0000000000401291 <+59>:    cmp    %rdx,%rax
   0x0000000000401294 <+62>:    je     0x4012bb <TearTest()+101>
   0x0000000000401296 <+64>:    mov    -0x10(%rbp),%rax
   0x000000000040129a <+68>:    mov    %rax,%rsi
   0x000000000040129d <+71>:    mov    $0x40401a,%edi
   0x00000000004012a2 <+76>:    mov    $0x0,%eax
   0x00000000004012a7 <+81>:    callq  0x401040 <printf@plt>
   0x00000000004012ac <+86>:    mov    $0x0,%esi
   0x00000000004012b1 <+91>:    mov    $0x4080e0,%edi
   0x00000000004012b6 <+96>:    callq  0x401954 <std::__atomic_base<int>::operator++(int)>
   0x00000000004012bb <+101>:   mov    -0x8(%rbp),%rax
   0x00000000004012bf <+105>:   lea    0x1(%rax),%rcx
   0x00000000004012c3 <+109>:   movabs $0xabcc77118461cefd,%rdx
   0x00000000004012cd <+119>:   mov    %rcx,%rax
   0x00000000004012d0 <+122>:   mul    %rdx
   0x00000000004012d3 <+125>:   mov    %rdx,%rax
   0x00000000004012d6 <+128>:   shr    $0x19,%rax
   0x00000000004012da <+132>:   imul   $0x2faf080,%rax,%rax
   0x00000000004012e1 <+139>:   sub    %rax,%rcx
   0x00000000004012e4 <+142>:   mov    %rcx,%rax
   0x00000000004012e7 <+145>:   test   %rax,%rax
   0x00000000004012ea <+148>:   jne    0x401302 <TearTest()+172>
   0x00000000004012ec <+150>:   mov    -0x10(%rbp),%rax
   0x00000000004012f0 <+154>:   mov    %rax,%rsi
   0x00000000004012f3 <+157>:   mov    $0x40402a,%edi
   0x00000000004012f8 <+162>:   mov    $0x0,%eax
   0x00000000004012fd <+167>:   callq  0x401040 <printf@plt>
   0x0000000000401302 <+172>:   addq   $0x1,-0x8(%rbp)
   0x0000000000401307 <+177>:   jmpq   0x401266 <TearTest()+16>
   0x000000000040130c <+182>:   mov    -0x8(%rbp),%rax
   0x0000000000401310 <+186>:   mov    %rax,%rsi
   0x0000000000401313 <+189>:   mov    $0x4080e8,%edi
   0x0000000000401318 <+194>:   callq  0x401984 <std::__atomic_base<unsigned long>::operator+=(unsigned long)>
   0x000000000040131d <+199>:   nop
   0x000000000040131e <+200>:   leaveq
   0x000000000040131f <+201>:   retq

Answer

Yes, on x86 aligned loads are atomic, BUT this is an architectural detail that you should NOT rely on!

Since you are writing C++ code, you have to abide by the rules of the C++ standard, i.e., you have to use atomics instead of volatile. The fact that volatile has been part of that language long before the introduction of threads in C++11 should be a strong enough indication that volatile was never designed or intended to be used for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C# (in these languages volatile is in fact related to the memory model and therefore much more like an atomic in C++).

In C++, volatile is used for what is often referred to as "unusual memory". This is typically memory that can be read or modified outside the current process, for example when using memory mapped I/O. volatile forces the compiler to execute all operations in the exact order as specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:

volatile int x;
         int y;
volatile int z;

x = 1;
y = 2;
z = 3;
z = 4;

...

int a = x;
int b = x;
int c = y;
int d = z;

In this example, there are two assignments to z, and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to treat the first store as irrelevant and simply remove it. Likewise it could just reuse the value returned by the first load of x, effectively generating code like int b = a. But since x and z are volatile, these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in the exact order as specified, i.e., the volatile operations cannot be reordered with respect to each other. However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down - something that would not be possible if x and z were atomics. So if you were to try implementing a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section.

Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
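The "volatile atomic" combination mentioned above looks like this; a minimal sketch, with the variable name assumed (real uses would be device registers or signal-handler-shared memory):

```cpp
#include <atomic>

// For "unusual memory" that is ALSO shared between threads, combine both
// qualifiers.  volatile keeps the compiler from eliding or coalescing the
// accesses; atomic keeps concurrent access race-free and ordered.
// std::atomic's load/store have volatile-qualified overloads, so this works.
volatile std::atomic<int> g_flag{0};

int read_flag() { return g_flag.load( std::memory_order_acquire ); }
void set_flag( int v ) { g_flag.store( v, std::memory_order_release ); }
```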

Since aligned loads are actually atomic on x86, the compiler will translate an atomic.load() call to a simple mov instruction, so an atomic load is not slower than reading a volatile variable. An atomic.store() is actually slower than writing a volatile variable, but for good reasons, since in contrast to the volatile write it is by default sequentially consistent. You can relax the memory orders, but you really have to know what you are doing!!
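The difference between the default store and a relaxed one can be seen in a small sketch (function names are illustrative). On x86, both loads typically compile to a plain `mov`, the release store to a plain `mov`, and the seq_cst store to an `xchg` (an implicit full barrier), which is where the slowdown comes from:

```cpp
#include <atomic>

std::atomic<void*> g_p{nullptr};

// Default store: memory_order_seq_cst -- on x86 typically an xchg.
void store_seq_cst( void* pv ) { g_p.store( pv ); }

// Release store: on x86 typically a plain mov, same cost as a volatile
// write, but with defined cross-thread ordering semantics.
void store_release( void* pv ) { g_p.store( pv, std::memory_order_release ); }

// Acquire load: on x86 a plain mov.
void* load_acquire() { return g_p.load( std::memory_order_acquire ); }
```

For the pointer-publication pattern in the question, a release store paired with acquire loads is sufficient, so the seq_cst cost can be avoided.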

If you want to learn more about the C++ memory model, I can recommend this paper: Memory Models for C/C++ Programmers
