Why does false sharing still affect non-atomics, but much less than atomics?
Problem description
Consider the following example that demonstrates the existence of false sharing:
```cpp
using type = std::atomic<std::int64_t>;

struct alignas(128) shared_t
{
    type a;
    type b;
} sh;

struct not_shared_t
{
    alignas(128) type a;
    alignas(128) type b;
} not_sh;
```
One thread increments `a` in steps of 1, and another thread increments `b`. The increments compile to `lock xadd` with MSVC, even though the result is unused.
For the structure where `a` and `b` are separated, the value accumulated in a few seconds is about ten times greater for `not_shared_t` than for `shared_t`.
So far, the expected result: separate cache lines stay hot in L1d cache, the increment bottlenecks on `lock xadd` throughput, and false sharing is a performance disaster, ping-ponging the cache line. (Editor's note: later MSVC versions use `lock inc` when optimization is enabled. This may widen the gap between the contended and uncontended cases.)
Now I'm replacing `using type = std::atomic<std::int64_t>;` with plain `std::int64_t`. (The non-atomic increment compiles to `inc QWORD PTR [rcx]`. The atomic load of `stop` in the loop happens to prevent the compiler from just keeping the counter in a register until loop exit.)
The count reached for `not_shared_t` is still greater than for `shared_t`, but now less than twice as great.
| type is | variables are | a= | b= |
|---------------------------|---------------|-------------|-------------|
| std::atomic<std::int64_t> | shared | 59’052’951| 59’052’951|
| std::atomic<std::int64_t> | not_shared | 417’814’523| 416’544’755|
| std::int64_t | shared | 949’827’195| 917’110’420|
| std::int64_t | not_shared |1’440’054’733|1’439’309’339|
Why is the non-atomic case so much closer in performance?
Here is the rest of the program, to complete the minimal reproducible example. (Also on Godbolt with MSVC, ready to compile and run.)
```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

std::atomic<bool> start, stop;

void thd(type* var)
{
    while (!start) ;
    while (!stop) (*var)++;
}

int main()
{
    std::thread threads[] = {
        std::thread( thd, &sh.a ),     std::thread( thd, &sh.b ),
        std::thread( thd, &not_sh.a ), std::thread( thd, &not_sh.b ),
    };
    start.store(true);
    std::this_thread::sleep_for(std::chrono::seconds(2));
    stop.store(true);
    for (auto& thd : threads) thd.join();
    std::cout
        << "    shared: " << sh.a << ' ' << sh.b << '\n'
        << "not shared: " << not_sh.a << ' ' << not_sh.b << '\n';
}
```
Non-atomic memory increments can benefit from store-forwarding when reloading their own stored value. This can happen even while the cache line is invalid. The core knows that the store will happen eventually, and the memory-ordering rules allow this core to see its own stores before they become globally visible.
Store-forwarding gives you a store-buffer's length worth of increments before you stall, instead of needing exclusive access to the cache line to do an atomic RMW increment.
When this core does eventually gain ownership of the cache line, it can commit multiple stores at 1/clock. This is 6x faster than the dependency chain created by a memory-destination increment: ~5 cycles of store/reload latency + 1 cycle of ALU latency. So in the non-atomic case, execution is only putting new stores into the store buffer at 1/6th of the rate it can drain while the core owns the line. This is why the gap between shared and not-shared is much smaller for the non-atomic increments.
There are certainly going to be some memory-ordering machine clears, too; that and/or a full store buffer are the likely reasons for lower throughput in the false-sharing case. See the answers and comments on *What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?* for another experiment somewhat like this one.
A `lock inc` or `lock xadd` forces the store buffer to drain before the operation, and includes committing to L1d cache as part of the operation. That makes store-forwarding impossible, and the RMW can only happen once the cache line is held in the Exclusive or Modified MESI state.
Related:

- Size of store buffers on Intel hardware? What exactly is a store buffer?
- Can modern x86 implementations store-forward from more than one prior store? (No, but the details there may help you understand exactly what store buffers do and how store-forwarding works for this case, where the reload exactly overlaps the store.)