Why memory_order_relaxed performance is the same as memory_order_seq_cst

Question

I've created a simple test to check whether incrementing an atomic<int> with std::memory_order_relaxed is faster than with std::memory_order_seq_cst. However, the performance was the same in both cases.
My compiler: gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
Build arguments: g++ -m64 -O3 main.cpp -std=c++17 -lpthread
CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz, 4 cores, 2 threads per core
Test code:

#include <vector>
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <functional>

std::atomic<int> cnt = {0};

void run_test_order_relaxed()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int n = 0; n < 30000000; ++n) {
                cnt.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    std::cout << "rel: " << cnt.load(std::memory_order_relaxed);
    for (auto& t : v)
        t.join();
}

void run_test_order_cst()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 4; ++n) {
        v.emplace_back([]() {
            for (int n = 0; n < 30000000; ++n) {
                cnt.fetch_add(1, std::memory_order_seq_cst);
            }
        });
    }
    std::cout << "cst: " << cnt.load(std::memory_order_seq_cst);
    for (auto& t : v)
        t.join();
}

void measure_duration(const std::function<void()>& func)
{
    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    func();
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>( t2 - t1 ).count();
    std::cout << " duration: " << duration << "ms" << std::endl;
}

int main()
{
    measure_duration(&run_test_order_relaxed);
    measure_duration(&run_test_order_cst); 
    return 0;
}

Why do std::memory_order_relaxed and std::memory_order_seq_cst always produce almost the same results?
Result:
rel: 2411 duration: 4440ms
cst: 120000164 duration: 4443ms

Answer

Regardless of the memory order setting, you are performing an atomic read-modify-write in both loops. It turns out that on x86 processors, which are inherently strongly ordered in most situations, this results in the same asm being emitted for each fetch_add: lock xadd. This atomic operation on x86 is always sequentially consistent, so there is no optimization opportunity here when the relaxed memory order is specified.
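
A quick way to see this for yourself is to compile two tiny functions and compare the generated assembly (for example with g++ -O3 -S, or on Compiler Explorer). The function and variable names below are illustrative, not taken from the question:

#include <atomic>

std::atomic<int> counter{0};

// On x86-64, GCC emits the same lock-prefixed read-modify-write instruction
// for both orders: `lock xadd` when the old value is used, or `lock add`
// when the result is discarded. The memory order argument does not change it.
int add_relaxed() { return counter.fetch_add(1, std::memory_order_relaxed); }
int add_seq_cst() { return counter.fetch_add(1, std::memory_order_seq_cst); }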

Using relaxed memory order permits optimization of the surrounding operations, but your code doesn't offer any such opportunities, so the emitted code is the same. Note that the results could differ on a weakly-ordered processor (e.g., ARM), or with more data manipulation inside the loop (which would offer more reordering opportunities).
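
Where the ordering argument does change x86 code generation is, for example, a plain store: a relaxed (or release) store compiles to an ordinary mov, while a seq_cst store needs a full barrier, which GCC typically implements with xchg (or mov followed by mfence). A minimal sketch for comparison, again with illustrative names:

#include <atomic>

std::atomic<int> flag{0};

// Relaxed store: a plain `mov` on x86-64, no barrier.
void store_relaxed() { flag.store(1, std::memory_order_relaxed); }

// Sequentially consistent store: requires a full barrier on x86-64
// (`xchg`, or `mov` + `mfence`), which drains the store buffer and is
// noticeably more expensive.
void store_seq_cst() { flag.store(1, std::memory_order_seq_cst); }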

From cppreference (my italics):

std::memory_order specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation.
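
In other words, the ordering argument mostly governs the non-atomic reads and writes around the atomic operation. A typical case where it matters is a producer/consumer handoff; the sketch below uses made-up names (payload, ready) purely for illustration:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // non-atomic write...
    ready.store(true, std::memory_order_release);  // ...may not be reordered past this store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // pairs with the release store
        ;
    assert(payload == 42); // guaranteed here; with relaxed on both sides it would not be
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}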

The paper Memory Models for C/C++ Programmers provides much greater detail on this.

As a side note, repeatedly running atomic benchmarks, or running them on different x86 processors (even from the same manufacturer), may produce dramatically different results: the threads might not be distributed equally across all the cores, and cache latency depends on whether the data is in the local core's cache, in another core on the same chip, or on another chip. The outcome is also affected by how the particular processor handles potential consistency conflicts. Furthermore, the L1, L2, and L3 caches behave differently, as does RAM, so the total size of the data set also has a significant effect. See Evaluating the Cost of Atomic Operations on Modern Architectures.
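
If you want runs to be more comparable, one option on Linux (purely a sketch, not part of the original question) is to pin each worker thread to a fixed logical CPU through std::thread::native_handle() and pthread_setaffinity_np:

#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a std::thread to one logical CPU so repeated runs use the same core
// layout. Linux/GNU-specific; returns false if the affinity call fails.
bool pin_to_cpu(std::thread& t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
}

For example, calling pin_to_cpu(v.back(), n) right after each emplace_back in the test loops would keep thread n on the same core across runs.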
