Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size


Problem Description


    C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first, I thought it was just a portable way to get the size of an L1 cache line, but that is an oversimplification.

    Questions:

    • How are these constants related to the L1 cache line size?
    • Is there a good example that demonstrates their use cases?
    • Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?

    Solution

    The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:

    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

    I'll quote a snippet of the rationale here for ease-of-reading:

    [...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.

    Uses of cache-line size fall into two broad categories:

    • Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
    • Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.

    The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]

    We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:

    • Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
    • Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.

    In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
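
    To make those two categories concrete, here is a minimal sketch of the intended C++17 usage (assuming a standard library that actually ships the constants; they are declared in <new>):

    #include <atomic>
    #include <new>  // std::hardware_destructive_interference_size, std::hardware_constructive_interference_size

    // Destructive use: keep two independently-updated counters on (probably)
    // separate cache lines so writer threads do not false-share.
    struct counters {
        alignas(std::hardware_destructive_interference_size) std::atomic<unsigned> produced{0};
        alignas(std::hardware_destructive_interference_size) std::atomic<unsigned> consumed{0};
    };

    // Constructive use: keep data that is always accessed together small and
    // aligned so it (probably) lands on a single cache line.
    struct alignas(std::hardware_constructive_interference_size) hot_pair {
        int first;
        int second;
    };
    static_assert(sizeof(hot_pair) <= std::hardware_constructive_interference_size, "");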


    "How are these constants related to the L1 cache line size?"

    In theory, pretty directly.

    Assume the compiler knows exactly what architecture you'll be running on - then these would almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)

    For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)
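
    If you are curious what your own toolchain picked, printing both values is a quick check (just a sketch; it requires a standard library that actually provides the constants):

    #include <iostream>
    #include <new>

    int main() {
        std::cout << "destructive:  " << std::hardware_destructive_interference_size << '\n'
                  << "constructive: " << std::hardware_constructive_interference_size << '\n';
    }

    On mainstream x86-64 implementations that provide them, both typically come out as 64.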


    "Is there a good example that demonstrates their use cases?"

    At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.

    It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in the L1 cache-line, and in the other a single element takes up the L1 cache-line. In a tight loop, a single, fixed element is chosen from the array and updated repeatedly.

    It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints in the pair do not fit together in a single L1 cache line, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.

    Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.

    I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).

    Warning: the test will use all cores on your machine, and allocate ~256MB of memory. Don't forget to compile with optimizations!

    On my machine, the output is:

    Hardware concurrency: 16
    sizeof(naive_int): 4
    alignof(naive_int): 4
    sizeof(cache_int): 64
    alignof(cache_int): 64
    sizeof(bad_pair): 72
    alignof(bad_pair): 4
    sizeof(good_pair): 8
    alignof(good_pair): 4
    Running naive_int test.
    Average time: 0.0873625 seconds, useless result: 3291773
    Running cache_int test.
    Average time: 0.024724 seconds, useless result: 3286020
    Running bad_pair test.
    Average time: 0.308667 seconds, useless result: 6396272
    Running good_pair test.
    Average time: 0.174936 seconds, useless result: 6668457
    

    I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.


    "Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"

    This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.

    This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
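
    That lower bound can be spelled out as a compile-time check (a sketch; again it assumes a library that provides the constants):

    #include <cstddef>  // std::max_align_t
    #include <new>      // the interference-size constants (C++17)

    static_assert(std::hardware_destructive_interference_size >= alignof(std::max_align_t), "");
    static_assert(std::hardware_constructive_interference_size >= alignof(std::max_align_t), "");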

    In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:

    constexpr std::size_t cache_line_size() {
    #ifdef KNOWN_L1_CACHE_LINE_SIZE
      return KNOWN_L1_CACHE_LINE_SIZE;
    #else
      return std::hardware_destructive_interference_size;
    #endif
    }
    

    During compilation, if you want to assume a cache-line size just define KNOWN_L1_CACHE_LINE_SIZE.
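
    A hypothetical use of that helper, with the macro supplied on the command line, might look like this (the flag value and the struct are invented for illustration; the helper is repeated so the snippet stands alone):

    // hypothetical build for a machine known to have 64-byte cache lines:
    //   g++ -std=c++17 -O2 -DKNOWN_L1_CACHE_LINE_SIZE=64 main.cpp
    #include <cstddef>
    #include <new>  // fallback constant

    constexpr std::size_t cache_line_size() {
    #ifdef KNOWN_L1_CACHE_LINE_SIZE
        return KNOWN_L1_CACHE_LINE_SIZE;
    #else
        return std::hardware_destructive_interference_size;
    #endif
    }

    // one counter per thread, padded to a full (assumed) cache line to avoid false-sharing
    struct alignas(cache_line_size()) padded_counter {
        std::size_t hits = 0;
    };
    static_assert(alignof(padded_counter) == cache_line_size(), "");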

    Hope this helps!

    Benchmark program:

    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <future>
    #include <iostream>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <tuple>
    #include <vector>
    
    // !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
    constexpr std::size_t hardware_destructive_interference_size = 64;
    
    // !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
    constexpr std::size_t hardware_constructive_interference_size = 64;
    
    constexpr unsigned kTimingTrialsToComputeAverage = 100;
    constexpr unsigned kInnerLoopTrials = 1000000;
    
    typedef unsigned useless_result_t;
    typedef double elapsed_secs_t;
    
    //////// CODE TO BE SAMPLED:
    
    // wraps an int, default alignment allows false-sharing
    struct naive_int {
        int value;
    };
    static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");
    
    // wraps an int, cache alignment prevents false-sharing
    struct cache_int {
        alignas(hardware_destructive_interference_size) int value;
    };
    static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");
    
    // wraps a pair of int, purposefully pushes them too far apart for true-sharing
    struct bad_pair {
        int first;
        char padding[hardware_constructive_interference_size];
        int second;
    };
    static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");
    
    // wraps a pair of int, ensures they fit nicely together for true-sharing
    struct good_pair {
        int first;
        int second;
    };
    static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");
    
    // accesses a specific array element many times
    template <typename T, typename Latch>
    useless_result_t sample_array_threadfunc(
        Latch& latch,
        unsigned thread_index,
        T& vec) {
        // prepare for computation
        std::random_device rd;
        std::mt19937 mt{ rd() };
        std::uniform_int_distribution<int> dist{ 0, 4096 };
    
        auto& element = vec[vec.size() / 2 + thread_index];
    
        latch.count_down_and_wait();
    
        // compute
        for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
            element.value = dist(mt);
        }
    
        return static_cast<useless_result_t>(element.value);
    }
    
    // accesses a pair's elements many times
    template <typename T, typename Latch>
    useless_result_t sample_pair_threadfunc(
        Latch& latch,
        unsigned thread_index,
        T& pair) {
        // prepare for computation
        std::random_device rd;
        std::mt19937 mt{ rd() };
        std::uniform_int_distribution<int> dist{ 0, 4096 };
    
        latch.count_down_and_wait();
    
        // compute
        for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
            pair.first = dist(mt);
            pair.second = dist(mt);
        }
    
        return static_cast<useless_result_t>(pair.first) +
            static_cast<useless_result_t>(pair.second);
    }
    
    //////// UTILITIES:
    
    // utility: allow threads to wait until everyone is ready
    class threadlatch {
    public:
        explicit threadlatch(const std::size_t count) :
            count_{ count }
        {}
    
        void count_down_and_wait() {
            std::unique_lock<std::mutex> lock{ mutex_ };
            if (--count_ == 0) {
                cv_.notify_all();
            }
            else {
                cv_.wait(lock, [&] { return count_ == 0; });
            }
        }
    
    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        std::size_t count_;
    };
    
    // utility: runs a given function in N threads
    std::tuple<useless_result_t, elapsed_secs_t> run_threads(
        const std::function<useless_result_t(threadlatch&, unsigned)>& func,
        const unsigned num_threads) {
        threadlatch latch{ num_threads + 1 };
    
        std::vector<std::future<unsigned>> futures;
        std::vector<std::thread> threads;
        for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
            std::packaged_task<unsigned()> task{
                std::bind(func, std::ref(latch), thread_index)
            };
    
            futures.push_back(task.get_future());
            threads.push_back(std::thread(std::move(task)));
        }
    
        const auto starttime = std::chrono::high_resolution_clock::now();
    
        latch.count_down_and_wait();
        for (auto& thread : threads) {
            thread.join();
        }
    
        const auto endtime = std::chrono::high_resolution_clock::now();
        const auto elapsed = std::chrono::duration_cast<
            std::chrono::duration<double>>(
                endtime - starttime
                ).count();
    
        unsigned result = 0;
        for (auto& future : futures) {
            result += future.get();
        }
    
        return std::make_tuple(result, elapsed);
    }
    
    // utility: sample the time it takes to run func on N threads
    void run_tests(
        const std::function<useless_result_t(threadlatch&, unsigned)>& func,
        const unsigned num_threads) {
        useless_result_t final_result = 0;
        double avgtime = 0.0;
        for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
            const auto result_and_elapsed = run_threads(func, num_threads);
            const auto result = std::get<useless_result_t>(result_and_elapsed);
            const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);
    
            final_result += result;
            avgtime = (avgtime * trial + elapsed) / (trial + 1);
        }
    
        std::cout
            << "Average time: " << avgtime
            << " seconds, useless result: " << final_result
            << std::endl;
    }
    
    int main() {
        const auto cores = std::thread::hardware_concurrency();
        std::cout << "Hardware concurrency: " << cores << std::endl;
    
        std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
        std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
        std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
        std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
        std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
        std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
        std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
        std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;
    
        {
            std::cout << "Running naive_int test." << std::endl;
    
            std::vector<naive_int> vec;
            vec.resize((1u << 28) / sizeof(naive_int));  // allocate 256 mebibytes
    
            run_tests([&](threadlatch& latch, unsigned thread_index) {
                return sample_array_threadfunc(latch, thread_index, vec);
            }, cores);
        }
        {
            std::cout << "Running cache_int test." << std::endl;
    
            std::vector<cache_int> vec;
            vec.resize((1u << 28) / sizeof(cache_int));  // allocate 256 mebibytes
    
            run_tests([&](threadlatch& latch, unsigned thread_index) {
                return sample_array_threadfunc(latch, thread_index, vec);
            }, cores);
        }
        {
            std::cout << "Running bad_pair test." << std::endl;
    
            bad_pair p;
    
            run_tests([&](threadlatch& latch, unsigned thread_index) {
                return sample_pair_threadfunc(latch, thread_index, p);
            }, cores);
        }
        {
            std::cout << "Running good_pair test." << std::endl;
    
            good_pair p;
    
            run_tests([&](threadlatch& latch, unsigned thread_index) {
                return sample_pair_threadfunc(latch, thread_index, p);
            }, cores);
        }
    }
    
