What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

Question

Two different threads within a single process can share a common memory location by reading and/or writing to it.

Usually, such (intentional) sharing is implemented using atomic operations using the lock prefix on x86, which has fairly well-known costs both for the lock prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).
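
For reference, the lock-prefixed flavor looks like this (a minimal sketch, separate from the benchmark below); any std::atomic read-modify-write compiles to a lock-prefixed instruction on x86, regardless of the memory order requested:

#include <atomic>

std::atomic<unsigned> counter{0};

void writer()
{
    // Read-modify-write: emits "lock xadd"/"lock add" on x86, paying the
    // lock-prefix cost even when no other thread touches the line.
    counter.fetch_add(1, std::memory_order_relaxed);
}

unsigned reader()
{
    // Plain load; loads need no lock prefix on x86.
    return counter.load(std::memory_order_relaxed);
}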

Here I'm interested in producer-consumer costs where a single thread P writes to a memory location, and another thread C reads from the memory location, both using plain reads and writes.
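
In other words (a minimal sketch; the names are illustrative): relaxed atomics keep the sharing well-defined in C++, but each access compiles to an ordinary mov on x86, with no lock prefix, which is exactly what the benchmark below measures:

#include <atomic>

std::atomic<unsigned> box{0};

void producer(unsigned v)        // thread P
{
    box.store(v, std::memory_order_relaxed);     // plain "mov" store on x86
}

unsigned consumer()              // thread C
{
    return box.load(std::memory_order_relaxed);  // plain "mov" load on x86
}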

What is the latency and throughput of such an operation when performed on separate cores on the same socket, and, in comparison, when performed on sibling hyperthreads on the same physical core, on recent x86 CPUs?

In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.

Answer

Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.

#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>


// alignas(128) keeps each variable on its own pair of 64-byte cache lines
// (covering the adjacent-line prefetcher), so no accidental false sharing
// creeps in between the benchmark's globals.
alignas(128) static uint64_t data[SIZE];
alignas(128) static std::atomic<unsigned> shared;
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;
#endif
alignas(128) static std::atomic<bool> stop_producer;
alignas(128) static std::atomic<uint64_t> elapsed;

// Read the time-stamp counter; good enough as a cycle counter here.
static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}

// Consumer: each inner iteration loads one array element plus the shared
// variable; returning the sum keeps the compiler from dropping the loads.
static void * consume(void *)
{
    uint64_t    value = 0;
    uint64_t    start = rdtsc();

    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }

    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}

// Producer: hammers stores in a tight loop. With EMPTY_PRODUCER it writes an
// unshared location (same CPU usage, no coherence traffic); otherwise it
// keeps stealing the cache line holding `shared` from the consumer.
static void * produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#endif
    } while (!stop_producer);
    return nullptr;
}



int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) { data[idx] = 1; }
    shared = 0;
    stop_producer = false;

    // Pin each thread to its requested CPU via creation-time affinity.
    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout <<"Elapsed cycles: " <<elapsed <<std::endl;
    return 0;
}

Compile with the following command, replacing defines:

gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing

Where:

  • CONSUMER_CPU is the number of the CPU to run the consumer thread on.
  • PRODUCER_CPU is the number of the CPU to run the producer thread on.
  • SIZE is the size of the inner loop (the amount of data that stays in cache).
  • LOOPS is, well... the number of times the inner loop is repeated.
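
The listing also honors an EMPTY_PRODUCER define: when set, the producer stores to its own unshared location, burning the same CPU resources without touching shared. To build that variant, add the define to the same command line (the output name here is arbitrary):

gcc -std=c++11 -DEMPTY_PRODUCER -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing-empty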

Here are the generated loops:

Consumer thread:

  400cc8:       ba 80 24 60 00          mov    $0x602480,%edx
  400ccd:       0f 1f 00                nopl   (%rax)
  400cd0:       8b 05 2a 17 20 00       mov    0x20172a(%rip),%eax        # 602400 <shared>
  400cd6:       48 83 c2 08             add    $0x8,%rdx
  400cda:       48 03 42 f8             add    -0x8(%rdx),%rax
  400cde:       48 01 c1                add    %rax,%rcx
  400ce1:       48 81 fa 80 24 70 00    cmp    $0x702480,%rdx
  400ce8:       75 e6                   jne    400cd0 <_ZL7consumePv+0x20>
  400cea:       83 ee 01                sub    $0x1,%esi
  400ced:       75 d9                   jne    400cc8 <_ZL7consumePv+0x18>

Producer thread, with empty loop (no writing to shared):

  400c90:       c7 05 e6 16 20 00 00    movl   $0x0,0x2016e6(%rip)        # 602380 <unshared>
  400c97:       00 00 00 
  400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax        # 602300 <stop_producer>
  400ca1:       84 c0                   test   %al,%al
  400ca3:       74 eb                   je     400c90 <_ZL7producePv>

Producer thread, writing to shared:

  400c90:       c7 05 66 17 20 00 00    movl   $0x0,0x201766(%rip)        # 602400 <shared>
  400c97:       00 00 00 
  400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax        # 602300 <stop_producer>
  400ca1:       84 c0                   test   %al,%al
  400ca3:       74 eb                   je     400c90 <_ZL7producePv>

The program counts the number of CPU cycles consumed on the consumer's core to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repeatedly writing to shared.

My system has an i5-4210U. That is, 2 cores, 2 threads per core. They are exposed by the kernel as Core#1 → cpu0, cpu2 and Core#2 → cpu1, cpu3.
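
To verify that mapping on any Linux box, one can read the topology files the kernel exports under sysfs (a minimal sketch; thread_siblings_list enumerates the logical CPUs sharing each physical core):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // cpu0, cpu1, ... each expose the logical CPUs sharing their physical core.
    for (int cpu = 0; ; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                        + "/topology/thread_siblings_list");
        if (!f)
            break;                       // ran out of CPUs
        std::string siblings;
        std::getline(f, siblings);
        std::cout << "cpu" << cpu << ": " << siblings << "\n";
    }
    return 0;
}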

Results with the producer not being started at all:

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3          n/a           2.11G              1.80G

Results with empty producer. For 1G operations (either 1000*1M or 8000*128k).

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3           3            3.20G              3.26G       # mono
    3           2            2.10G              1.80G       # other core
    3           1            4.18G              3.24G       # same core, HT

As expected, since both threads are CPU hogs and both get a fair share, the producer burning cycles slows down the consumer by about half. That's just CPU contention.

With the producer on cpu#2, as there is no interaction, the consumer runs with no impact from the producer running on another CPU.

With the producer on cpu#1, we see hyperthreading at work.
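
Even with nothing but plain loads and stores in flight, the two hyper-siblings split the core's execution resources: 4.18G cycles versus 2.10G for the 1M case is roughly a 2x slowdown, consistent with the two threads competing for the same execution ports.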

Results with the disruptive producer:

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3           3            4.26G              3.24G       # mono
    3           2           22.1 G             19.2 G       # other core
    3           1           36.9 G             37.1 G       # same core, HT
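
A rough per-access conversion (my arithmetic from the tables above): each run performs about 1G ≈ 2^30 inner-loop iterations, so cycles per iteration ≈ elapsed / (LOOPS × SIZE). For the 128k column that gives roughly 1.8G / 2^30 ≈ 1.7 cycles per iteration with no producer, 19.2G / 2^30 ≈ 18 cycles with the disruptive producer on the other core, and 37.1G / 2^30 ≈ 35 cycles with it on the hyper-sibling.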

  • When we schedule both threads on the same thread of the same core, there is no impact. Expected again, as the producer writes remain local, incurring no synchronization cost.

      I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.
