One thread counting, other thread does a job and measurement


Problem description

I would like to implement a two-thread model where one thread counts (incrementing a value forever) and the other reads the counter, does its job, reads the counter again, and measures the time elapsed between the two readings.

Here is what I have done so far:

// global counter
register unsigned long counter asm("r13");
// unsigned long counter;

void* counter_thread(){
    // affinity is set to some isolated CPU so the noise will be minimal

    while(1){
        //counter++; // Line 1*
        asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
    }
}

void* measurement_thread(){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1,r2;
    sleep(1.0);
    while(1){
        mfence();
        r1 = counter;
        a *=3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2-r1;
        printf("counter:%ld \n", counter);
        break;
    }
}

Let me explain what I have done so far:

Since I want the counter to be accurate, I am setting the affinity to an isolated CPU. Also, if I use the counter++ in Line 1*, the disassembled function will be:

 d4c:   4c 89 e8                mov    %r13,%rax
 d4f:   48 83 c0 01             add    $0x1,%rax
 d53:   49 89 c5                mov    %rax,%r13
 d56:   eb f4                   jmp    d4c <counter_thread+0x37>

This is not a 1-cycle operation. That is why I used inline assembly to get rid of the 2 mov instructions. Using the inline assembly:

 d4c:   49 83 c5 01             add    $0x1,%r13
 d50:   eb fa                   jmp    d4c <counter_thread+0x37>

But the thing is, neither implementation works. The other thread cannot see the counter being updated. If I make the global counter an ordinary variable instead of a register, it works, but I want to be precise. If I declare the global counter as unsigned long counter, then the disassembled code of the counter thread is:

 d4c:   48 8b 05 ed 12 20 00    mov    0x2012ed(%rip),%rax        # 202040 <counter>
 d53:   48 83 c0 01             add    $0x1,%rax
 d57:   48 89 05 e2 12 20 00    mov    %rax,0x2012e2(%rip)        # 202040 <counter>
 d5e:   eb ec                   jmp    d4c <counter_thread+0x37>

It works but it doesn't give me the granularity that I want.

EDIT:

My environment:

  • CPU: AMD Ryzen 3600
  • kernel: 5.0.0-32-generic
  • OS: Ubuntu 18.04

EDIT2: I have isolated 2 neighboring CPU cores (i.e. cores 10 and 11) and am running the experiment on those cores. The counter is on one of the cores, the measurement is on the other. Isolation is done by adding an isolcpus line to the /etc/default/grub file.

EDIT3: I know that one measurement is not enough. I have run the experiment 10 million times and looked at the results.

Experiment1: Setup:

unsigned long counter =0;//global counter 
void* counter_thread(){
    mfence();
    while(1)
        counter++;
}
void* measurement_thread(){
    unsigned long i=0, r1=0,r2=0;
    unsigned int a=0;
    sleep(1.0);
    while(1){
        mfence();
        r1 = counter;
        a +=3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;   
    }
}

Results1: In 99.99% of the cases I got 0, which I expected: either the first thread is not running, or the OS or other interrupts disturb the measurement. After discarding the 0s and the very high values, I get about 20 cycles per measurement on average. (I was expecting 3-4 because I only do an integer addition.)

Experiment2:

Setup: Identical to the above, with one difference: instead of a global counter in memory, I declare the counter as a register variable:

register unsigned long counter asm("r13");

Results2: The measurement thread always reads 0. In the disassembled code, I can see that both threads are dealing with the R13 register (counter); however, I suspect that it is somehow not shared.

Experiment3:

Setup: Identical to setup2, except that in the counter thread, instead of doing counter++, I use inline assembly to make sure that I am doing a 1-cycle operation. My disassembled file looks like this:

 cd1:   49 83 c5 01             add    $0x1,%r13
 cd5:   eb fa                   jmp    cd1 <counter_thread+0x37>

Results3: The measurement thread reads 0, as above.

Solution

Each thread has its own registers. Each logical CPU core has its own architectural registers which a thread uses when running on a core. Only signal handlers (or on bare metal, interrupts) can modify the registers of their thread.

Declaring a GNU C asm register-global like your ... asm("r13") in a multi-threaded program effectively gives you thread-local storage, not a truly shared global.

Only memory is shared between threads, not registers. This is how multiple threads can run at the same time without stepping on each other, each using their registers.

Registers that you don't declare as register-global can be used freely by the compiler, so it wouldn't work at all for them to be shared between cores. (And there's nothing GCC can do to make them shared vs. private depending on how you declare them.)

Even apart from that, the register global isn't volatile or atomic, so r1 = counter; and r2 = counter; can be merged by common-subexpression elimination (CSE), which makes r2-r1 a compile-time-constant zero even if your local R13 was changing from a signal handler.
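To illustrate the point that only memory is shared, here is a minimal sketch of a counter that another thread can actually observe. It assumes C11 atomics and a single writer thread; the names counter_tick, read_counter, and stop_counting are hypothetical and not from the question. Relaxed atomics keep the compiler from caching or merging the two reads, without adding a lock prefix to the increment.

```c
#include <stdatomic.h>
#include <stddef.h>

/* The counter lives in memory, so every thread sees the same object. */
static _Atomic unsigned long counter;
/* Hypothetical stop flag so the counting loop can be ended cleanly. */
static atomic_bool stop_counting;

/* One increment. Only one thread writes, so a plain load + store is
 * enough; no atomic read-modify-write is needed. */
static inline void counter_tick(void) {
    unsigned long v = atomic_load_explicit(&counter, memory_order_relaxed);
    atomic_store_explicit(&counter, v + 1, memory_order_relaxed);
}

/* Thread body with a pthread-compatible signature. */
void *counter_thread(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&stop_counting, memory_order_relaxed))
        counter_tick();
    return NULL;
}

/* Each call is a real load from memory; two calls cannot be merged by CSE. */
unsigned long read_counter(void) {
    return atomic_load_explicit(&counter, memory_order_relaxed);
}
```

The counting thread could be started with pthread_create(&tid, NULL, counter_thread, NULL), and the measurement thread would call read_counter() before and after the work. Every read still has to pull the cache line from the other core, so this shows correctness, not a low-overhead timer.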


How can I make sure that both of the threads are using registers for read/write operation of the counter value?

You can't do that. There is no shared state between cores that can be read/written with lower latency than cache.

If you want to time something, consider using rdtsc to get reference cycles, or rdpmc to read a performance counter (which you might have set up to be counting core clock cycles).

Using another thread to increment a counter is unnecessary, and not helpful because there's no very-low-overhead way to read something from another core.


The rdtscp instruction in my machine gives 36-72-108... cycle resolution at best. So, I cannot distinguish the difference between 2 cycles and 35 cycles because both of them will give 36 cycles.

Then you're using rdtsc wrong. It's not serializing, so you need lfence around the timed region. See my answer on "How to get the CPU cycle count in x86_64 from C++?". But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
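As a rough sketch of fencing a timed region, assuming GCC or Clang on x86-64 with the __rdtsc and _mm_lfence intrinsics from <x86intrin.h>: the helper names tsc_fenced and time_triple_ticks and the sink variable are hypothetical. Note that the TSC counts reference ticks rather than core clock cycles, and the result includes the overhead of the fences and of rdtsc itself.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Keeps the computed value observable so the work is not discarded. */
static volatile uint64_t sink;

/* Read the TSC with lfence on both sides so that earlier instructions
 * have finished and later ones have not started when the read happens. */
static inline uint64_t tsc_fenced(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Time one lea that computes a * 3. The return value is in TSC
 * reference ticks and is dominated by measurement overhead. */
uint64_t time_triple_ticks(uint64_t a) {
    uint64_t t0 = tsc_fenced();
    __asm__ volatile("lea (%0,%0,2), %0" : "+r"(a));
    uint64_t t1 = tsc_fenced();
    sink = a;
    return t1 - t0;
}
```

Running time_triple_ticks many times and taking the minimum gives a more stable number, but it mostly measures the fences and rdtsc, which is why a single instruction cannot be resolved this way.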

But more importantly, you can't usefully measure a *=3; in C in terms of a single cost in cycles. First of all, it can compile differently depending on context.

But assuming a normal lea eax, [rax + rax*2], a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/

See my answer on "RDTSCP in NASM always returns the same value" for more about timing a single instruction. Put it in a loop in different ways to measure throughput vs. latency, and look at perf counters to see which ports the uops run on. Or look at Agner Fog's instruction tables and https://uops.info/ because people have already done those tests.
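The "put it in a loop in different ways" idea can be sketched as follows, assuming GCC or Clang inline asm on x86-64. The function names lea_chain, lea_parallel, and ticks_per_iter are hypothetical. One loop forms a single dependency chain (bound by latency), the other runs four independent chains (closer to a throughput bound).

```c
#include <stdint.h>
#include <x86intrin.h>

/* Latency-bound: every lea needs the result of the previous one. */
uint64_t lea_chain(uint64_t a, long iters) {
    for (long i = 0; i < iters; i++)
        __asm__ volatile("lea (%0,%0,2), %0" : "+r"(a));
    return a;
}

/* Throughput-oriented: four independent chains can overlap in the
 * out-of-order core, so four lea instructions retire per iteration. */
uint64_t lea_parallel(uint64_t a, long iters) {
    uint64_t b = a, c = a, d = a;
    for (long i = 0; i < iters; i++)
        __asm__ volatile(
            "lea (%0,%0,2), %0\n\t"
            "lea (%1,%1,2), %1\n\t"
            "lea (%2,%2,2), %2\n\t"
            "lea (%3,%3,2), %3"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    return a + b + c + d;
}

/* Average TSC reference ticks per loop iteration, including loop
 * overhead. These are reference ticks, not core clock cycles. */
double ticks_per_iter(uint64_t (*fn)(uint64_t, long), long iters) {
    static volatile uint64_t sink;
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    sink = fn(1, iters);
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    return (double)(t1 - t0) / (double)iters;
}
```

Comparing ticks_per_iter(lea_chain, n) with ticks_per_iter(lea_parallel, n) / 4 for a large n gives a rough per-instruction latency versus throughput figure. Converting ticks to core cycles needs the ratio of core frequency to TSC frequency, or a performance counter instead.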

Again, these are how you time a single asm instruction, not a C statement. With optimization enabled the cost of a C statement can depend on how it optimizes into the surrounding code. (And/or whether latency of surrounding operations hides its cost, on an out-of-order execution CPU like all modern x86 CPUs.)
