Creating a friendly timed busy loop for a hyperthread


Question

Imagine I want a main thread and a helper thread to run as the two hyperthreads of the same physical core (probably by forcing their affinity, which approximately ensures this).
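A minimal, Linux-only sketch of the affinity step. It assumes the two sibling CPU ids are read from /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, which holds text such as 0,4 or 0-1 on a two-way SMT core. The helper names parse_siblings and pin_self are invented for this illustration; sched_setaffinity with a pid of 0 applies to the calling thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for cpu_set_t and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Parse a two-entry sibling list such as "0,4" or "0-1".
 * Returns 0 on success, -1 if the text does not have that shape. */
int parse_siblings(const char *text, int *first, int *second)
{
    char sep;
    assert(text && first && second);
    if (sscanf(text, "%d%c%d", first, &sep, second) != 3)
        return -1;
    return (sep == ',' || sep == '-') ? 0 : -1;
}

#ifdef __linux__
/* Pin the calling thread to one logical CPU. Returns 0 on success. */
int pin_self(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
#endif
```

The main thread would call pin_self with one id and the helper thread with the other. Pinning only constrains where the two threads run; it does not stop the scheduler from placing other work on the same core.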

The main thread will be doing important, high-IPC, CPU-bound work. The helper thread should do nothing other than periodically update a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are too slow to sleep and wake on a 10 nanosecond (100 MHz) period.

So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: it should use as few execution resources as possible, and so add as little overhead as possible to the main thread.

I guess the idea would be a long-latency instruction that doesn't use many resources, like pause, and that also has a fixed and known latency L. That would let us calibrate the "sleep" period so that no clock read is even needed (to update with period P, we just issue P/L of these instructions for a calibrated busy-sleep). Unfortunately, pause doesn't meet that latter criterion, as its latency varies a lot [1].
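The P/L calibration described above could look roughly like the following sketch. The names delay_count, busy_sleep and CPU_RELAX are hypothetical, pause (via _mm_pause) stands in for whatever delay instruction is chosen, and the latency L is assumed to have been measured beforehand on the target CPU.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()
#else
/* Non-x86 fallback so the sketch still compiles; only a compiler barrier. */
#define CPU_RELAX() __asm__ volatile("" ::: "memory")
#endif

/* Number of delay instructions that fill one period: P / L, rounded down,
 * but never fewer than one. Both arguments are in CPU cycles. */
uint64_t delay_count(uint64_t period_cycles, uint64_t latency_cycles)
{
    assert(latency_cycles > 0);
    uint64_t n = period_cycles / latency_cycles;
    return n ? n : 1;
}

/* Calibrated busy-sleep: issue n delay instructions without reading a clock. */
void busy_sleep(uint64_t n)
{
    for (uint64_t i = 0; i < n; i++)
        CPU_RELAX();
}
```

For a 40-cycle period and a 10-cycle delay instruction, delay_count(40, 10) gives 4. The loop counter and branch add their own cycles on top, so a real calibration would have to measure the whole loop rather than the instruction alone.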

A second option would be to use a long-latency instruction even if its latency is unknown, and after every instruction do an rdtsc or some other clock read (clock_gettime, etc.) to see how long we actually slept. That seems like it might slow down the main thread a lot, though.

Are there any better options?

[1] Also, pause has some specific semantics around preventing speculative memory accesses, which may or may not be beneficial in this sibling-thread scenario, since I'm not really in a spin-wait loop.

Recommended answer

Some random thoughts.

So you want a timestamp sampled at 100 MHz, which means that on a 4 GHz CPU you have 40 cycles between each sample.
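Spelled out, the cycle budget follows directly from the two frequencies (the 4 GHz clock is the answer's example figure, not a measured value):

```latex
\[
\frac{f_{\text{cpu}}}{f_{\text{sample}}}
  = \frac{4 \times 10^{9}\,\text{Hz}}{100 \times 10^{6}\,\text{Hz}}
  = 40 \ \text{cycles per sample},
\qquad
T_{\text{sample}} = \frac{1}{100\,\text{MHz}} = 10\,\text{ns}.
\]
```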

The timer thread busily reads the real-time clock (RDTSC?), but it can't use the safe method with cpuid, as that takes 100 cycles. The old real-time clock has a latency of around 25 cycles (and a throughput of 1/25); there might be a slightly newer, slightly more accurate timer with slightly more latency (32 cycles).

  start:
    time = read clock                          (about 25 cycles)
    tmp = time - last                          (1 cycle)
    if tmp < cycles between samples goto start
    last += cycles between samples
    sample = time
    goto start

In a perfect world the branch predictor will guess right every time. In reality it will mispredict at random because of variance in the clock-read latency, adding 5-14 cycles to the loop's 26 cycles.
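The polling loop described above could be sketched in C as follows. This is an illustration under stated assumptions, not the answerer's code: read_cycles, shared_sample and timer_loop are invented names, the max_samples parameter exists only so that the sketch terminates, and non-x86 builds fall back to clock_gettime (so the period is then in nanoseconds rather than cycles).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_cycles(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: nanoseconds from the monotonic clock. */
static inline uint64_t read_cycles(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Shared timestamp: written by the helper thread, read by the main thread. */
_Atomic uint64_t shared_sample;

/* Publish a fresh timestamp every `period` ticks, `max_samples` times.
 * Returns the number of samples published. */
uint64_t timer_loop(uint64_t period, uint64_t max_samples)
{
    assert(period > 0);
    uint64_t last = read_cycles();
    uint64_t published = 0;
    while (published < max_samples) {
        uint64_t now = read_cycles();      /* "read time" */
        if (now - last < period)           /* period not yet elapsed */
            continue;
        last += period;                    /* fixed step, so error does not accumulate */
        atomic_store_explicit(&shared_sample, now, memory_order_relaxed);
        published++;
    }
    return published;
}
```

Advancing last by a fixed step rather than setting it to now keeps the average rate on target, at the price of a short burst of back-to-back updates if the helper ever falls behind.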

When the sample is written, the other thread will have its instructions cancelled, starting from the first speculative load from this cache line (remember to align the sample to 64 bytes so that no other data is affected). The load of the sample timestamp then starts over after a delay of ~5-14 cycles, depending on where the instructions come from: the loop buffer, the micro-op cache or the I-cache.
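One way to follow the 64-byte alignment advice is to give the sample its own cache line with C11 alignas. The sketch below assumes a 64-byte line, which is typical for current x86 parts but should be checked for the target CPU; struct sample_slot, publish_sample and read_sample are names made up for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* The over-aligned member forces the whole struct onto its own line,
 * and sizeof is rounded up to a multiple of the alignment. */
struct sample_slot {
    alignas(CACHE_LINE) _Atomic uint64_t value;
};

static_assert(sizeof(struct sample_slot) == CACHE_LINE,
              "sample must occupy exactly one cache line");

/* Helper thread side: store the newest timestamp. */
void publish_sample(struct sample_slot *slot, uint64_t now)
{
    atomic_store_explicit(&slot->value, now, memory_order_relaxed);
}

/* Main thread side: read the most recently published timestamp. */
uint64_t read_sample(struct sample_slot *slot)
{
    return atomic_load_explicit(&slot->value, memory_order_relaxed);
}
```

With this layout a write to the sample can only disturb loads of the sample itself, not neighbouring data that happens to share the line.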

So a minimum of 5 to 14 out of every 40 cycles (roughly 12% to 35%) of performance will be lost, in addition to half the CPU being used by the other thread.

On the other hand, reading the real-time clock directly in the main thread would cost about 1/4 cycle, and the latency will most likely be covered by other instructions. But then you can't vary the frequency. The long latency of 25 cycles could be a problem unless some other long-latency instructions precede it.

Using a CAS instruction (lock xchg?) might partly solve the problem, since the loads then shouldn't cause instructions to be reissued; instead it results in a delay on all following reads and writes.
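As a rough sketch of that idea, the helper could publish with a C11 atomic exchange, which GCC and Clang compile to xchg with a memory operand on x86 (that form is implicitly locked). Whether this actually avoids the reissue cost on the reading thread is the answer's speculation and would need to be measured. The names sample_word and publish_xchg are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared timestamp word, assumed to sit on its own cache line in real use. */
_Atomic uint64_t sample_word;

/* Publish `now` with an atomic exchange and return the previous value. */
uint64_t publish_xchg(uint64_t now)
{
    return atomic_exchange_explicit(&sample_word, now, memory_order_seq_cst);
}
```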
