Can a hyper-threaded processor core execute two threads at the exact same time?


Question


I'm having a hard time understanding hyper-threading. If the logical core doesn't actually exist, what's the point of using hyper-threading? The Wikipedia article states that:

For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible.

If the two logical cores share the same execution unit, that means one of the threads will have to be put on hold while the other executes. That being said, I don't understand how hyper-threading can be useful, since you're not actually introducing a new execution unit. I can't wrap my head around this.

Solution

See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once (including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also see Modern Microprocessors: A 90-Minute Guide!

You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.

The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).

SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knight's Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.)


Common misconceptions

Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that.

With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order core can actually execute uops from both logical cores in the same cycle.

In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.

Note that up to 10 cache misses can be outstanding at once from L1D cache in Intel CPUs (this is the number of LFBs, Line Fill Buffers), and memory requests are pipelined. But if the address for the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for the two threads to be waiting on cache misses in parallel.

Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)


When one logical core "sleeps" (i.e. the kernel runs a HLT instruction, or uses MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full ReOrder Buffer size, and other statically-partitioned resources), so its ability to find and exploit ILP in the single thread still running increases more than when the other thread is simply stalled on a cache miss.


BTW, some workloads actually run slower with HT. If your working set barely fits in L2 or L1D cache, then running two threads on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated (like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.

On Skylake, I've found that video encoding (with x265 --preset slower, 1080p) is about 15% faster with 8 threads instead of 4, on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around, and at running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good considering that x265 has a lot of hand-written asm and runs very high instructions-per-cycle even when it has a whole core to itself. (Slower presets like I used tend to be more CPU-bound than memory-bound.)
