What will be used for data exchange between threads executing on one Core with HT?


Problem description




Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel.

These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a stalled logical processor to borrow resources from the other one.

In an Intel CPU with Hyper-Threading, one CPU core (with several ALUs) can execute instructions from two threads in the same clock cycle. The two threads share the store buffer, the L1/L2 caches, and the system bus.

But if two threads execute simultaneously on one core, and thread-1 stores an atomic value while thread-2 loads it, what will be used for the exchange: the shared store buffer, the shared L1/L2 cache, or the L3 cache as usual?

And what happens if both threads belong to the same process (the same virtual address space), and what if they belong to two different processes (different virtual address spaces)?

Sandy Bridge Intel CPU - L1 cache:

  • 32 KB - cache size
  • 64 B - cache line size
  • 512 - lines (512 = 32 KB / 64 B)
  • 8-way set associative
  • 64 - number of sets (64 = 512 lines / 8 ways)
  • bits [11:6] of the virtual address select the set (this is the index, not the tag)
  • 4 K - addresses that are equal modulo 4 K compete for the same set (32 KB / 8-way)
  • low 12 bits - significant for determining the current set number

  • 4 KB - standard page size

  • low 12 bits - the same in virtual and physical addresses for each address
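
To make the scenario concrete, here is a minimal sketch of the exchange being asked about (the names and the choice of C++11 primitives are mine, purely illustrative; pinning the two threads onto sibling logical CPUs of one physical core is assumed to be done externally, e.g. with taskset):

    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<int> value{0};   // shared by both hyperthreads

    void writer() {              // thread-1: stores an atomic value
        value.store(42, std::memory_order_release);
    }

    void reader() {              // thread-2: loads it. Which shared structure
        int v;                   // (store buffer? L1/L2? L3?) services the load?
        while ((v = value.load(std::memory_order_acquire)) == 0) {
        }
        std::printf("reader saw %d\n", v);
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }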

Solution

I think worst case: you'll get a round-trip to L1. Even if store-forwarding doesn't happen, because of the difficulty of looking at data from not-yet-retired stores in the other hyperthread (first few paragraphs), L1 cache hits should still happen (last paragraphs).


Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.

There is a "shared mode":

In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes.

In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.

It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't include "slow mode" support when they introduced HT support in another uarch, since they knew they'd gotten "fast mode" to work correctly in netburst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.

The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way nehalem and SnB-family CPUs work.


On x86, normal aligned stores from GP registers are atomic. If you use C++11 std::atomic, a store with anything weaker than the default full sequential-consistency memory ordering compiles to just a normal store. (memory_order_seq_cst compiles to a store + mfence, to order the store with respect to later loads, since x86's memory model needs a StoreLoad barrier for seq_cst.) Atomic read-modify-write operations, like an atomic increment, produce a locked instruction like lock inc [mem], which will still leave the affected cache line hot in the core's L1 cache.
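
As an illustration of that mapping (the instructions in the comments are typical gcc/clang x86-64 output, shown for orientation; exact codegen varies by compiler and version):

    #include <atomic>

    std::atomic<int> x{0};

    void release_store() { x.store(1, std::memory_order_release); }
    // compiles to a plain store:          mov dword ptr [x], 1

    void seq_cst_store() { x.store(1); }   // default: memory_order_seq_cst
    // plain store + StoreLoad barrier:    mov dword ptr [x], 1 ; mfence
    // (some compilers emit a single xchg instead; it's also a full barrier)

    void increment() { x.fetch_add(1); }
    // locked read-modify-write:           lock add dword ptr [x], 1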

A store in one hyperthread and a load in the other hyperthread may be able to take advantage of store->load forwarding just like a store followed by a load within a single thread. I haven't tested this, but this is my best guess based on understanding of how HT shares and partitions some out-of-order resources. On Intel Sandybridge, the store->load latency is ~6 cycles. (e.g. add [mem], imm32 has a latency of 6 cycles.)
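
A crude way to observe that latency from a single thread (a sketch, not a calibrated benchmark; volatile is only there to force the compiler to keep the load and store in the loop, which ideally compiles to a memory-destination add):

    #include <chrono>
    #include <cstdio>

    int main() {
        volatile long counter = 0;       // forces a real load + store each iteration
        const long iters = 100000000;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i)
            counter = counter + 1;       // store->load forwarded dependency chain
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        // On a ~3 GHz Sandybridge-family core, expect roughly 6 cycles (~2 ns) per iteration.
        std::printf("%.2f ns per iteration\n", ns / iters);
    }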

However, store->load forwarding from not-yet-retired stores in one thread to loads in the other thread would make the loading thread subject to rollback if the storing thread discovered that the store needed to not retire (e.g. a branch mispredict before the store was discovered, or an earlier insn faulted). So it's entirely possible that the store->load latency between hyperthreads involves a full trip to L1.

So two threads running on the same core with hyperthreading still might see StoreLoad re-ordering, if store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core. I don't have a HT CPU, so I can't test myself. I didn't find anything about this in Intel's 2002 paper on hyperthreading in the netburst microarchitecture. That paper is really old, and the netburst microarch family is very different from either P6 or Sandybridge microarch families. SnB uses write-back L1 cache, not write-through.
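
A condensed version of that experiment (a sketch in the spirit of Preshing's test, not his exact code; the CPU numbers 0 and 1 are an assumption — on many Linux systems the two siblings of one physical core are e.g. 0 and 4, so check lscpu first):

    // Build with: g++ -O2 -pthread storeload.cpp
    #include <atomic>
    #include <thread>
    #include <cstdio>
    #include <pthread.h>
    #include <sched.h>

    std::atomic<int> X{0}, Y{0};
    std::atomic<int> start1{0}, start2{0}, done{0};
    int r1, r2;

    static void pin(int cpu) {               // pin the calling thread to one logical CPU
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        const int trials = 200000;
        int reorders = 0;

        std::thread a([&] {
            pin(0);                           // assumed sibling #1
            for (int t = 1; t <= trials; ++t) {
                while (start1.load(std::memory_order_acquire) < t) {}
                X.store(1, std::memory_order_relaxed);    // plain store...
                r1 = Y.load(std::memory_order_relaxed);   // ...then plain load
                done.fetch_add(1, std::memory_order_release);
            }
        });
        std::thread b([&] {
            pin(1);                           // assumed sibling #2
            for (int t = 1; t <= trials; ++t) {
                while (start2.load(std::memory_order_acquire) < t) {}
                Y.store(1, std::memory_order_relaxed);
                r2 = X.load(std::memory_order_relaxed);
                done.fetch_add(1, std::memory_order_release);
            }
        });

        for (int t = 1; t <= trials; ++t) {
            X.store(0); Y.store(0);           // reset, then release both threads
            done.store(0);
            start1.store(t, std::memory_order_release);
            start2.store(t, std::memory_order_release);
            while (done.load(std::memory_order_acquire) < 2) {}
            if (r1 == 0 && r2 == 0)           // both loads passed both stores
                ++reorders;
        }
        a.join(); b.join();
        std::printf("%d StoreLoad reorders in %d trials\n", reorders, trials);
    }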

Unless you use movNT weakly-ordered stores (which also bypass the cache, hence the Non-Temporal), the store data can't just go into a store buffer without going into L1 as part of the cache-coherency protocol. Load instructions do have to probe the store buffer for store->load forwarding, but as I said, IDK if load uops will probe stores from the other hyperthread.
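
For reference, this is what such a weakly-ordered store looks like from C++ (a minimal sketch; _mm_stream_si32 compiles to movnti, an SSE2 instruction):

    #include <emmintrin.h>   // SSE2 intrinsics: _mm_stream_si32, _mm_sfence

    void publish_nt(int *p, int v) {
        _mm_stream_si32(p, v);   // movnti: weakly-ordered, cache-bypassing store
        _mm_sfence();            // make the NT store globally visible before any
                                 // later stores, restoring normal ordering
    }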


L1 hits should be possible. Intel uses virtually indexed, physically tagged L1 caches in most (all?) of their designs (e.g. the Sandybridge family). Two threads of the same process will have the same virtual-to-physical mapping, so they will both look in the same set of the L1 cache (each set contains 8 tags, which are checked in parallel). The physical address is of course the same, so the load will hit in L1.

Two processes with a chunk of shared memory mapped at different virtual addresses won't have the same virtual address for the same physical address, but the low 12 bits will be the same (because those bits are the offset within the 4k page). The lowest 6 bits are the offset within a cache line, and the next 6 bits index a set: 2^6 = 64 sets, and 64 B * 64 sets * 8 ways per set = 32 kiB = the L1 cache size.
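
Spelled out as code, using the Sandybridge geometry above (a small sketch; the address is an arbitrary example):

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint64_t addr = 0x7ffd12345a40;      // arbitrary example address
        unsigned offset = addr & 0x3F;            // bits [5:0]: byte within the 64B line
        unsigned set    = (addr >> 6) & 0x3F;     // bits [11:6]: one of 64 sets
        // Bits [11:0] are the untranslated page offset, so this set index is the
        // same for the virtual and the physical address: VIPT indexing behaves
        // like PIPT here.
        std::printf("offset = %u, set = %u\n", offset, set);
    }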

A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion in a more recent answer might be of interest.


Since the cache-tags use physical addresses, they can be (and are) competitively shared by different processes on the two logical cores, without risk of false-positive cache hits.
