Why wasn't MASKMOVDQU extended to 256-bit and 512-bit stores?

Question

MASKMOVDQU¹ is special among x86 store instructions because, in principle, it allows you to store individual bytes in a cache line without first loading the entire cache line all the way to the core, so that the written bytes can be merged with the not-overwritten existing bytes.

It would seem to work using the same mechanism as an NT store: pushing the cache line down without first doing an RFO. Per the Intel software developer's manual (emphasis mine):


The MASKMOVQ instruction can be used to improve performance for algorithms that need to merge data on a byte-by-byte basis. *It should not cause a read for ownership*; doing so generates unnecessary bandwidth since data is written directly using the byte-mask without allocating old data prior to the store.

Unlike other NT stores, however, you can use a mask to specify which bytes are actually written.

In the case that you want to make sparse byte-granular writes across a large region which isn't likely to fit in any level of the cache, this instruction seems ideal.

Unlike almost every other useful instruction, Intel hasn't extended it to 256 or 512 bits in AVX/AVX2 or AVX-512. Does this indicate that use of this instruction is no longer recommended, or perhaps that it cannot be implemented efficiently on current or future architectures?

¹ ... and its 64-bit predecessor in MMX, MASKMOVQ.

Answer

I suspect that masked NT vector stores no longer work well for multi-core CPUs, so probably even the 128-bit version just sucks on modern x86 for masked writes, if there are any unmodified bytes in a full 64-byte line.

(Regular masked vector stores are back with a vengeance in AVX512BW with byte-masked vectors; masked commit to L1d cache seems to be efficiently supported for that, as well as for dword/qword masking with AVX1 vmaskmovps/pd and the integer equivalents, and with AVX512F.)

The SDRAM (including DDR4) bus protocol does support byte-masked writes (with 1 mask line per byte as part of a cache-line burst transfer). This Intel doc (about FPGAs or something) includes discussion of the DM (data mask) signals, confirming that DDR4 still has them, with the same function as the DQM lines described on Wikipedia for SDRAM https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#SDR_SDRAM. (DDR1 changed it to write-mask only, not read-mask.)

So the hardware functionality is there, and presumably modern x86 CPUs use it for single-byte writes to uncacheable memory, for example.

No-RFO stores are great if we write a full line: we just invalidate other copies of the line and store to memory.

John "Dr. Bandwidth" McCalpin says that normal NT stores that flush after filling a full 64-byte line will invalidate even lines that are dirty, without causing a writeback of the dirty data.

So masked NT stores need to use a different mechanism, because any masked-out bytes need to take their value from the dirty line in another core, not from whatever was in DRAM.

If the mechanism for partial-line NT stores isn't efficient, adding new instructions that create it is unwise. I don't know if it's more or less efficient than doing normal stores to part of a line, or if that depends on the situation and uarch.

It doesn't have to be an RFO exactly, but it would mean that when such a store reaches the memory controller, it would have to get the snoop filter to make sure the line is in sync, or maybe merge with the old contents from cache before flushing to DRAM.

Or the CPU core could do an RFO and merge, before sending the full-line write down the memory hierarchy.

CPUs do already need some kind of mechanism for flushing partial-line NT stores when reclaiming an LFB that hasn't had all 64 bytes written yet, and we know that's not as efficient. (But I forget the details.) Maybe this is how maskmovdqu executes on modern CPUs, either always or if you leave any bytes unmodified.

An experiment could probably find out.

So, TL:DR: maskmovdqu may only ever have been implemented efficiently in single-core CPUs. It originated in Katmai Pentium III with MMX maskmovq mm0, mm1; SMP systems existed, but maybe weren't the primary consideration for this instruction when it was being designed. SMP systems didn't have a shared last-level cache, but they did still have private write-back L1d caches on each socket.
