Why do SSE instructions preserve the upper 128 bits of the YMM registers?


Question

It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when mixing AVX-256 instructions with SSE instructions.

According to Intel's documentation, this happens because SSE instructions are defined to preserve the upper 128 bits of the YMM registers. To save power by not using the upper 128 bits of the AVX datapaths, the CPU stores those bits away when executing SSE code and reloads them when entering AVX code, and both the stores and the reloads are expensive.
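The store-and-reload behavior described above can be pictured as a small state machine. The following is only a rough sketch, not Intel's actual implementation: the state names, the `run` function, and the instruction labels (`"avx256"`, `"sse"`, `"vzeroupper"`) are all made up for illustration, and treating `vzeroupper` as always returning to the clean state is a simplifying assumption. It counts costly transitions, not cycles.

```python
from enum import Enum


class Upper(Enum):
    CLEAN = "clean"   # upper halves known to be zero
    DIRTY = "dirty"   # upper halves hold live data
    SAVED = "saved"   # upper halves stashed away while SSE code runs


def run(instructions):
    """Return the number of costly save/reload transitions in a stream."""
    state, penalties = Upper.CLEAN, 0
    for kind in instructions:
        if kind == "avx256":
            if state is Upper.SAVED:
                penalties += 1          # reload the stashed upper halves
            state = Upper.DIRTY
        elif kind == "sse":
            if state is Upper.DIRTY:
                penalties += 1          # stash the upper halves away
                state = Upper.SAVED
        elif kind == "vzeroupper":
            state = Upper.CLEAN         # simplification: always back to clean
        else:
            raise ValueError(kind)
    return penalties


mixed = ["avx256", "sse", "avx256", "sse"]
separated = ["avx256", "vzeroupper", "sse", "avx256", "vzeroupper", "sse"]
print(run(mixed), run(separated))  # prints "3 0"
```

In this toy model, every switch between the two instruction families pays a penalty unless the upper halves were explicitly zeroed first.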

However, I can find no obvious reason or explanation why SSE instructions needed to preserve those upper 128 bits. The corresponding 128-bit VEX instructions (the use of which avoids the performance penalty) work by always clearing the upper 128 bits of the YMM registers instead of preserving them. It seems to me that, when Intel defined the AVX architecture, including the extension of the XMM registers to YMM registers, they could have simply defined that the SSE instructions, too, would clear the upper 128 bits. Obviously, since the YMM registers were new, there could have been no legacy code that would have depended on SSE instructions preserving those bits, and it also appears to me that Intel could have easily seen this coming.
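To make the difference between the two encodings concrete, here is a minimal sketch that models a YMM register as a 256-bit Python integer. The function names `legacy_sse_write` and `vex128_write` are invented for this illustration; they only mimic how the upper half is treated and do not correspond to any real instruction or library.

```python
LOW_MASK = (1 << 128) - 1  # selects bits 0-127 (the XMM part)


def legacy_sse_write(ymm, result128):
    """Legacy SSE encoding: replace bits 0-127, keep bits 128-255."""
    return (ymm & ~LOW_MASK) | (result128 & LOW_MASK)


def vex128_write(ymm, result128):
    """VEX.128 encoding: replace bits 0-127, zero bits 128-255."""
    return result128 & LOW_MASK


ymm0 = (0xAAAA << 128) | 0x1111
after_sse = legacy_sse_write(ymm0, 0x2222)
after_vex = vex128_write(ymm0, 0x2222)
print(hex(after_sse >> 128), hex(after_vex >> 128))  # prints "0xaaaa 0x0"
```

The legacy write depends on the old contents of the register, while the VEX write does not.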

So, what is the reason why Intel defined the SSE instructions to preserve the upper 128 bits of the YMM registers? Is it ever useful?

Recommended answer

To bring the external resource on-site, I've extracted the relevant paragraphs from the link Michael provided in the comments.

All credit goes to him.
The link points to a very similar question that Agner Fog asked on Intel's forum.

[Fog in response to Intel's answer] If I understand you right, you decided that it is necessary to have two versions of all 128-bit instructions in order to avoid destroying the upper part of the YMM registers in case an interrupt calls a device driver using legacy XMM instructions.

Intel was concerned that if legacy SSE instructions zeroed the upper part of the registers, ISRs would suddenly start affecting the new YMM registers.
Without support for saving the new YMM context, this would make the use of AVX impossible under any circumstances.
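The concern can be sketched with a toy model of an AVX-unaware interrupt handler. Everything here is hypothetical: `legacy_isr` and its `zero_upper_on_sse_write` flag are invented for illustration, and the register is again just a 256-bit Python integer. The handler saves and restores only the low 128 bits, as existing drivers would.

```python
LOW_MASK = (1 << 128) - 1  # selects bits 0-127 (the XMM part)


def legacy_isr(ymm, zero_upper_on_sse_write):
    """Hypothetical AVX-unaware ISR that saves/restores only the XMM part."""
    saved_xmm = ymm & LOW_MASK                 # legacy save: low half only
    # The ISR's own SSE work overwrites the low half; what happens to the
    # upper half depends on which semantics the ISA chose.
    upper = 0 if zero_upper_on_sse_write else ymm >> 128
    ymm = (upper << 128) | 0xDEAD
    return ((ymm >> 128) << 128) | saved_xmm   # legacy restore: low half only


app_value = (0xCAFE << 128) | 0xBEEF           # live 256-bit value in user code
preserved = legacy_isr(app_value, zero_upper_on_sse_write=False)
zeroed = legacy_isr(app_value, zero_upper_on_sse_write=True)
print(preserved == app_value, zeroed == app_value)  # prints "True False"
```

With preserving semantics the interrupted code gets its full register back; with zeroing semantics the upper half is silently lost unless the handler is rewritten to save it.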

However, Fog was not completely satisfied and pointed out that simply recompiling a driver with an AVX-aware compiler (so that VEX instructions were used) would result in the same outcome.

Intel replied that their goal was to avoid forcing existing software to be rewritten.

There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider for example the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding overhead to the ISR servicing to add the state management overhead on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors.

By having two versions of the instructions, support for AVX in drivers can be achieved like it has been for FPU/SSE:

The example given is similar to the current scenario where a ring-0 driver (ISR) vendor attempts to use floating-point state, or accidentally links it in some library, in OSs that do not automatically manage that context at Ring-0. This is a well known source of bugs and I can suggest only the following:

  • On those OSs, driver developers are discouraged from using floating-point or AVX

  • Driver developers should be encouraged to disable hardware features during driver validation (i.e. AVX state can be disabled by drivers in Ring-0 through XSETBV())
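As a rough illustration of the second suggestion, the sketch below only computes the value a ring-0 driver would hand to XSETBV; it does not execute the privileged instruction. The bit positions follow the XCR0 layout (bit 0 = x87, bit 1 = SSE, bit 2 = AVX), while the helper names `xcr0_without_avx` and `split_edx_eax` are invented for this example.

```python
XCR0_X87 = 1 << 0   # x87 state, must always stay set
XCR0_SSE = 1 << 1   # XMM state
XCR0_AVX = 1 << 2   # upper halves of the YMM registers


def xcr0_without_avx(xcr0):
    """Return the XCR0 value with the AVX state bit cleared."""
    return xcr0 & ~XCR0_AVX


def split_edx_eax(value):
    """XSETBV takes the 64-bit value in EDX:EAX (ECX = 0 selects XCR0)."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


current = XCR0_X87 | XCR0_SSE | XCR0_AVX   # 0b111, an AVX-enabled value
new = xcr0_without_avx(current)
print(bin(new), split_edx_eax(new))  # prints "0b11 (0, 3)"
```

With the AVX bit cleared, VEX-encoded instructions should fault instead of running, so an accidental use of AVX in the driver would presumably show up during validation rather than corrupting state in the field.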
