Avoiding AVX-SSE (VEX) Transition Penalties


Question

Our 64-bit application has lots of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode.

I would like to implement fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think that this is not practical, since it would increase the size of the code, which can make it run slower because the CPU has to decode larger instructions.

I just want to use two ymm registers (and possibly zmm, since affordable processors supporting zmm are promised to be available this year) for fast memory copy.

The question is: how can I use the ymm registers but avoid the transition penalties?

Will the penalty occur if I use only the ymm8-ymm15 registers (not ymm0-ymm7)? SSE originally had eight 128-bit registers (xmm0-xmm7), but in 64-bit mode xmm8-xmm15 are also available to non-VEX-prefixed instructions. However, I have reviewed our 64-bit application, and it only uses xmm0-xmm7, since it also has a 32-bit version with almost the same code. Does the penalty occur only when the CPU actually tries to use an xmm register that was previously used as ymm and has any of its upper 128 bits non-zero? Wouldn't it be better to just zero the ymm registers that I have used after the fast memory copy? For example, if I have used a ymm register once to copy 32 bytes of memory, what is the fastest way to zero it? Is "vpxor ymm15, ymm15, ymm15" fast enough? (AFAIK, vpxor can be executed on any of the 3 ALU execution ports, p0/p1/p5, while vxorpd can only be executed on p5.) Wouldn't the time needed to zero it exceed the gain from using it to copy just 32 bytes of memory?

Recommended answer

Another possibility is to use registers zmm16-zmm31. These registers have no non-VEX counterpart. There is no state transition and no penalty for mixing zmm16-zmm31 with non-VEX SSE code. These 512-bit registers are only available in 64-bit mode and only on processors with AVX-512.
