Using AVX instructions disables exp() optimization?


Question

I am writing a feed-forward net in VC++ using AVX intrinsics, and I invoke this code via PInvoke from C#. A function that runs a large loop calling exp() takes about 1000 ms for a loop size of 160M. As soon as I call any function that uses AVX intrinsics and then call exp() again, the same operation takes about 8000 ms. Note that the function calculating exp() is standard C, and the call that uses the AVX intrinsics can be completely unrelated in terms of the data being processed. Some kind of flag seems to be getting tripped somewhere at runtime.

In other words:

A(); // 1000ms calculates 160M exp() 
B(); // completely unrelated but contains AVX
A(); // 8000ms

Or, oddly:

C(); // contains 128 bit SSE SIMD expressions
A(); // 1000ms

I am lost as to what mechanism could be at work here, or how to pursue a solution. I'm on an Intel 2500K CPU with Windows 7, using the Express versions of Visual Studio.

Thanks.

Recommended answer

If you use any 256-bit AVX instruction, the "AVX upper state" becomes "dirty", which causes a large stall if you subsequently use SSE instructions (including scalar floating-point operations performed in the xmm registers). This is documented in the Intel Optimization Manual, which you can download for free (and which is a must-read if you're doing this sort of work):

AVX instructions always modify the upper bits of YMM registers, and SSE instructions do not modify the upper bits. From a hardware perspective, the upper bits of the YMM register collection can be considered to be in one of three states:

• Clean: All upper bits of YMM are zero. This is the state when the processor starts from RESET.

• Modified and saved to XSAVE region: The content of the upper bits of the YMM registers matches the saved data in the XSAVE region. This happens after XSAVE/XRSTOR executes.

• Modified and Unsaved: The execution of one AVX instruction (either 256-bit or 128-bit) modifies the upper bits of the destination YMM.

The AVX/SSE transition penalty applies whenever the processor state is "Modified and Unsaved". Using VZEROUPPER moves the processor state to "Clean" and avoids the transition penalty.

Your routine B() dirties the YMM state, so the SSE code in A() stalls. Insert a VZEROUPPER instruction between B and A to avoid the problem.
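A minimal sketch of that fix, written with intrinsics rather than inline assembly. The names `scale_avx` (standing in for B) and `sum_exp` (standing in for A) are hypothetical, as is the `USE_AVX_PATH` macro, which is only there so the sketch still builds where AVX is not enabled. `_mm256_zeroupper()` is the intrinsic that emits VZEROUPPER. The MSVC branch of the guard assumes the code runs on an AVX-capable CPU such as the 2500K.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumption: take the AVX path when the compiler targets AVX, or on
// MSVC x86/x64, where AVX intrinsics are usable without /arch:AVX.
#if defined(__AVX__) || (defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86)))
#define USE_AVX_PATH 1
#include <immintrin.h>  // AVX intrinsics, including _mm256_zeroupper()
#else
#define USE_AVX_PATH 0
#endif

// Hypothetical stand-in for B(): scales data in place with 256-bit AVX.
void scale_avx(std::vector<float>& data, float factor) {
    assert(std::isfinite(factor) && "factor must be finite");
    std::size_t i = 0;
#if USE_AVX_PATH
    const __m256 f = _mm256_set1_ps(factor);
    for (; i + 8 <= data.size(); i += 8) {
        __m256 v = _mm256_loadu_ps(data.data() + i);
        _mm256_storeu_ps(data.data() + i, _mm256_mul_ps(v, f));
    }
    // Emit VZEROUPPER: zero the upper halves of all YMM registers so the
    // state is "Clean" before any legacy SSE code runs.
    _mm256_zeroupper();
#endif
    // Scalar tail (and full fallback when the AVX path is disabled).
    for (; i < data.size(); ++i) data[i] *= factor;
}

// Hypothetical stand-in for A(): plain scalar code that calls exp().
double sum_exp(const std::vector<float>& data) {
    double total = 0.0;
    for (float x : data) total += std::exp(static_cast<double>(x));
    return total;
}
```

Calling `scale_avx(v, 2.0f)` and then `sum_exp(v)` mirrors the B-then-A sequence from the question. Placing `_mm256_zeroupper()` at the end of every routine that uses 256-bit instructions means callers do not have to remember it.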
