How can I get an intrinsic for the exp() function in x64 code?
Question
I have the following code and am expecting the intrinsic version of the exp() function to be used. Unfortunately, it is not used in an x64 build, making it slower than a similar Win32 (i.e., 32-bit) build:
#include "stdafx.h"
#include <cmath>
#include <intrin.h>
#include <iostream>
int main()
{
    const int NUM_ITERATIONS=10000000;
    double expNum=0.00001;
    double result=0.0;
    for (double i=0;i<NUM_ITERATIONS;++i)
    {
        result+=exp(expNum); // <-- The code of interest is here
        expNum+=0.00001;
    }
    // To prevent the above from getting optimized out...
    std::cout << result << '\n';
}
I am using the following switches for my build:
/Zi /nologo /W3 /WX-
/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG"
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm-
/EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope
/Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\"
/Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue
As you can see, I do have /Oi, /O2 and /fp:fast, as required by the MSDN article on intrinsics. Yet despite my efforts, a call to the standard library is made, making exp() slower on x64 builds.
Here is the generated assembly:
for (double i=0;i<NUM_ITERATIONS;++i)
000000013F911030 movsd xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]
000000013F911039 movapd xmm8,xmm6
000000013F91103E movapd xmm7,xmm9
000000013F911043 movaps xmmword ptr [rsp+20h],xmm11
000000013F911049 movsd xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]
{
result+=exp(expNum);
000000013F911052 movapd xmm0,xmm7
000000013F911056 call exp (13F911A98h) // ***** exp lib call is here *****
000000013F91105B addsd xmm8,xmm10
expNum+=0.00001;
000000013F911060 addsd xmm7,xmm9
000000013F911065 comisd xmm8,xmm11
000000013F91106A addsd xmm6,xmm0
000000013F91106E jb main+52h (13F911052h)
}
As you can see in the assembly above, there is a call out to the exp() function. Now, let's look at the code generated for that for loop with a 32-bit build:
for (double i=0;i<NUM_ITERATIONS;++i)
00101031 xorps xmm1,xmm1
00101034 rdtsc
00101036 push ebx
00101037 push esi
00101038 movsd mmword ptr [esp+1Ch],xmm0
0010103E movsd xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]
00101046 push edi
00101047 mov ebx,eax
00101049 mov dword ptr [esp+3Ch],edx
0010104D movsd mmword ptr [esp+28h],xmm0
00101053 movsd mmword ptr [esp+30h],xmm1
00101059 lea esp,[esp]
{
result+=exp(expNum);
00101060 call __libm_sse2_exp (101EC0h) // <--- Quite different from 64-bit
00101065 addsd xmm0,mmword ptr [esp+20h]
0010106B movsd xmm1,mmword ptr [esp+30h]
00101071 addsd xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]
00101079 movsd xmm2,mmword ptr [__real@416312d000000000 (102178h)]
00101081 comisd xmm2,xmm1
00101085 movsd mmword ptr [esp+20h],xmm0
expNum+=0.00001;
0010108B movsd xmm0,mmword ptr [esp+28h]
00101091 addsd xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]
00101099 movsd mmword ptr [esp+28h],xmm0
0010109F movsd mmword ptr [esp+30h],xmm1
001010A5 ja wmain+40h (101060h)
}
Much more code there, yet it's faster. A timing test I did on a 3.3 GHz Nehalem-EP host produced the following results:
32-bit:
For loop body average exec time: 34.849229 cycles / 10.560373 ns
64-bit:
For loop body average exec time: 45.845323 cycles / 13.892522 ns
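For anyone who wants to reproduce a per-iteration figure like the ones above, here is a minimal sketch of one way to measure it. It is not the harness used for the numbers in this post (the 32-bit listing suggests rdtsc was used there); it assumes std::chrono::steady_clock is precise enough for a rough average, and the name average_exp_ns is hypothetical.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper (not from the original post): returns the average
// wall-clock time in nanoseconds of one loop iteration of the benchmark.
// The accumulated sum is written to *sink so the loop is not optimized away.
double average_exp_ns(int iterations, double* sink)
{
    double expNum = 0.00001;
    double result = 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        result += std::exp(expNum);  // the call being timed
        expNum += 0.00001;
    }
    const auto stop = std::chrono::steady_clock::now();

    *sink = result;

    // Convert the elapsed time to nanoseconds as a double, then average.
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Building the same file once as x86 and once as x64 and calling average_exp_ns(10000000, &sink) in each should show whether the gap is reproducible on a given machine; the result includes loop overhead, so it is only useful as a relative comparison.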
Very odd behavior, indeed. Why is it happening?
Update:
I have created a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating point intrinsics, especially in x64 code.
Recommended answer
On x64, floating point arithmetic is performed using SSE. SSE has no built-in operation for exp(), so a call to the standard library is inevitable unless you write your own inline, manually vectorized __m128d exp(__m128d) (see "Fastest Implementation of Exponential Function Using SSE").
I imagine that the MSDN article you are referring to was written with 32-bit code that uses 8087 FP in mind.