How can I get an intrinsic for the exp() function in x64 code?

Question

I have the following code and expect the intrinsic version of the exp() function to be used. Unfortunately, it is not used in an x64 build, which makes the code slower than the equivalent Win32 (i.e., 32-bit) build:

#include "stdafx.h"
#include <cmath>
#include <intrin.h>
#include <iostream>

int main()
{
  const int NUM_ITERATIONS=10000000;
  double expNum=0.00001;
  double result=0.0;

  for (double i=0;i<NUM_ITERATIONS;++i)
  {
    result+=exp(expNum); // <-- The code of interest is here
    expNum+=0.00001;
  }

  // To prevent the above from getting optimized out...
  std::cout << result << '\n';
}

I am using the following switches for my build:

/Zi /nologo /W3 /WX-
/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" 
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- 
/EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope 
/Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\" 
/Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue 

As you can see, I do have /Oi, /O2 and /fp:fast, as required by the MSDN article on intrinsics. Yet, despite my efforts, a call to the standard library is made, which makes exp() slower in x64 builds.

Here is the generated assembly:

  for (double i=0;i<NUM_ITERATIONS;++i)
000000013F911030  movsd      xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]  
000000013F911039  movapd     xmm8,xmm6  
000000013F91103E  movapd     xmm7,xmm9  
000000013F911043  movaps     xmmword ptr [rsp+20h],xmm11  
000000013F911049  movsd      xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]  
  {
    result+=exp(expNum);
000000013F911052  movapd     xmm0,xmm7  
000000013F911056  call       exp (13F911A98h) // ***** exp lib call is here *****
000000013F91105B  addsd      xmm8,xmm10  
    expNum+=0.00001;
000000013F911060  addsd      xmm7,xmm9  
000000013F911065  comisd     xmm8,xmm11  
000000013F91106A  addsd      xmm6,xmm0  
000000013F91106E  jb         main+52h (13F911052h)  
  }

As you can see in the assembly above, there is a call out to the exp() function. Now, let's look at the code generated for that for loop with a 32-bit build:

  for (double i=0;i<NUM_ITERATIONS;++i)
00101031  xorps       xmm1,xmm1  
00101034  rdtsc  
00101036  push        ebx  
00101037  push        esi  
00101038  movsd       mmword ptr [esp+1Ch],xmm0  
0010103E  movsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101046  push        edi  
00101047  mov         ebx,eax  
00101049  mov         dword ptr [esp+3Ch],edx  
0010104D  movsd       mmword ptr [esp+28h],xmm0  
00101053  movsd       mmword ptr [esp+30h],xmm1  
00101059  lea         esp,[esp]  
  {
    result+=exp(expNum);
00101060  call        __libm_sse2_exp (101EC0h) // <--- Quite different from 64-bit
00101065  addsd       xmm0,mmword ptr [esp+20h]  
0010106B  movsd       xmm1,mmword ptr [esp+30h]  
00101071  addsd       xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]  
00101079  movsd       xmm2,mmword ptr [__real@416312d000000000 (102178h)]  
00101081  comisd      xmm2,xmm1  
00101085  movsd       mmword ptr [esp+20h],xmm0  
    expNum+=0.00001;
0010108B  movsd       xmm0,mmword ptr [esp+28h]  
00101091  addsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101099  movsd       mmword ptr [esp+28h],xmm0  
0010109F  movsd       mmword ptr [esp+30h],xmm1  
001010A5  ja          wmain+40h (101060h)  
  }

Much more code there, yet it's faster. A timing test I did on a 3.3 GHz Nehalem-EP host produced the following results:

32-bit:

For loop body average exec time: 34.849229 cycles / 10.560373 ns

64-bit:

For loop body average exec time: 45.845323 cycles / 13.892522 ns

Very odd behavior, indeed. Why is it happening?

Update:

I have created a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating point intrinsics, especially in x64 code.

Recommended answer

On x64, floating-point arithmetic is performed using SSE. SSE has no built-in operation for exp(), so a call to the standard library is inevitable unless you write your own inline, manually vectorized __m128d exp(__m128d) (see "Fastest Implementation of Exponential Function Using SSE").

I imagine that the MSDN article you are referring to was written with 32-bit code that uses 8087 (x87) floating point in mind.
