float arithmetic and x86 and x64 context


Problem Description


We are running some code both inside the Visual Studio process context (x86 context) and outside of the Visual Studio context (x64 context). I noticed that the following code produces a different result in each context (100000000000 in x86 and 99999997952 in x64):

float val = 1000f;
val = val * val;
return (ulong)(val * 100000.0f);

We need to obtain a ulong value from a float value in a reliable way, no matter the context and no matter the resulting ulong value; it is just for hashing purposes. I tested the following code in both the x64 and x86 contexts and indeed obtained the same result, so it looks reliable:

float operandFloat = (float)obj;
byte[] bytes = BitConverter.GetBytes(operandFloat);
Debug.Assert(bytes.Length == 4);
uint @uint = BitConverter.ToUInt32(bytes, 0);
return (ulong)@uint;

Is this code reliable?

Solution

As others have speculated in the comments, the difference you're observing is the result of differential precision when doing floating-point arithmetic, arising out of a difference between how the 32-bit and 64-bit builds perform these operations.

Your code is translated by the 32-bit (x86) JIT compiler into the following object code:

fld   qword ptr ds:[0E63308h]  ; Load constant 1.0e+11 onto top of FPU stack.
sub   esp, 8                   ; Allocate 8 bytes of stack space.
fstp  qword ptr [esp]          ; Pop top of FPU stack, putting 1.0e+11 into
                               ;  the allocated stack space at [esp].
call  73792C70                 ; Call internal helper method that converts the
                               ;  double-precision floating-point value stored at [esp]
                               ;  into a 64-bit integer, and returns it in edx:eax.
                               ; At this point, edx:eax == 100000000000.

Notice that the optimizer has folded your arithmetic computation ((1000f * 1000f) * 100000f) to the constant 1.0e+11. It has stored this constant in the binary's data segment, and loads it onto the top of the x87 floating-point stack (the fld instruction). The code then allocates 8 bytes of stack space (enough for a 64-bit double-precision floating-point value) by subtracting the stack pointer (esp). The fstp instruction pops the value off the top of the x87 floating-point stack, and stores it in its memory operand. In this case, it stores it into the 8 bytes that we just allocated on the stack. All of this shuffling is rather pointless: it could have just loaded the floating-point constant 1.0e+11 directly into memory, by-passing the trip through the x87 FPU, but the JIT optimizer isn't perfect. Finally, the JIT emitted code to call an internal helper function that converts the double-precision floating-point value stored in memory (1.0e+11) into a 64-bit integer. The 64-bit integer result is returned in the register pair edx:eax, as is customary for 32-bit Windows calling conventions. When this code completes, edx:eax contains the 64-bit integer value 100000000000, or 1.0e+11, exactly as you would expect.

(Hopefully the terminology here is not too confusing. Note that there are two different "stacks". The x87 FPU has a series of registers, which are accessed like a stack. I refer to this as the FPU stack. Then, there is the stack with which you are probably familiar, the one stored in main memory and accessed via the stack pointer, esp.)


However, things are done a bit differently by the 64-bit (x86-64) JIT compiler. The big difference here is that 64-bit targets always use SSE2 instructions for floating-point operations, since all chips that support AMD64 also support SSE2, and SSE2 is more efficient and more flexible than the old x87 FPU. Specifically, the 64-bit JIT translates your code into the following:

movsd  xmm0, mmword ptr [7FFF7B1A44D8h]  ; Load constant into XMM0 register.
call   00007FFFDAC253B0                  ; Call internal helper method that converts the
                                         ;  floating-point value in XMM0 into a 64-bit int
                                         ;  that is returned in RAX.

Things immediately go wrong here, because the constant value being loaded by the first instruction is 0x42374876E0000000, which is the binary floating-point representation of 99999997952.0. The problem is not the helper function that is doing the conversion to a 64-bit integer. Instead, it is the JIT compiler itself, specifically the optimizer routine that is pre-computing the constant.

To gain some insight into how that goes wrong, we'll turn off JIT optimization and see what the code looks like:

movss    xmm0, dword ptr [7FFF7B1A4500h]  
movss    dword ptr [rbp-4], xmm0  
movss    xmm0, dword ptr [rbp-4]  
movss    xmm1, dword ptr [rbp-4]  
mulss    xmm0, xmm1  
mulss    xmm0, dword ptr [7FFF7B1A4504h]  
cvtss2sd xmm0, xmm0  
call     00007FFFDAC253B0 

The first movss instruction loads a single-precision floating-point constant from memory into the xmm0 register. This time, however, that constant is 0x447A0000, which is the precise binary representation of 1000—the initial float value from your code.

The second movss instruction turns right around and stores this value from the xmm0 register into memory, and the third movss instruction re-loads the just-stored value from memory back into the xmm0 register. (Told you this was unoptimized code!) It also loads a second copy of that same value from memory into the xmm1 register, and then multiplies (mulss) the two single-precision values in xmm0 and xmm1 together. This is the literal translation of your val = val * val code. The result of this operation (which ends up in xmm0) is 0x49742400, or 1.0e+6, precisely as you would expect.

The second mulss instruction performs the val * 100000.0f operation. It implicitly loads the single-precision floating-point constant 1.0e+5 and multiplies it with the value in xmm0 (which, recall, is 1.0e+6). Unfortunately, the result of this operation is not what you would expect. Instead of 1.0e+11, it is actually 9.9999998e+10. Why? Because 1.0e+11 cannot be precisely represented as a single-precision floating-point value. The closest representation is 0x51BA43B7, or 9.9999998e+10.

Finally, the cvtss2sd instruction performs an in-place conversion of the (wrong!) scalar single-precision floating-point value in xmm0 to a scalar double-precision floating-point value. In a comment to the question, Neitsa suggested that this might be the source of the problem. In fact, as we have seen, the source of the problem is the previous instruction, the one that does the multiplication. The cvtss2sd just converts an already imprecise single-precision floating-point representation (0x51BA43B7) to an imprecise double-precision floating point representation: 0x42374876E0000000, or 99999997952.0.

And this is precisely the series of operations performed by the JIT compiler to produce the initial double-precision floating-point constant that is loaded into the xmm0 register in the optimized code.

Although I have been implying throughout this answer that the JIT compiler is to blame, that is not the case at all! If you had compiled the identical code in C or C++ while targeting the SSE2 instruction set, you would have gotten exactly the same imprecise result: 99999997952.0. The JIT compiler is performing just as one would expect it to—if, that is, one's expectations are correctly calibrated to the imprecision of floating-point operations!


So, what is the moral of this story? There are two of them. First, floating-point operations are tricky and there is a lot to know about them. Second, in light of this, always use the most precision that you have available when doing floating-point arithmetic!

The 32-bit code is producing the correct result because it is operating with double-precision floating-point values. With 64 bits to play with, a precise representation of 1.0e+11 is possible.

The 64-bit code is producing the incorrect result because it is using single-precision floating-point values. With only 32 bits to play with, a precise representation of 1.0e+11 is not possible.

You would not have had this problem if you had used the double type to begin with:

double val = 1000.0;
val = val * val;
return (ulong)(val * 100000.0);

This ensures the correct result on all architectures, with no need for ugly, non-portable bit-manipulation hacks like those suggested in the question. (Which still cannot ensure the correct result, since it doesn't solve the root of the problem, namely that your desired result cannot be directly represented in a 32-bit single-precision float.)

Even if you have to take input as a single-precision float, convert it immediately into a double, and do all of your subsequent arithmetic manipulations in the double-precision space. That would still have solved this problem, since the initial value of 1000 can be precisely represented as a float.
