OpenCL floating-point precision


Problem description


I found a problem with the host/device floating-point behavior in OpenCL. When I compile for x86, the floating-point values computed by OpenCL are not within the same floating-point limits as those produced by my Visual Studio 2010 compiler. When I compile for x64, however, they are within the same limits. I know it has to be something related to http://www.viva64.com/en/b/0074/

The source I used during testing was: http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism

When I ran the program as an x86 build, only 202 of the results were equal when the kernel and the C++ program each squared 1269760 numbers. In the 64-bit build, however, all 1269760 results matched, in other words 100%. Furthermore, I found that the error between the OpenCL result and the x86 C++ result was 5.5385384e-014, which is a very small fraction but not small enough compared to the epsilon of the number, which was 2.92212543378266922312416e-19.

That matters because the error needs to be smaller than the epsilon for the program to recognize the two numbers as equal. Of course, normally one would never compare floats directly for equality, but it is good to know that the float limits are different. And yes, I tried to set flt:static, but got the same error.

So I would like some sort of explanation for this behavior. Thanks in advance for all answers.
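On the remark that floats are normally not compared directly: a tolerance-based comparison could look roughly like the sketch below. It is only an illustration, not code from the linked article. The names nearly_equal, abs_f and rel_tol are made up for this example, and the choice of a relative tolerance is an assumption about what suits the data.

```c
#include <assert.h>
#include <float.h>

/* Absolute value without <math.h>, so no extra library needs linking. */
static float abs_f(float x)
{
    return x < 0.0f ? -x : x;
}

/* Returns 1 when a and b differ by at most rel_tol times the larger
   magnitude of the two, 0 otherwise. nearly_equal and rel_tol are
   hypothetical names chosen for this sketch. */
int nearly_equal(float a, float b, float rel_tol)
{
    float diff, scale;

    assert(rel_tol >= 0.0f);
    diff = abs_f(a - b);
    scale = abs_f(a) > abs_f(b) ? abs_f(a) : abs_f(b);
    return diff <= rel_tol * scale;
}
```

A check such as nearly_equal(results[i], data[i] * data[i], 4 * FLT_EPSILON) would then accept values that differ only in their last few bits. Which tolerance is appropriate depends on the computation.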

Solution

Since nothing changes in the GPU code when you switch your project from x86 to x64, it all comes down to how the multiplication is performed on the CPU. There are some subtle differences in how floating-point numbers are handled in x86 and x64 modes. The biggest one is that every x64 CPU also supports SSE and SSE2, so those instructions are used by default for math operations in 64-bit mode on Windows.

The HD4770 GPU does all computations using single-precision floating-point units. Modern x64 CPUs, on the other hand, have two kinds of functional units that handle floating-point numbers:

• the x87 FPU, which operates with the much higher extended precision of 80 bits
• the SSE FPU, which operates with 32-bit and 64-bit precision and is much more compatible with how other CPUs handle floating-point numbers

In 32-bit mode the compiler does not assume that SSE is available and generates the usual x87 FPU code to do the math. In this case operations like data[i] * data[i] are performed internally using the much higher 80-bit precision. A comparison of the kind if (results[i] == data[i] * data[i]) is performed as follows:

• data[i] is pushed onto the x87 FPU stack using FLD DWORD PTR data[i]
• data[i] * data[i] is computed using FMUL DWORD PTR data[i]
• results[i] is pushed onto the x87 FPU stack using FLD DWORD PTR results[i]
• both values are compared using FUCOMPP

Here comes the problem. data[i] * data[i] resides in an x87 FPU stack element in 80-bit precision, while results[i] comes from the GPU in 32-bit precision. The two numbers will most likely differ, since data[i] * data[i] has many more significant digits whereas results[i], once extended to 80 bits, ends in a long run of zeros.
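The mismatch described above can be reproduced on any platform with a small sketch like the following. It uses double as a stand-in for the x87 extended format, which is an assumption made for portability (the effect is the same in kind: the wider product keeps bits that a float cannot hold). The function name wide_square_equals_float_square is made up for this illustration.

```c
/* Returns 1 if the square of f computed in a wider format equals the
   square rounded to float, 0 otherwise. double stands in for the
   80-bit x87 format here (an assumption made for portability). */
int wide_square_equals_float_square(float f)
{
    /* For finite f this product is exact: two 24-bit significands
       give at most 48 bits, which fit in the 53 bits of a double. */
    double wide = (double)f * (double)f;

    /* The volatile store forces the product to be rounded to float. */
    volatile float narrow = f * f;

    return wide == (double)narrow;
}
```

For a value like 0.5f the square is exactly representable as a float, so both results agree. For a value like 0.1f the exact product needs more than 24 significant bits, so the rounded float differs from the wide result.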

In 64-bit mode things happen differently. The compiler knows that your CPU is SSE capable and uses SSE instructions to do the math. The same comparison statement is performed in the following way on x64:

• data[i] is loaded into an SSE register using MOVSS XMM0, DWORD PTR data[i]
• data[i] * data[i] is computed using MULSS XMM0, DWORD PTR data[i]
• results[i] is loaded into another SSE register using MOVSS XMM1, DWORD PTR results[i]
• both values are compared using UCOMISS XMM1, XMM0

In this case the square operation is performed with the same 32-bit single precision that is used on the GPU. No intermediate results with 80-bit precision are generated. That is why the results are the same.
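One way to ask a compiler which of the two behaviors it uses is the C99 macro FLT_EVAL_METHOD from <float.h>. The sketch below is only an illustration and assumes a compiler that provides this macro. Older compilers such as VS2010 may not define it, which is why the code guards it with #ifdef. The function name float_eval_description is made up for this example.

```c
#include <float.h>

/* Describes how the compiler evaluates floating-point expressions,
   based on the C99 macro FLT_EVAL_METHOD when it is available. */
const char *float_eval_description(void)
{
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
    case 0:
        return "operations are evaluated in the precision of their own type";
    case 1:
        return "float and double operations are evaluated as double";
    case 2:
        return "all operations are evaluated as long double";
    default:
        return "evaluation method is indeterminable or implementation-defined";
    }
#else
    return "FLT_EVAL_METHOD is not defined by this compiler";
#endif
}
```

A value of 0 corresponds to the SSE-style behavior described above, while 2 is typical of compilers that target the x87 FPU.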

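It is easy to test this even without a GPU being involved. Just run the following simple program:

    #include <stdlib.h>
    #include <stdio.h>

    /* Squares f in place; assigning to a float variable is meant to
       round the product to 32-bit precision. */
    float mysqr(float f)
    {
        f *= f;
        return f;
    }

    int main(void)
    {
        int i, n;
        float f;

        srand(1);
        for (i = n = 0; n < 1000000; n++)
        {
            f = rand() / (float)RAND_MAX;
            /* f*f may be kept as a wider x87 intermediate in 32-bit builds */
            if (mysqr(f) != f*f) i++;
        }
        /* pass both counters: the format string has two %d conversions */
        printf("%d of %d squares differ\n", i, n);
        return 0;
    }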
      

mysqr is written specifically so that the intermediate 80-bit result gets converted to a 32-bit precision float. If you compile and run in 64-bit mode, the output is:

    0 of 1000000 squares differ

If you compile and run in 32-bit mode, the output is:

    999845 of 1000000 squares differ
      

In principle you should be able to change the floating-point model in 32-bit mode (Project Properties -> Configuration Properties -> C/C++ -> Code Generation -> Floating Point Model), but doing so changes nothing, since at least in VS2010 intermediate results are still kept in the FPU. What you can do is force a store and reload of the computed square so that it is rounded to 32-bit precision before it is compared with the result from the GPU. In the simple example above this is achieved by changing:

    if (mysqr(f) != f*f) i++;

to

    if (mysqr(f) != (float)(f*f)) i++;

After the change, the 32-bit code output becomes:

    0 of 1000000 squares differ
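The store and reload can also be made explicit with a volatile float instead of a cast. This is a swapped-in variant of the fix above, not what the answer itself uses. It relies on the fact that a volatile object must actually be written to and read back from memory, so the value is rounded to float. The function name count_square_mismatches is made up for this sketch.

```c
#include <stdlib.h>

/* Same squaring helper as in the test program above. */
static float mysqr(float f)
{
    f *= f;
    return f;
}

/* Counts how many of n pseudo-random floats still show a mismatch once
   both squares have been forced through a 32-bit float in memory. */
int count_square_mismatches(int n)
{
    int i, mismatches = 0;

    srand(1);
    for (i = 0; i < n; i++) {
        float f = rand() / (float)RAND_MAX;

        /* Each volatile store rounds any wider intermediate to float. */
        volatile float reference = mysqr(f);
        volatile float stored = f * f;

        if (reference != stored) mismatches++;
    }
    return mismatches;
}
```

Because both values have been rounded to 32-bit precision before the comparison, the count should be 0 regardless of whether the compiler uses x87 or SSE instructions.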
      
