Dot product of two single-precision floating point vectors yields different results in CUDA kernel than on the host


Problem description

While debugging some CUDA code, I was comparing it to equivalent CPU code using printf statements, and I noticed that in some cases my results differed; they weren't necessarily wrong on either platform, since they were within floating-point rounding error, but I am still interested in knowing what gives rise to this difference.

I was able to track the problem down to differing dot product results. In both the CUDA and host code I have vectors a and b of type float4. Then, on each platform, I compute the dot product and print the result, using this code:

printf("a: %.24f\t%.24f\t%.24f\t%.24f\n",a.x,a.y,a.z,a.w);
printf("b: %.24f\t%.24f\t%.24f\t%.24f\n",b.x,b.y,b.z,b.w);
float dot_product = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
printf("a dot b: %.24f\n",dot_product);

and the resulting printout for the CPU is:

a: 0.999629139900207519531250   -0.024383276700973510742188 -0.012127066962420940399170 0.013238593004643917083740
b: -0.001840781536884605884552  0.033134069293737411499023  0.988499701023101806640625  1.000000000000000000000000
a dot b: -0.001397025771439075469971

and for the CUDA kernel:

a: 0.999629139900207519531250   -0.024383276700973510742188 -0.012127066962420940399170 0.013238593004643917083740
b: -0.001840781536884605884552  0.033134069293737411499023  0.988499701023101806640625  1.000000000000000000000000
a dot b: -0.001397024840116500854492

As you can see, the values for a and b seem to be bitwise equivalent on both platforms, but the result of the exact same code differs ever so slightly. It is my understanding that floating point multiplication is well-defined as per the IEEE 754 Standard and is hardware-independent. However, I do have two hypotheses as to why I am not seeing the same results:

  1. The compiler optimization is re-ordering the multiplications, and they happen in a different order on the GPU and the CPU, giving rise to different results.
  2. The CUDA kernel is using the fused multiply-add (FMA) operator, as described in http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf; in this case, the CUDA results should actually be a bit more accurate (see the sketch after this list).
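
A minimal host-side sketch, assuming a C99 compiler with fmaf() available and floating-point contraction disabled for the plain expression (e.g. gcc -ffp-contract=off, linked with -lm), makes hypothesis 2 easy to check: it evaluates the dot product once with every multiply and add rounded separately, and once with nested fmaf() calls arranged the way the compiler would typically contract the expression:

    /* sketch: separately rounded dot product vs. an fmaf()-contracted one;
       the values are the ones printed above */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float ax =  0.999629139900207519531250f, ay = -0.024383276700973510742188f;
        float az = -0.012127066962420940399170f, aw =  0.013238593004643917083740f;
        float bx = -0.001840781536884605884552f, by =  0.033134069293737411499023f;
        float bz =  0.988499701023101806640625f, bw =  1.000000000000000000000000f;

        /* every product and every sum rounded to single precision */
        float plain = ax*bx + ay*by + az*bz + aw*bw;

        /* one possible contraction: FMUL for the first product, then a
           chain of FMAs, each rounding only once */
        float fused = fmaf(aw, bw, fmaf(az, bz, fmaf(ay, by, ax*bx)));

        printf("separately rounded: %.24f\n", plain);
        printf("with FMA:           %.24f\n", fused);
        return 0;
    }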

Solution

Except for merging FMUL and FADD into FMA (which can be turned off with the nvcc command line switch -fmad=false), the CUDA compiler observes the evaluation order prescribed by C/C++. Depending on how your CPU code is compiled, it may use a wider precision than single precision to accumulate the dot product, which then yields a different result.

For GPU code, merging of FMUL/FADD into FMA is a common occurrence, and so are the resulting numerical differences. The CUDA compiler performs aggressive FMA merging for performance reasons. Use of FMA usually also produces more accurate results, since the number of rounding steps is reduced, and there is some protection against subtractive cancellation because FMA maintains the full-width product internally. I would suggest reading the following whitepaper, as well as the references it cites:

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
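
As a minimal illustration of that last point, a sketch assuming C99 fmaf() (and that the first product is not itself contracted away): because the product inside an FMA is carried at full width before the single rounding, one fmaf() call can recover the exact rounding error of a float multiplication:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float a =  0.999629139900207519531250f;
        float b = -0.001840781536884605884552f;

        float p   = a * b;           /* product rounded to single precision */
        float err = fmaf(a, b, -p);  /* exactly a*b - p: the rounding error */

        printf("rounded product: %.24f\n", p);
        printf("rounding error:  %.24e\n", err);
        return 0;
    }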

To get the CPU and GPU results to match for a sanity check, you would want to turn off FMA-merging in the GPU code with -fmad=false, and on the CPU enforce that each intermediate result is stored in single precision:

   /* 'volatile' forces every intermediate value to be stored to memory and
      rounded to single precision, which blocks both wider accumulation and
      FMA contraction on the host */
   volatile float p0, p1, p2, p3, dot_product;
   p0 = a.x * b.x;
   p1 = a.y * b.y;
   p2 = a.z * b.z;
   p3 = a.w * b.w;
   dot_product  = p0;
   dot_product += p1;
   dot_product += p2;
   dot_product += p3;
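
Depending on the host toolchain, two related flags may also matter here (assuming GCC on x86): -ffp-contract=off keeps the host compiler from contracting the multiplies and adds into FMA instructions itself, and on 32-bit builds -mfpmath=sse avoids accumulating intermediates at the x87 unit's wider precision, which is one source of the wider-than-single accumulation mentioned above.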
