Precision in Sum reduction kernel with floats
Problem Description
I am creating a routine that calls Nvidia's Sum Reduction kernel (reduce6), but when I compare the results between the CPU and the GPU I get an error that grows as the vector size increases:
Both the CPU and GPU reductions use floats.
Size: 1024 (Blocks : 1, Threads : 512)
Reduction on CPU: 508.1255188
Reduction on GPU: 508.1254883
Error: 6.0059137e-06
Size: 16384 (Blocks : 8, Threads : 1024)
Reduction on CPU: 4971.3193359
Reduction on GPU: 4971.3217773
Error: 4.9109825e-05
Size: 131072 (Blocks : 64, Threads : 1024)
Reduction on CPU: 49986.6718750
Reduction on GPU: 49986.8203125
Error: 2.9695415e-04
Size: 1048576 (Blocks : 512, Threads : 1024)
Reduction on CPU: 500003.7500000
Reduction on GPU: 500006.8125000
Error: 6.1249541e-04
Any idea about this error? Thanks.
Recommended Answer
Floating point addition is not necessarily associative.
This means that when you change the order of operations in a floating-point summation, you may get different results. Parallelizing a summation, by definition, changes the order of operations of the summation.
There are many ways to sum floating-point numbers, and each has accuracy benefits for different input distributions. Here's a decent survey.
Sequential summation in the given order is rarely the most accurate way to sum, so if that is what you are comparing against, don't expect it to match the tree-based summation used in a typical parallel reduction.