Precision in Sum reduction kernel with floats

Problem Description

I am creating a routine that calls Nvidia's Sum Reduction kernel (reduce6), but when I compare the results between the CPU and GPU, I get an error that increases as the vector size increases:

Both CPU and GPU reductions are floats.

Size: 1024  (Blocks : 1,  Threads : 512)
Reduction on CPU:  508.1255188 
Reduction on GPU:  508.1254883 
Error:  6.0059137e-06

Size: 16384 (Blocks : 8, Threads : 1024)
Reduction on CPU:  4971.3193359 
Reduction on GPU:  4971.3217773 
Error:  4.9109825e-05

Size: 131072 (Blocks : 64, Threads : 1024)
Reduction on CPU:  49986.6718750 
Reduction on GPU:  49986.8203125 
Error:  2.9695415e-04

Size: 1048576 (Blocks : 512, Threads : 1024)
Reduction on CPU:  500003.7500000 
Reduction on GPU:  500006.8125000 
Error:  6.1249541e-04

Any idea about this error? Thanks.

Recommended Answer

Floating point addition is not necessarily associative.

This means that when you change the order of operations of your floating-point summation, you may get different results. Parallelizing a summation by definition changes the order of operations of the summation.
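
For instance, a minimal C sketch (the values are hypothetical, chosen so the rounding is visible; they are not from the question) shows two groupings of the same three floats producing different results:

#include <stdio.h>

int main(void) {
    /* 1e8f is exactly representable as a float, but at that magnitude
       the spacing between adjacent floats is 8, so adding 1.0f to it
       has no effect. */
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;   /* (a + b) is exactly 0, then + 1 -> 1.0 */
    float right = a + (b + c);   /* (b + c) rounds back to -1e8   -> 0.0 */
    printf("(a+b)+c = %g\n", left);
    printf("a+(b+c) = %g\n", right);
    return 0;
}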

There are many ways to sum floating-point numbers, and each has accuracy benefits for different input distributions. Here's a decent survey.
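
As a sketch of one such method (chosen here for illustration; the survey covers several), Kahan compensated summation carries a running correction term that feeds the low-order bits lost in each addition back into the next one:

#include <stddef.h>

/* Kahan (compensated) summation. Note that aggressive compiler flags
   such as -ffast-math can legally simplify the compensation away. */
float kahan_sum(const float *x, size_t n) {
    float sum = 0.0f;
    float c = 0.0f;                /* running compensation for lost bits */
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;        /* apply the previous correction */
        float t = sum + y;         /* low-order bits of y are lost here */
        c = (t - sum) - y;         /* algebraically zero; numerically the lost bits */
        sum = t;
    }
    return sum;
}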

Sequential summation in the given order is rarely the most accurate way to sum, so if that is what you are comparing against, don't expect it to compare well to the tree-based summation used in a typical parallel reduction.
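
For reference, here is a host-side sketch (a hypothetical helper, not NVIDIA's reduce6 kernel) of the pairwise order a tree reduction effectively uses. Its worst-case rounding error grows roughly with log n rather than n, so for large inputs it is often closer to the exact sum than the sequential loop it is being compared against:

#include <stddef.h>

/* Pairwise (tree) summation: recursively sum each half, then combine.
   This mirrors the combining order of a parallel tree reduction. */
float pairwise_sum(const float *x, size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return x[0];
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}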
