Platform-independent way to reduce the precision of floating-point constant values


Question

Use case:

I have some large data arrays containing floating-point constants. The file defining these arrays is generated, and the template can easily be adapted.

I would like to test how reduced precision influences the results, both in terms of quality and in terms of the compressibility of the binary.

Since I do not want to change any source code other than the generated file, I am looking for a way to reduce the precision of the constants themselves.

I would like to limit the mantissa to a fixed number of bits (setting the lower ones to 0). But since floating-point literals are written in decimal, it is difficult to specify numbers in such a way that their binary representation has all zeros in the lower mantissa bits.

The ideal solution would look something like this:

#define FP_REDUCE(float)  /* some macro  */

static const float32_t veryLargeArray[] = {
  FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
  // ... 
};

#undef FP_REDUCE

This should be done at compile time, and it should be platform independent.

Recommended answer

The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2^n (for example, to remove eight bits, pass 0x1p8f as the second argument). The casts to float32_t coerce the intermediate results to that type, because the C standard otherwise permits implementations to use more precision within expressions. (Double rounding could produce incorrect results in theory, but this does not occur when float32_t is the IEEE-754 basic 32-bit binary format and the C implementation evaluates the expression in that format or in the 64-bit format or wider: the former is the desired format, and the latter is wide enough to represent the intermediate results exactly.)

IEEE-754 binary floating-point with round-to-nearest is assumed. Overflow occurs if x·(p+1) rounds to infinity.

#define RemoveBits(x, p) ((float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x))))
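
As a rough sketch of how this could be wired into the generated file from the question, the example below defines FP_REDUCE in terms of RemoveBits and adds a small check that the low mantissa bits are zero. The typedef of float32_t, the choice of eight removed bits, and the helper names float_bits, count_unreduced and check_very_large_array are assumptions made for illustration only; they are not part of the original answer. The macro is written with balanced parentheses and one enclosing pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is the IEEE-754 binary32 format. In a real project
   float32_t would normally come from a platform header instead. */
typedef float float32_t;

/* Veltkamp-Dekker split: keeps the high part of x, rounding away n bits (p = 2^n). */
#define RemoveBits(x, p) ((float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x))))

/* Hypothetical wrapper for the generated file: remove the 8 lowest bits. */
#define FP_REDUCE(x) RemoveBits((x), 0x1p8f)

static const float32_t veryLargeArray[] = {
    FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
};

#undef FP_REDUCE

/* Illustrative helper: return the raw bit pattern of a float. */
static uint32_t float_bits(float32_t v)
{
    uint32_t u;
    memcpy(&u, &v, sizeof u);
    return u;
}

/* Illustrative helper: count elements whose lowest `bits` mantissa bits are not all zero. */
static size_t count_unreduced(const float32_t *a, size_t n, unsigned bits)
{
    const uint32_t mask = (UINT32_C(1) << bits) - 1u;
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (float_bits(a[i]) & mask) {
            bad++;
        }
    }
    return bad;
}

/* Illustrative self-check: aborts if any constant still has low bits set. */
static void check_very_large_array(void)
{
    const size_t n = sizeof veryLargeArray / sizeof veryLargeArray[0];
    assert(count_unreduced(veryLargeArray, n, 8) == 0);
}
```

Because the macro expands to an arithmetic constant expression, it can be used in a static initializer, so the reduction happens at compile time. Calling check_very_large_array() once from a test program is one way to confirm that a given compiler really produces constants with the expected zero bits. The check is only meaningful for normal (non-subnormal) values that do not overflow when multiplied by p+1.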
