Fast float quantize, scaled by precision?

Views: 63 Published: 2020/11/8 21:48:25 Tags: c floating-point


Question

Since float precision decreases for larger values, in some cases it may be useful to quantize a value relative to its magnitude, instead of quantizing by an absolute step.

A naive approach could be to detect the precision at the value and scale it up:

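#include <math.h>

float quantize(float value, float quantize_scale) {
    /* Gap to the next representable float above |value| (one ULP), scaled up. */
    float magnitude = fabsf(value);
    float factor = (nextafterf(magnitude, INFINITY) - magnitude) * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}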

However, this seems too heavy.

Instead, it should be possible to mask out bits in the float's mantissa to simulate something like casting to a 16-bit float and back, for example.

Not being an expert in float bit twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).
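To make the masking idea concrete, here is a minimal sketch, not taken from the original post. It assumes float is IEEE 754 binary32 (23 stored mantissa bits), and mask_mantissa and keep_bits are hypothetical names introduced for illustration. Clearing low mantissa bits truncates toward zero rather than rounding. For finite inputs the result is still a valid float and needs no normalizing, because the leading significand bit is implicit and the exponent field is left untouched.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep only the top keep_bits (0..23) of the 23 stored
   mantissa bits of a float. Assumes IEEE 754 binary32. Truncates toward
   zero; it does not round to nearest. */
static float mask_mantissa(float value, int keep_bits)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);      /* well-defined way to read the bits */
    uint32_t mask = ~((UINT32_C(1) << (23 - keep_bits)) - 1u);
    bits &= mask;                            /* clear the low mantissa bits */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, mask_mantissa(-3.75f, 1) gives -3.0f. One caveat: a NaN whose set mantissa bits all fall in the cleared range turns into an infinity, so NaN inputs would need separate handling if they can occur.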

For speed, when the exact rounding behavior isn't important, what is a fast way to quantize floats while taking their magnitude into account?

Recommended answer

The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so they are never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.
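One possible way to produce 2^b at run time is the standard ldexp function from <math.h>. This is only a sketch: scale_for is a hypothetical helper name, not part of the original answer.

```c
#include <math.h>

/* Hypothetical helper: returns 2^b as a double, for use as Scale.
   ldexp(1.0, b) computes 1.0 * 2^b, which is exact as long as 2^b
   is within the range of double. */
static double scale_for(int b)
{
    return ldexp(1.0, b);
}
```

With b = 43, scale_for(43) equals the constant 0x1p43 mentioned above, which keeps the high 10 bits of a double's significand.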

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. The high part is rounded to nearest, with ties going to even.

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated: double for double, float for float, and so on. If the compiler evaluates float expressions in double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type (with Scale adjusted correspondingly), and then convert back.

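/* Scale must be defined as 2^b, e.g. 0x1p43 to keep the high 10 bits of a double. */
void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;      /* high s-b bits of x, rounded to nearest */
    *x1 = x - *x0;    /* remaining low bits of x */
}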
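Tying this back to the float case in the question, the workaround of doing the split in a wider type might look like the sketch below. The names quantize_float and keep_bits are hypothetical. It assumes double is IEEE 754 binary64, that double expressions are actually evaluated in double (not in a wider x87 format), and that keep_bits is between 1 and 24. Since a double has 53 significand bits, keeping keep_bits of them means using b = 53 - keep_bits.

```c
#include <math.h>

/* Hypothetical wrapper: round a float to keep_bits significant bits
   (1..24) by performing the split in double and keeping only the high part.
   Assumes double arithmetic is evaluated in IEEE 754 binary64. */
static float quantize_float(float value, int keep_bits)
{
    double x = value;                           /* exact conversion */
    double scale = ldexp(1.0, 53 - keep_bits);  /* Scale = 2^b, b = 53 - keep_bits */
    double d = x * (scale + 1);
    double t = d - x;
    return (float)(d - t);                      /* high part; low part is discarded */
}
```

For instance, quantize_float(3.75f, 2) gives 4.0f, because 4 is the nearest value with two significant bits. A very large input that rounds up past FLT_MAX would become infinity on the conversion back to float.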
