Fast float quantize, scaled by precision?
Question
Since float precision decreases for larger values, in some cases it may be useful to quantize a value relative to its magnitude, instead of quantizing by an absolute step.
A naive approach could be to detect the precision and scale it up:
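#include <math.h>

float quantize(float value, float quantize_scale) {
    /* Gap between |value| and the next representable float, scaled up. */
    float factor = (nextafterf(fabsf(value), INFINITY) - fabsf(value)) * quantize_scale;
    /* Round value to the nearest multiple of factor. */
    return floorf((value / factor) + 0.5f) * factor;
}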
However, this seems too heavy.
Instead, it should be possible to mask out bits in the float's mantissa to simulate something like casting to a 16-bit float and back, for example.
Not being an expert in float bit twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).
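The masking idea can be sketched as follows. This is only an illustration, assuming float is an IEEE 754 binary32 value; the helper name quantize_mask and its keep_bits parameter are hypothetical. Clearing low mantissa bits leaves the sign and exponent fields untouched, so a finite input stays a valid finite float with no renormalizing step (a subnormal may become zero). The value is truncated toward zero rather than rounded.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep the top keep_bits of the 23 explicit mantissa
   bits and zero the rest. Assumes float is IEEE 754 binary32. */
static float quantize_mask(float value, unsigned keep_bits)
{
    assert(keep_bits >= 1 && keep_bits <= 23);

    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);      /* well-defined type pun */

    uint32_t drop = 23u - keep_bits;         /* number of low bits to clear */
    bits &= ~((UINT32_C(1) << drop) - 1u);   /* sign and exponent untouched */

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With keep_bits = 10 the result carries the same significand width as a 16-bit half float, although the exponent range is still that of a float. For example, quantize_mask(3.14159265f, 10) yields 3.140625f.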
For speed, when the exact rounding behavior isn't important, what is a fast way to quantize floats, taking their magnitude into account?
Recommended answer
The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.
If there are s bits in the significand (53 in the IEEE 754 64-bit binary format), and the value Scale in the code below is 2^b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so they are never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.
This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. The split rounds ties to even.
This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated (double for double, float for float, and so on). If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type (with Scale adjusted correspondingly), and then convert back.
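/* Scale is 2^b; 0x1p43 is the example constant (b = 43) mentioned above. */
static const double Scale = 0x1p43;

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;      /* high s-b bits of x */
    *x1 = x - *x0;    /* remaining low bits */
}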
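For the float case in the question, the same split might be adapted as in the sketch below. It is not part of the original answer: the names quantize_split, quantize_split_wide, and QUANT_SCALE_F are hypothetical, and b = 13 is an assumed example that leaves 24 - 13 = 11 significand bits. The volatile qualifiers are one possible way to make sure each intermediate is rounded to its declared type before it is reused. The second function follows the wider-type workaround described above, using double with Scale adjusted to 2^42 so that 53 - 42 = 11 bits remain.

```c
#include <assert.h>

/* Assumed example: b = 13, so 24 - 13 = 11 significand bits survive. */
#define QUANT_SCALE_F 0x1p13f

/* Hypothetical helper: keep only the high part of the split, in float arithmetic. */
static float quantize_split(float x)
{
    /* volatile forces each intermediate to be rounded to float and stored,
       so it is not kept in wider precision or fused into an FMA. */
    volatile float d = x * (QUANT_SCALE_F + 1.0f);
    volatile float t = d - x;
    return d - t;   /* low part is never computed */
}

/* Same idea performed in double: 53 - 42 = 11 significand bits survive,
   so converting the result back to float is exact. */
static float quantize_split_wide(float x)
{
    volatile double d = (double)x * (0x1p42 + 1.0);
    volatile double t = d - (double)x;
    return (float)(d - t);
}

/* Usage: both variants agree on a sample input. */
static void quantize_split_check(void)
{
    assert(quantize_split(3.14159265f) == quantize_split_wide(3.14159265f));
}
```

Unlike masking the mantissa bits, this rounds to nearest instead of truncating. For 3.14159265f both functions return 3.140625f. As with Split, this assumes round-to-nearest mode and that x * (Scale + 1) does not overflow.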