Half-precision floating-point in Java


Problem description


Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers or convert them to and from double-precision?

Either of these approaches would be suitable:

  • Keep the numbers in half-precision format and compute using integer arithmetic & bit-twiddling (as MicroFloat does for single- and double-precision)
  • Perform all computations in single or double precision, converting to/from half precision for transmission (in which case what I need is well-tested conversion functions.)

Edit: conversion needs to be 100% accurate - there are lots of NaNs, infinities and subnormals in the input files.


Related question but for JavaScript: Decompressing Half Precision Floats in Javascript

Solution

You can use Float.intBitsToFloat() and Float.floatToIntBits() to convert them to and from primitive float values. If you can live with truncated precision (as opposed to rounding), the conversion should be possible with just a few bit shifts.
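As a hedged sketch of that truncating shortcut — valid only for magnitudes in half precision's normal range (roughly 6.1e-5 to 65504), with no rounding and no NaN/infinity/subnormal handling; the class and method names here are mine, not from the answer:

```java
public class HalfTruncate {
    // Truncating (round-toward-zero) float -> half conversion.
    // Only correct for normal-range magnitudes; no special cases handled.
    public static int fromFloatTruncated(float f) {
        int fbits = Float.floatToIntBits(f);
        int sign = fbits >>> 16 & 0x8000;              // sign bit moved to half position
        // rebias exponent from 127 to 15 (subtract 112 << 23 = 0x38000000)
        // and drop the low 13 mantissa bits
        return sign | ((fbits & 0x7fffffff) - 0x38000000 >>> 13);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(fromFloatTruncated(1.0f)));  // 3c00
        System.out.println(Integer.toHexString(fromFloatTruncated(-2.5f))); // c100
    }
}
```

This is the "few bit shifts" version; the full answer below shows why rounding and the special values make the real thing considerably longer.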

I have now put a little more effort into it, and it turned out not quite as simple as I expected at the beginning. This version is now tested and verified in every aspect I could imagine, and I'm very confident that it produces exact results for all possible input values. It supports exact rounding and subnormal conversion in either direction.

// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp =  hbits & 0x7c00;            // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff );
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return Float.intBitsToFloat(          // combine all parts
        ( hbits & 0x8000 ) << 16          // sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 );         // value << ( 23 - 10 )
}
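To illustrate the bit layout this function relies on, here is a hand decoding of a single normal half value — a standalone sketch whose constants mirror the code above:

```java
public class HalfBits {
    public static void main(String[] args) {
        int hbits = 0x3c00;                    // half-precision 1.0: sign 0, exp 15, mantissa 0
        int sign  = (hbits & 0x8000) << 16;    // sign << (31 - 15)
        int exp   = (hbits & 0x7c00) + 0x1c000; // rebias: exp - 15 + 127, still shifted by 10
        int mant  = hbits & 0x03ff;            // 10-bit mantissa
        // value << (23 - 10) lines the exponent and mantissa up with float's layout
        float f = Float.intBitsToFloat(sign | (exp | mant) << 13);
        System.out.println(f);                 // 1.0
    }
}
```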


// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = Float.floatToIntBits( fval );
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value

    if( val >= 0x47800000 )               // might be or become NaN/Inf
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 // is or must become NaN/Inf
            if( val < 0x7f800000 )        // was value but too large
                return sign | 0x7c00;     // make it +/-Inf
            return sign | 0x7c00 |        // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             // unrounded not quite Inf
    }
    if( val >= 0x38800000 )               // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                // too small for subnormal
        return sign;                      // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
         + ( 0x800000 >>> val - 102 )     // round depending on cut off
      >>> 126 - val );   // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}

I implemented two small extensions compared to the book, because the general precision of 16-bit floats is rather low, which can make the inherent anomalies of floating-point formats visually perceptible, whereas with larger floating-point types they usually go unnoticed thanks to the ample precision.

The first one is these two lines in the toFloat() function:

if( mant == 0 && exp > 0x1c400 )  // smooth transition
    return Float.intBitsToFloat( ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff );

Floating-point numbers in the normal range of the type adapt the exponent, and thus the precision, to the magnitude of the value. But this adaptation is not smooth; it happens in steps: switching to the next higher exponent halves the precision. The precision then remains the same for all values of the mantissa until the next jump to the next higher exponent. The extension code above makes these transitions smoother by returning a value that lies in the geometric center of the covered 32-bit float range for this particular half-float value. Every normal half-float value maps to exactly 8192 32-bit float values, and the returned value is supposed to sit exactly in the middle of them. But at a transition of the half-float exponent, the lower 4096 values have twice the precision of the upper 4096 values and thus cover a number space only half as large as on the other side. All of these 8192 32-bit float values map back to the same half-float value, so converting a half float to 32 bit and back yields the same half-float value regardless of which of the 8192 intermediate 32-bit values was chosen. The extension now produces something like a smoother half step by a factor of sqrt(2) at the transition, as shown in the right part of the picture below, while the left part is supposed to visualize the sharp step by a factor of two without anti-aliasing. You can safely remove these two lines from the code to get the standard behavior.

covered number space on either side of the returned value:
       6.0E-8             #######                  ##########
       4.5E-8             |                       #
       3.0E-8     #########               ########
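The precision step at an exponent transition can also be seen numerically. Around 2.0, the half-float ulp doubles from 2^-10 to 2^-9 — a standalone sketch in plain float arithmetic (exact, since all values involved are representable in a float):

```java
public class HalfStep {
    public static void main(String[] args) {
        // half values straddling the exponent step at 2.0:
        float belowStep = 2.0f - 0x1p-10f; // 0x3bff: largest half below 2.0
        float atStep    = 2.0f;            // 0x4000: exponent just incremented
        float aboveStep = 2.0f + 0x1p-9f;  // 0x4001: next half above 2.0
        // the spacing between representable half values doubles at the step
        System.out.println(atStep - belowStep); // 2^-10 = 9.765625E-4
        System.out.println(aboveStep - atStep); // 2^-9  = 0.001953125
    }
}
```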

The second extension is in the fromFloat() function:

    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
...
        return sign | 0x7bff;             // unrounded not quite Inf
    }

This extension slightly extends the number range of the half-float format by saving some 32-bit values from getting promoted to Infinity. The affected values are those that would have been smaller than Infinity without rounding and become Infinity only because of the rounding. You can safely remove the lines shown above if you don't want this extension.

I tried to optimize the path for normal values in the fromFloat() function as much as possible, which makes it a bit less readable due to the use of precomputed, unshifted constants. I didn't put as much effort into toFloat() since it would not exceed the performance of a lookup table anyway. So if speed really matters, you could use the toFloat() function only to fill a static lookup table with 0x10000 elements and then use this table for the actual conversion. This is about 3 times faster with a current x64 server VM and about 5 times faster with the x86 client VM.
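A minimal sketch of that lookup-table variant, reusing the answer's toFloat() (shown here without the smooth-transition extension) to fill the 0x10000-entry table once; the class and field names are hypothetical:

```java
public class HalfLookup {
    // one float per possible 16-bit half pattern, filled once at class load
    private static final float[] TABLE = new float[0x10000];
    static {
        for (int h = 0; h < 0x10000; h++)
            TABLE[h] = toFloat(h);
    }

    // table-driven conversion: one array read per value
    public static float toFloatFast(int hbits) {
        return TABLE[hbits & 0xffff];
    }

    // toFloat() from the answer above, minus the smooth-transition lines
    public static float toFloat(int hbits) {
        int mant = hbits & 0x03ff;            // 10 bits mantissa
        int exp  = hbits & 0x7c00;            // 5 bits exponent
        if (exp == 0x7c00)                    // NaN/Inf
            exp = 0x3fc00;
        else if (exp != 0)                    // normalized value
            exp += 0x1c000;                   // exp - 15 + 127
        else if (mant != 0) {                 // subnormal
            exp = 0x1c400;                    // make it normal
            do {
                mant <<= 1;
                exp -= 0x400;
            } while ((mant & 0x400) == 0);
            mant &= 0x3ff;                    // discard subnormal bit
        }                                     // else +/-0 -> +/-0
        return Float.intBitsToFloat((hbits & 0x8000) << 16 | (exp | mant) << 13);
    }

    public static void main(String[] args) {
        System.out.println(toFloatFast(0x3c00)); // 1.0
        System.out.println(toFloatFast(0xc000)); // -2.0
    }
}
```

The table costs 256 KiB but turns the conversion into a single bounds-checked array read, which is where the 3-5x speedup quoted above comes from.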

I hereby put the code into the public domain.
