Float32 to Float16

Can someone explain to me how I convert a 32-bit floating point value to a 16-bit floating point value?

(s = sign, e = exponent, and m = mantissa)

If a 32-bit float is 1s7e24m
And a 16-bit float is 1s5e10m

Then is it as simple as doing this?

int     fltInt32;
short   fltInt16;
memcpy( &fltInt32, &flt, sizeof( float ) );

fltInt16 = (fltInt32 & 0x00FFFFFF) >> 14;
fltInt16 |= ((fltInt32 & 0x7f000000) >> 26) << 10;
fltInt16 |= ((fltInt32 & 0x80000000) >> 16);

I'm assuming it ISN'T that simple ... so can anyone tell me what you DO need to do?

Edit: I can see I've got my exponent shift wrong ... so would THIS be better?

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7c000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

I'm hoping this is correct. Apologies if I'm missing something obvious that has been said. It's almost midnight on a Friday night ... so I'm not "entirely" sober ;)

Edit 2: Ooops. Buggered it again. I want to lose the top 3 bits not the lower! So how about this:

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x0f800000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

Final code should be:

fltInt16    =  ((fltInt32 & 0x7fffffff) >> 13) - (0x38000000 >> 13);
fltInt16    |= ((fltInt32 & 0x80000000) >> 16);
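
For anyone who wants to try those two lines, here is a minimal sketch that wraps them in a function. The name float_to_half_simple is made up for illustration, and unsigned fixed-width types replace int and short so the shifts are well defined. It assumes float is an IEEE 754 binary32 value and that the input falls in the normal half-precision range (magnitude roughly 6.1e-5 to 65504). It does no rounding and does not handle overflow, subnormals, infinities, or NaN.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name is an assumption, not from the original post).
   Valid only for inputs that are normal numbers in half precision. */
uint16_t float_to_half_simple(float flt)
{
    uint32_t fltInt32;
    uint16_t fltInt16;

    /* Copy the raw bits of the float without violating aliasing rules. */
    memcpy(&fltInt32, &flt, sizeof(float));

    /* Clear the sign, drop the 13 low mantissa bits (23 -> 10), then
       subtract the bias difference 127 - 15 = 112, which after the shift
       sits at bit 10: 0x38000000 >> 13 == 112 << 10. */
    fltInt16 = (uint16_t)(((fltInt32 & 0x7fffffffu) >> 13) - (0x38000000u >> 13));

    /* Move the sign from bit 31 down to bit 15. */
    fltInt16 |= (uint16_t)((fltInt32 & 0x80000000u) >> 16);

    return fltInt16;
}
```

Calling float_to_half_simple(1.0f) yields 0x3C00, the half-precision encoding of 1.0. Values outside the normal half range produce wrong bit patterns because the subtraction wraps.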

Solution

The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then to bias it for the float16 representation.

Apart from this detail, I do think it's as simple as that, but I still get surprised by floating-point representations from time to time.
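
The unbias/rebias step can be sketched on its own. This assumes the IEEE 754 layouts, where binary32 has 1 sign bit, 8 exponent bits (bias 127), and 23 mantissa bits, and binary16 has 1 sign bit, 5 exponent bits (bias 15), and 10 mantissa bits. The function name rebias_exponent and the constant names are illustrative only.

```c
#include <stdint.h>

/* Exponent biases defined by IEEE 754 for binary32 and binary16. */
enum { F32_BIAS = 127, F16_BIAS = 15 };

/* Hypothetical helper: given the raw bits of a binary32 value, return
   the exponent field a binary16 value would need. A result outside
   1..30 means the value is not a normal half-precision number. */
int rebias_exponent(uint32_t f32_bits)
{
    int biased32 = (int)((f32_bits >> 23) & 0xFFu); /* 8-bit exponent field */
    int actual   = biased32 - F32_BIAS;             /* remove the float32 bias */
    return actual + F16_BIAS;                       /* apply the float16 bias */
}
```

For the bits of 1.0f (0x3F800000) this returns 15, and for 65536.0f (0x47800000) it returns 31, which is already out of range for a finite half. Subtracting 0x38000000 >> 13 from the shifted bits, as in the question's final code, performs the same 127 - 15 = 112 adjustment in one step.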

EDIT:

  1. While you are rebiasing the exponent, check that the result still fits in the float16 exponent field (overflow).

  2. Your algorithm truncates the last bits of the mantissa a little abruptly. That may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded. "0..." -> round down, "100..001..." -> round up, "100..00" -> round to even.
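
Both points can be combined into one routine. The following is a sketch under stated assumptions, not a complete converter: it assumes IEEE 754 binary32 and binary16, the name float_to_half_rne is made up, values too small for a normal half are simply flushed to signed zero instead of being turned into subnormals, and NaN payloads are not preserved.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: float32 -> float16 bits with an overflow check and
   round-to-nearest-even. Simplification: underflow flushes to signed zero. */
uint16_t float_to_half_rne(float flt)
{
    uint32_t bits;
    memcpy(&bits, &flt, sizeof bits);

    uint16_t sign  = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp32 = (bits >> 23) & 0xFFu;
    uint32_t mant  = bits & 0x007FFFFFu;

    /* Infinity or NaN: all exponent bits set; keep NaN as a quiet NaN. */
    if (exp32 == 0xFFu)
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x0200u : 0u));

    /* Unbias from float32 (127) and rebias for float16 (15). */
    int exp16 = (int)exp32 - 127 + 15;

    if (exp16 >= 31)                 /* too large: becomes infinity */
        return (uint16_t)(sign | 0x7C00u);
    if (exp16 <= 0)                  /* too small: flush to zero (assumption) */
        return sign;

    uint32_t half    = ((uint32_t)exp16 << 10) | (mant >> 13);
    uint32_t dropped = mant & 0x1FFFu;   /* the 13 bits being discarded */

    /* More than half an ulp rounds up; exactly half rounds to the even
       neighbour. A carry out of the mantissa bumps the exponent, and
       0x7BFF + 1 == 0x7C00 correctly gives infinity. */
    if (dropped > 0x1000u || (dropped == 0x1000u && (half & 1u)))
        half++;

    return (uint16_t)(sign | half);
}
```

With this sketch, 1.00048828125f (exactly halfway between two half values) stays at 0x3C00 because that neighbour is even, while 1.00146484375f rounds up to 0x3C02. An input of 65520.0f rounds up to infinity (0x7C00).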
