Converting Int to Float or Float to Int using Bitwise operations (software floating point)


Question


I was wondering if you could help explain the process on converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding on the casting from type to type will help me more in this stage.

From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from there?

As for float to int, you will have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?


I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.

Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?

Solution

First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf

And now to some meat.

The following code is bare bones, and attempts to produce an IEEE-754 single-precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.

IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0. A diagram in the Wikipedia article linked below illustrates this layout.

The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.

(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )

Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:

  • Bit 31 = 0: Positive value
  • Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e., 2^1)
  • Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).

So the value is 1.0 x 2^1 = 2.0.

To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:

  • Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
  • While aligning the integer, records the total number of shifts made.
  • Masks away the hidden 1.
  • Using the number of shifts made, computes the exponent and appends it to the number.
  • Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.

There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:

float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    //  Align the leading 1 of the significand to the hidden-1 
    //  position.  Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    //  The number 1.0 has an exponent of 0, and would need to be
    //  shifted left 23 times.  The number 2.0, however, has an
    //  exponent of 1 and needs to be shifted left only 22 times.
    //  Therefore, the exponent should be (23 - shifts).  IEEE-754
    //  format requires a bias of 127, though, so the exponent field
    //  is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    //  Now merge significand and exponent.  Be sure to strip away
    //  the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);


    //  Reinterpret as a float and return.  This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}

You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
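As a sketch of that optimization, the shift-counting loop can be replaced with a leading-zero count. This version assumes GCC/Clang's __builtin_clz is available (other compilers provide different intrinsics), and uses memcpy instead of the type-punned pointer:

```cpp
#include <cstring>

// Same limited range as above: 0 < value < 1 << 24.
float uint_to_float_clz(unsigned int value)
{
    if (value == 0 || value >= 1u << 24)
        return -1.0f;

    // __builtin_clz counts leading zeros in a 32-bit word, so the
    // leading 1 sits at bit position (31 - clz).  Aligning it to the
    // hidden-1 position (bit 23) takes (23 - position) left shifts.
    int position = 31 - __builtin_clz(value);
    int shifts   = 23 - position;

    unsigned int significand = value << shifts;
    unsigned int exponent    = 127 + 23 - shifts;   // == 127 + position
    unsigned int merged      = (exponent << 23) | (significand & 0x7FFFFFu);

    // memcpy avoids the type-punned pointer; compilers reduce it to a
    // single register move.
    float result;
    std::memcpy(&result, &merged, sizeof result);
    return result;
}
```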

You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
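That signed extension might look like the sketch below (int_to_float is a hypothetical name; it follows the same normalize-and-merge recipe, with the sign recorded up front and placed into bit 31 at the end):

```cpp
#include <cstring>

// Handles -(1 << 24) < value < (1 << 24), value != 0.
float int_to_float(int value)
{
    unsigned int sign = (value < 0) ? 1u : 0u;
    // Negate in a wider type so the magnitude is always representable.
    unsigned int magnitude = sign ? (unsigned int)(-(long long)value)
                                  : (unsigned int)value;

    if (magnitude == 0 || magnitude >= 1u << 24)
        return -1.0f;   // out of the supported range

    // Align the leading 1 to the hidden-1 position, as before.
    int shifts = 0;
    while ((magnitude & (1u << 23)) == 0)
    {
        magnitude <<= 1;
        shifts++;
    }

    unsigned int exponent = 127 + 23 - shifts;
    unsigned int merged = (sign << 31) | (exponent << 23)
                        | (magnitude & 0x7FFFFFu);

    float result;
    std::memcpy(&result, &merged, sizeof result);
    return result;
}
```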

For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.

You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
