What is overflow and underflow in floating point?


Question

I feel I don't really understand the concept of overflow and underflow. I'm asking this question to clarify this. I need to understand it at its most basic level with bits. Let's work with the simplified floating point representation of 1 byte - 1 bit sign, 3 bits exponent and 4 bits mantissa:

0 000 0000

The largest exponent field we can store is 111_2 = 7. Subtracting the bias K = 2^2-1 = 3 gives 4, but that field value is reserved for Infinity and NaN. The exponent of the largest finite number is therefore 3, which is stored as 110 in offset binary.

So the bit pattern for max number is:

0 110 1111 // positive
1 110 1111 // negative

When the exponent field is zero, the number is subnormal and has an implicit leading 0 instead of 1. So the bit pattern for the smallest nonzero number is:

0 000 0001 // positive
1 000 0001 // negative
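
To make the patterns above concrete, here is a minimal sketch of a decoder for this 1-byte format. The function name `decode` is hypothetical, and the sketch assumes the format follows the IEEE-754 conventions: bias 3, subnormals scaled by 2^(1-bias), and exponent field 111 reserved for Infinity and NaN.

```python
import math

BIAS = 3  # 2**(3 - 1) - 1 for a 3-bit exponent field


def decode(bits: str) -> float:
    """Decode a pattern such as '0 110 1111' (sign, exponent, mantissa)."""
    raw = int(bits.replace(" ", ""), 2)
    sign = -1.0 if raw >> 7 else 1.0
    exp_field = (raw >> 4) & 0b111
    mantissa = raw & 0b1111
    if exp_field == 0b111:  # reserved: Infinity (mantissa 0) or NaN
        return sign * math.inf if mantissa == 0 else math.nan
    if exp_field == 0:  # subnormal: implicit leading 0 (assumed IEEE-style)
        return sign * (mantissa / 16) * 2.0 ** (1 - BIAS)
    return sign * (1 + mantissa / 16) * 2.0 ** (exp_field - BIAS)


print(decode("0 110 1111"))  # 15.5
print(decode("1 110 1111"))  # -15.5
print(decode("0 000 0001"))  # 0.015625
print(decode("0 111 0000"))  # inf
```

Under these assumptions the largest finite magnitude is 15.5 and the smallest nonzero magnitude is 2^-6 = 0.015625.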

I've found these descriptions for single-precision floating point:

Negative numbers less than -(2 - 2^-23) × 2^127 (negative overflow)
Negative numbers greater than -2^-149 (negative underflow)
Positive numbers less than 2^-149 (positive underflow)
Positive numbers greater than (2 - 2^-23) × 2^127 (positive overflow)
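
The two magnitudes in that list can be checked against actual single-precision bit patterns. The following is a small sketch using Python's standard `struct` module; the helper name `f32_from_hex` is hypothetical.

```python
import struct


def f32_from_hex(hex_bits: str) -> float:
    """Interpret 8 hex digits as a big-endian IEEE-754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_bits))[0]


largest = f32_from_hex("7f7fffff")   # exponent field 254, mantissa all ones
smallest = f32_from_hex("00000001")  # exponent field 0, mantissa = 1 (subnormal)

print(largest == (2 - 2**-23) * 2.0**127)  # True
print(smallest == 2.0**-149)               # True
```

These are the same two roles played by `0 110 1111` and `0 000 0001` in the 1-byte format.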

Out of them I understand only positive overflow which results in +Infinity, and the example would be like this:

0 110 1111 + 0 110 1111 = 0 111 0000 

Can anyone please demonstrate the three other cases for overflow and underflow using the bit patterns I outlined above?

Solution

Of course, the following is implementation dependent. But if the numbers behave anything like what IEEE-754 specifies, floating point numbers do not overflow or underflow to a wildly incorrect answer the way integers do. For example, multiplying two positive numbers should never give you a negative number.

Instead, overflow means that the result is 'too large to represent'. Depending on the rounding mode, the result is usually either the largest finite float (round toward zero, RTZ) or Inf (round to nearest even, RNE):

0 110 1111 * 0 110 1111 = 0 111 0000
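
As a rough illustration of how the rounding mode changes the outcome, the sketch below rounds an exact result into the 1-byte format under either mode. The names `round_to_toy`, `MAX_TOY`, `MIN_EXP` and `MANT_BITS` are hypothetical, and the format is assumed to follow IEEE-754 conventions (bias 3, subnormals, 111 reserved).

```python
import math

MAX_TOY = 15.5   # value of 0 110 1111
MIN_EXP = -2     # exponent shared by subnormals and the smallest normals
MANT_BITS = 4


def round_to_toy(x: float, mode: str = "RNE") -> float:
    """Round x to the nearest value of the toy format (mode: 'RNE' or 'RTZ')."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        return x
    sign = math.copysign(1.0, x)
    mag = abs(x)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, MIN_EXP)
    quantum = 2.0 ** (exp - MANT_BITS)  # spacing of representable values here
    scaled = mag / quantum
    # round() on a float rounds ties to even; floor truncates the magnitude.
    steps = round(scaled) if mode == "RNE" else math.floor(scaled)
    result = steps * quantum
    if result > MAX_TOY:  # overflow
        result = math.inf if mode == "RNE" else MAX_TOY
    return sign * result


product = 15.5 * 15.5  # 0 110 1111 * 0 110 1111
print(round_to_toy(product, "RNE"))  # inf
print(round_to_toy(product, "RTZ"))  # 15.5
```

The exact product is 240.25, far above 15.5, so RNE yields Inf (`0 111 0000`) while RTZ clamps to the largest finite value (`0 110 1111`).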

(Note that integer overflow as you know it could have been avoided in hardware by applying a similar clamping operation; it's just not the convention to do that.)

When dealing with floating point numbers the term underflow means that the number is 'too small to represent', which usually just results in 0.0:

0 000 0001 * 0 000 0001 = 0 000 0000
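
The same behaviour can be observed with ordinary Python floats, which are IEEE-754 doubles on essentially every platform (an assumption of this sketch). Here `5e-324` is the smallest positive subnormal double, playing the role of `0 000 0001`. The sketch also shows the negative case: the result underflows to zero but keeps its sign.

```python
import math

tiny = 5e-324  # smallest positive subnormal double, 2**-1074

pos = tiny * tiny    # positive underflow
neg = -tiny * tiny   # negative underflow

print(pos)                       # 0.0
print(neg)                       # -0.0
print(math.copysign(1.0, neg))   # -1.0, the sign bit survives
```

In the 1-byte format the negative case would correspond to `1 000 0001 * 0 000 0001 = 1 000 0000`.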

Note that I have also heard the term underflow being used for overflow to a very large negative number, but this is not the best term for it. This is an example of when the result is negative and too large to represent, i.e. 'negative overflow':

0 110 1111 * 1 110 1111 = 1 111 0000
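
For comparison, a short sketch with Python floats (assumed to be IEEE-754 doubles under the default round-to-nearest mode), where `sys.float_info.max` plays the role of `0 110 1111`:

```python
import sys

big = sys.float_info.max  # largest finite double

print(big * big)    # inf   (positive overflow)
print(big * -big)   # -inf  (negative overflow)
print(-big - big)   # -inf  (negative overflow from addition)
```

In each case the result has the correct sign and saturates at infinity instead of wrapping around.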
