How is infinity represented in a C double?


Question


I learned from the book Computer Systems: A Programmer's Perspective that the IEEE standard requires a double-precision floating-point number to be represented in the following 64-bit binary format:

  • s: 1 bit for sign
  • exp: 11 bits for exponent
  • frac: 52 bits for fraction

+infinity is represented as a special value with the following pattern:

  • s = 0
  • all exp bits are 1
  • all fraction bits are 0

I think the full 64 bits of a double should be laid out in the following order:

(s)(exp)(frac)
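The layout above can be checked directly by pulling the three fields out of a double. This is only a minimal sketch: the names `double_fields` and `split_double` are made up for illustration, and it assumes that `double` is IEEE 754 binary64 and is stored in the same byte order as `uint64_t`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE 754 binary64 value. */
struct double_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, biased by 1023 */
    uint64_t fraction;  /* 52 bits */
};

/* Assumption: double is binary64 and shares its byte order with uint64_t. */
static struct double_fields split_double(double d)
{
    uint64_t bits;
    struct double_fields f;

    memcpy(&bits, &d, sizeof bits);               /* copy the object representation */
    f.sign     = (unsigned)(bits >> 63);          /* top bit */
    f.exponent = (unsigned)((bits >> 52) & 0x7ff);/* next 11 bits */
    f.fraction = bits & ((1ULL << 52) - 1);       /* low 52 bits */
    return f;
}
```

For +infinity this should report sign 0, exponent 0x7ff and fraction 0, matching the pattern listed above.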

So I wrote the following C code to verify this layout:

//Check the infinity
double x1 = (double)0x7ff0000000000000;  // This should be the +infinity
double x2 = (double)0x7ff0000000000001; //  Note the extra ending 1, x2 should be NaN
printf("\nx1 = %f, x2 = %f sizeof(double) = %d", x1,x2, sizeof(x2));
if (x1 == x2)
    printf("\nx1 == x2");
else
    printf("\nx1 != x2");

But the result is:

x1 = 9218868437227405300.000000, x2 = 9218868437227405300.000000 sizeof(double) = 8
x1 == x2

Why is the number a valid number rather than some infinity error?

Why is x1 == x2?

(I am using the MinGW GCC compiler.)

ADD 1

I modified the code as below and then validated the infinity and NaN successfully.

//Check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 =* ((double *)(&x1));
double y2 =* ((double *)(&x2));
double y3 =* ((double *)(&x3));

printf("\nsizeof(long long) = %d", sizeof(x1));
printf("\nx1 = %f, x2 = %f, x3 = %f", x1, x2, x3); // %f is good enough for output
printf("\ny1 = %f, y2 = %f, y3 = %f", y1, y2, y3);

The result is:

sizeof(long long) = 8
x1 = 1.#INF00, x2 = -1.#INF00, x3 = 1.#SNAN0
y1 = 1.#INF00, y2 = -1.#INF00, y3 = 1.#QNAN0

The detailed output looks a bit strange, but I think the point is clear.

PS.: It seems the pointer conversion is not necessary. Just use %f to tell the printf function to interpret the unsigned long long variable in double format.

ADD 2

Out of curiosity, I checked the bit representation of the variables with the following code.

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, int len)
{
    int i;
    for (i = len-1; i>=0; i--)
    {
        printf("%.2x", start[i]);
    }
    printf("\n");
}

And I tried the code below:

//check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 =* ((double *)(&x1));
double y2 =* ((double *)(&x2));
double y3 = *((double *)(&x3));

unsigned long long x4 = x1 + x2;  // I want to check (+infinity)+(-infinity)
double y4 = y1 + y2; // I want to check (+infinity)+(-infinity)

printf("\nx1: ");
show_bytes((byte_pointer)&x1, sizeof(x1));
printf("\nx2: ");
show_bytes((byte_pointer)&x2, sizeof(x2));
printf("\nx3: ");
show_bytes((byte_pointer)&x3, sizeof(x3));
printf("\nx4: ");
show_bytes((byte_pointer)&x4, sizeof(x4));

printf("\ny1: ");
show_bytes((byte_pointer)&y1, sizeof(y1));
printf("\ny2: ");
show_bytes((byte_pointer)&y2, sizeof(y2));
printf("\ny3: ");
show_bytes((byte_pointer)&y3, sizeof(y3));
printf("\ny4: ");
show_bytes((byte_pointer)&y4, sizeof(y4));

The output is:

x1: 7ff0000000000000

x2: fff0000000000000

x3: 7ff0000000000001

x4: 7fe0000000000000

y1: 7ff0000000000000

y2: fff0000000000000

y3: 7ff8000000000001

y4: fff8000000000000  // <== Different from x4

The strange part is that, although x1 and x2 have the same bit patterns as y1 and y2, the sum x4 is different from y4.

And

printf("\ny4=%f", y4);

gives this:

y4=-1.#IND00  // What does it mean???

Why are they different? And how is y4 obtained?

Solution

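First, 0x7ff0000000000000 is indeed the bit representation of a double infinity. But the cast does not set the bit representation; it converts the numeric value of the integer 0x7ff0000000000000 (roughly 9.2 × 10^18) to the nearest double. That is also why x1 == x2: a double carries only 53 significant bits, so the trailing 1 in 0x7ff0000000000001 is rounded away and both integers convert to the same double. To get an infinity, you need some other way to set the bit pattern.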
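To see the value conversion in isolation, here is a small sketch. The function names are made up for illustration, and it assumes IEEE 754 binary64 doubles with the default round-to-nearest mode.

```c
#include <stdint.h>

/* Value conversion: the integer becomes the double closest to its numeric
   value. The bit pattern of the integer plays no role here. */
static double cast_value(uint64_t n)
{
    return (double)n;
}

/* Returns 1 if the two integers from the question convert to the same double.
   Assumption: binary64 doubles, round-to-nearest. */
static int question_values_collapse(void)
{
    double x1 = cast_value(0x7ff0000000000000ULL);
    double x2 = cast_value(0x7ff0000000000001ULL);
    return x1 == x2;
}
```

Both integers are about 2047 × 2^52, far below the largest finite double, so the result is an ordinary finite number rather than an infinity.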

The straightforward way to set the bit pattern would be

uint64_t bits = 0x7ff0000000000000;
double infinity = *(double*)&bits;

However, this is undefined behavior. The C standard forbids reading a value that was stored as one type (uint64_t) through an lvalue of an incompatible type (double). This is known as the strict aliasing rule, and it allows the compiler to generate better code because it can assume that a write through one type and a read through the other never touch the same object, so it is free to reorder them.

The one exception to this rule is the character types: you are explicitly allowed to access the bytes of any object through a char pointer. Note that the exception only goes in that direction. Reading a char array through a double* is still an aliasing violation (and the array may not be suitably aligned for a double), so copy the bytes with memcpy (from <string.h>) instead:

unsigned char bits[] = {0x7f, 0xf0, 0, 0, 0, 0, 0, 0};
double infinity;
memcpy(&infinity, bits, sizeof infinity);

This is no longer undefined behavior, but the result is still implementation-defined: the order of the bytes in a double depends on your machine. The code above works on a big-endian machine (for example, Power or ARM running in big-endian mode), but not on x86. For x86 you need this version:

unsigned char bits[] = {0, 0, 0, 0, 0, 0, 0xf0, 0x7f};
double infinity;
memcpy(&infinity, bits, sizeof infinity);

There is really no way around this implementation-defined behavior, since there is no guarantee that a machine stores floating-point values in the same byte order as integer values. There are even machines that use byte orders like <1, 0, 3, 2>. I don't even want to know who came up with this brilliant idea, but it exists and we have to live with it.
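When integers and doubles do share a byte order, as on x86 and most current platforms, copying from a uint64_t with memcpy sidesteps both the aliasing problem and the hand-written byte arrays. A minimal sketch under that assumption (the helper names are made up for illustration, and it also assumes that double is IEEE 754 binary64 and 8 bytes wide):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double without violating strict aliasing.
   Assumption: sizeof(double) == 8 and double has the same byte order as uint64_t. */
static double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* The inverse: read back the bit pattern of a double. */
static uint64_t double_to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

With these, bits_to_double(0x7ff0000000000000ULL) should yield +infinity and bits_to_double(0x7ff0000000000001ULL) a NaN on such a platform. Compilers typically turn the memcpy into a single register move, so this usually costs nothing compared with the pointer cast.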


To your last question: floating-point arithmetic is inherently different from integer arithmetic. The bits have special meanings, and the floating-point unit takes that into account. In particular, special values like infinities, NaNs, and denormalized numbers are treated in a special way. Since +inf + -inf is defined to yield a NaN, your floating-point unit emits the bit pattern of a NaN. On x86, the NaN generated for such an invalid operation has the bit pattern fff8000000000000, which is exactly your y4, and the Microsoft C runtime that MinGW links against prints this particular NaN as -1.#IND ("indefinite"). The integer unit does not know about infinities or NaNs, so it just interprets the bit patterns as huge integers and happily performs an integer addition (which happens to overflow in this case). The resulting bit pattern is not that of a NaN. It happens to be the bit pattern of a really huge, positive floating-point number (2^1023, to be precise), but that bears no meaning.
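The two additions can be put side by side as follows. This is only a sketch with made-up function names, assuming IEEE 754 binary64 doubles that share their byte order with uint64_t.

```c
#include <stdint.h>
#include <string.h>

/* Integer addition of two bit patterns: plain arithmetic modulo 2^64. */
static uint64_t add_as_integers(uint64_t a, uint64_t b)
{
    return a + b;  /* unsigned overflow wraps; the carry out of bit 63 is lost */
}

/* Floating-point addition of the values those bit patterns encode.
   The result is returned as a bit pattern again so both sums can be compared. */
static uint64_t add_as_doubles(uint64_t a, uint64_t b)
{
    double x, y, sum;
    uint64_t bits;

    memcpy(&x, &a, sizeof x);
    memcpy(&y, &b, sizeof y);
    sum = x + y;                      /* follows the IEEE 754 rules for inf and NaN */
    memcpy(&bits, &sum, sizeof bits);
    return bits;
}
```

Feeding 0x7ff0000000000000 and 0xfff0000000000000 to add_as_integers gives 0x7fe0000000000000, while add_as_doubles gives a pattern with all exponent bits set and a nonzero fraction, that is, a NaN. The exact NaN pattern may differ between platforms and compilers.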


Actually, there is a way to set the bit patterns of all values except NaNs in a portable way: given three variables containing the bits of the sign, exponent, and mantissa, you can do this:

#include <assert.h>
#include <math.h>
#include <stdint.h>

uint64_t sign = ..., exponent = ..., mantissa = ...;
double result;
assert(!(exponent == 0x7ff && mantissa));    // Can't set the bits of a NaN this way.
if(exponent) {
    // Normalized numbers and infinity. This branch does not work for denormalized numbers,
    // and it ignores the mantissa when the exponent signals NaN or infinity.
    result = mantissa + (1ull << 52);    // Add the implicit bit.
    result /= (1ull << 52);    // Scale into [1, 2), so the exponent is logically zero (equals the bias).
    result *= pow(2, (double)((int)exponent - 0x3ff));    // Set the exponent; 0x7ff overflows to infinity.
} else {
    // Denormalized numbers and zero.
    result = mantissa;    // No implicit bit.
    result /= (1ull << 51);    // Together with the next line this scales by 2^-1074.
    result *= pow(2, -0x3ff);    // Scale down to the denormalized range.
}
result *= (sign ? -1.0 : 1.0);    // Set the sign.

This uses the floating-point unit itself to move the bits into the right place. Since there is no way to interact with the mantissa bits of a NaN using floating-point arithmetic, it is not possible to include the generation of NaNs in this code. Well, you could generate a NaN, but you'd have no control over its mantissa bit pattern.
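The same idea can be wrapped in a function. Note that this sketch is a variation and not the code above: it uses ldexp instead of pow and division, and it returns the NAN macro (with an unspecified payload) instead of asserting. The name make_double is made up for illustration, and the function assumes that double is IEEE 754 binary64, that exponent is at most 0x7ff and that mantissa is below 2^52.

```c
#include <math.h>
#include <stdint.h>

/* Build a double from its three fields using only floating-point operations.
   Assumptions: binary64 doubles, exponent <= 0x7ff, mantissa < 2^52. */
static double make_double(uint64_t sign, uint64_t exponent, uint64_t mantissa)
{
    double result;

    if (exponent == 0x7ff && mantissa != 0)
        return NAN;  /* some NaN; its payload cannot be controlled this way */

    if (exponent != 0) {
        /* Normalized number: restore the implicit bit, then scale by
           2^(exponent - 1023 - 52). For exponent 0x7ff this overflows to infinity. */
        result = ldexp((double)(mantissa | (1ULL << 52)), (int)exponent - 0x3ff - 52);
    } else {
        /* Denormalized number or zero: no implicit bit, fixed scale of 2^-1074. */
        result = ldexp((double)mantissa, -1074);
    }
    return sign ? -result : result;  /* negation also turns 0.0 into -0.0 */
}
```

For example, make_double(0, 0x3ff, 0) should give 1.0 and make_double(0, 0x7ff, 0) should give +infinity. ldexp only rescales by a power of two, so every step stays exact until the final overflow.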
