Why do higher-precision floating point formats have so many exponent bits?


Problem Description


I've been looking at floating point formats, both IEEE 754 and x87. Here's a summary:

                Total       Bits per field
Precision       Bits    Sign  Exponent  Mantissa
Single          32      1     8         23  (+1 implicit)   
Double          64      1     11        52  (+1 implicit)
Extended (x87)  80      1     15        64
Quadruple       128     1     15        112 (+1 implicit) 

My question is, why do the higher-precision formats have so many exponent bits? Single-precision gets you a maximum value on the order of 10^38, and I can see how in extreme cases (number of atoms in the universe) you might need a larger exponent. But double-precision goes up to ~10^308, and extended- and quadruple-precision have even more exponent bits. This seems much larger than could ever be necessary for actual hardware-accelerated computation. (It's even more absurd with negative exponents!)
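
Those ceilings are easy to confirm. A minimal C sketch (assuming the limits in <float.h> and a toolchain on which long double maps to the 80-bit x87 format, which is platform-dependent):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* largest finite value of each native format */
        printf("FLT_MAX  = %e\n",  FLT_MAX);   /* ~3.4e38,   8-bit exponent  */
        printf("DBL_MAX  = %e\n",  DBL_MAX);   /* ~1.8e308,  11-bit exponent */
        printf("LDBL_MAX = %Le\n", LDBL_MAX);  /* ~1.2e4932 if long double is
                                                  the 80-bit x87 format      */
        return 0;
    }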

That being said, the mantissa bits are so obviously valuable that I figure there must be a good reason to sacrifice them in favor of the exponent. So what is it? I thought it might be to represent the difference between two adjacent values without needing subnormals, but even that doesn't take a big change in the exponent (-6 out of a full range of +1023 to -1022 for a double).

Solution

The IEEE-754 floating-point standard grew out of work professor William Kahan of UC Berkeley had done as a consultant to Intel when Intel embarked on the creation of the 8087 math coprocessor. One of the design criteria for what became the IEEE-754 floating-point formats was functional compatibility with existing proprietary floating-point formats to the largest extent possible. The book

John F. Palmer and Stephen P. Morse, "The 8087 Primer". Wiley, New York 1984.

specifically mentions the 60-bit floating-point format of the CDC 6600, with an 11-bit exponent and 48-bit mantissa, with respect to the double-precision format.

The following published interview (which inexplicably mangles Jerome Coonen's name into Gerome Kunan) provides a brief overview of the genesis of IEEE-754, including a discussion of the choice of floating-point formats:

Charles Severance, "IEEE 754: An Interview with William Kahan", IEEE Computer, Vol. 31, No. 3, March 1998, pp. 114-115 (online)

In the interview, William Kahan mentions adoption of the floating-point formats of the extremely popular DEC VAX minicomputers, in particular the F format for single precision with 8 exponent bits, and the G format for double precision with 11 exponent bits.

The VAX F format goes back to DEC's earlier PDP-11 architecture, and the rationale for choosing 8 exponent bits is stated in PDP-11/40 Technical Memorandum #16: a desire to be able to represent all important physical constants, including the Planck constant (6.626070040 x 10^-34) and the Avogadro constant (6.022140857 x 10^23).
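
As a quick check that an 8-bit exponent really is just wide enough for those constants, here is a small C sketch; it uses IEEE-754 single precision as a stand-in, whose range (roughly 1e-38 to 3e38) is close to that of the PDP-11/VAX F format:

    #include <stdio.h>

    int main(void)
    {
        /* both constants sit inside the ~1e-38 .. 3e38 range provided by
           an 8-bit exponent, so they survive storage in this format */
        float planck   = 6.626070040e-34f;   /* Planck constant, J*s     */
        float avogadro = 6.022140857e23f;    /* Avogadro constant, 1/mol */
        printf("planck   = %e\n", planck);
        printf("avogadro = %e\n", avogadro);
        return 0;
    }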

The VAX had originally used the D format for double precision, which used the same number of exponent bits, namely 8, as the F format. This was found to cause trouble through underflow in intermediate computations, for example in the LAPACK linear algebra routines, as noted in a contribution by James Demmel in NA Digest, Volume 92, Issue 7 (February 16, 1992). This issue is also alluded to in the interview with Kahan, in which it is mentioned that the subsequently introduced VAX G format was inspired by the CDC 6600 floating-point format.
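
The flavor of that intermediate underflow is easy to reproduce in C, using float as a stand-in for a format whose exponent is only 8 bits wide (the D format additionally lacked gradual underflow, so it gave up even earlier). This is only an illustration of the range problem, not of the actual routines involved:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Euclidean norm of (1e-25, 1e-25): the true result, ~1.41e-25, is
           well inside range, but the intermediate squares are ~1e-50. */
        float  xf = 1e-25f;
        double xd = 1e-25;

        /* with an 8-bit exponent the squares underflow to zero ... */
        printf("narrow exponent: %e\n", sqrtf(xf * xf + xf * xf)); /* 0        */

        /* ... while the 11-bit exponent of double has ample headroom */
        printf("wide exponent:   %e\n", sqrt(xd * xd + xd * xd));  /* 1.41e-25 */
        return 0;
    }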

David Stevenson, "A Proposed Standard for Binary Floating-Point Arithmetic", IEEE Computer, Vol. 14, No. 3, March 1981, pp. 51-62 (online)

explains the choice of number of exponent bits for IEEE-754 double precision as follows:

For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format — a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer.
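
That requirement can be checked by direct computation: FLT_MAX is just below 2^128, so a product of eight such factors is just below 2^1024, which still fits (barely) under DBL_MAX. A small C sketch:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        /* multiply eight worst-case single-precision values in double */
        double p = 1.0;
        for (int i = 0; i < 8; i++) {
            p *= (double)FLT_MAX;              /* FLT_MAX is just below 2^128 */
        }
        printf("product = %e  finite = %d\n", p, isfinite(p) != 0); /* ~1.8e308, 1 */
        printf("DBL_MAX = %e\n", DBL_MAX);

        /* one more factor of anything >= 2 overflows the 64-bit format */
        printf("product * 2 = %e\n", p * 2.0);                      /* inf */
        return 0;
    }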

The "extended" floating-point types of IEEE-754 were introduced specifically as intermediate formats that ease implementation of accurate standard mathematical functions for the corresponding "regular" floating-point types.

Jerome T. Coonen, "Contributions to a Proposed Standard for Binary Floating-Point Arithmetic". PhD dissertation, Univ. of California, Berkeley 1984

states that precursors were extended accumulators in the IBM 709x and Univac 1108 machines, but I am not familiar with the formats used for those.

According to Coonen, the choice of the number of mantissa bits in extended formats was driven by the needs of binary-decimal conversion as well as general exponentiation x^y. Palmer/Morse mention exponentiation as well and provide details: due to the error magnification properties of exponentiation, a naive computation utilizing an extended format requires as many additional bits in the mantissa as there are bits in the exponent of the regular format to deliver accurate results. Since double precision uses 11 exponent bits, 64 mantissa bits are therefore required for the double-extended format.
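
The magnification effect is straightforward to demonstrate. If x^y is computed naively as 2^(y*log2(x)), then for results near the top of the double range the intermediate product y*log2(x) is on the order of 2^10, so a rounding error in the last bit of log2(x) is amplified by roughly that factor in the final result. The hypothetical C sketch below compares the naive double-precision route against the same computation carried out in long double (assuming an x86-64 build where double arithmetic is not done in wider x87 registers, and where long double is the 80-bit extended format); the naive result is typically off by hundreds of double-precision ulps:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* naive x^y via exp2(y * log2(x)), entirely in double ... */
        double x = 1.0000001;
        double y = 6.0e9;                 /* y * log2(x) is roughly 865 */
        double naive = exp2(y * log2(x));

        /* ... versus the same computation in the extended format */
        long double ref = exp2l((long double)y * log2l((long double)x));

        double ulp = nextafter(naive, INFINITY) - naive;
        printf("naive = %.17e\n", naive);
        printf("ref   = %.17Le\n", ref);
        printf("error = %.0f ulps (approx)\n", fabs(naive - (double)ref) / ulp);
        return 0;
    }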

In addition to Coonen's PhD thesis, I checked the draft documents published ahead of the release of the IEEE-754 standard, and was unable to find a stated rationale for the choice of 15 exponent bits in the double-extended format.

From personal design experience with x87 floating-point units I am aware that the straightforward implementation of elementary math functions, without danger of intermediate overflow, motivates at least three additional exponent bits. The use of 15 bits specifically may be an artifact of the hardware design. The 8086 CPU used 16-bit words as a basic building block, so a requirement of 64 mantissa bits in the double-extended format would lead to a format comprising 80 bits (= five words), leaving 15 bits for the exponent.
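
For concreteness, here is a hedged sketch of how those 80 bits decode, assuming the standard x87 encoding (bias 16383 for the 15-bit exponent, and a 64-bit significand whose integer bit is stored explicitly); special cases such as infinities, NaNs, and denormals are ignored:

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* reconstruct the value encoded by the three x87 double-extended fields:
       1 sign bit + 15 exponent bits + 64 significand bits = 80 bits (5 words) */
    static long double decode_x87(int sign, uint16_t biased_exp, uint64_t mantissa)
    {
        long double m = ldexpl((long double)mantissa, -63);   /* place binary point    */
        long double v = ldexpl(m, (int)biased_exp - 16383);   /* apply biased exponent */
        return sign ? -v : v;
    }

    int main(void)
    {
        /* 1.0 is encoded as sign 0, exponent 16383, significand 0x8000000000000000 */
        printf("%Lg\n", decode_x87(0, 16383, 0x8000000000000000ULL));
        return 0;
    }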
