What's the difference between a single precision and double precision floating point operation?


Question



      What is the difference between a single precision floating point operation and a double precision floating point operation?

      I'm especially interested in practical terms in relation to video game consoles. For example, does the Nintendo 64 have a 64-bit processor, and if it does, would that mean it was capable of double precision floating point operations? Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and are the double precision capabilities (if they exist) actually used in practice?

      Solution

      Note: the Nintendo 64 does have a 64-bit processor, however:

      Many games took advantage of the chip's 32-bit processing mode as the greater data precision available with 64-bit data types is not typically required by 3D games, as well as the fact that processing 64-bit data uses twice as much RAM, cache, and bandwidth, thereby reducing the overall system performance.
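The memory cost mentioned in the quote can be illustrated with a small sketch. It assumes a hypothetical mesh of 10,000 vertices with three coordinates each and uses Python's `array` module, whose `"f"` and `"d"` type codes map to the C `float` and `double` types (4 and 8 bytes on mainstream platforms). It says nothing about the Nintendo 64 itself; it only shows that the same data stored in a 64-bit type takes twice the space.

```python
from array import array

# Hypothetical mesh size, chosen only for illustration
vertex_count = 10_000
coords = [0.0] * (vertex_count * 3)  # x, y, z per vertex

singles = array("f", coords)  # C float, normally 32-bit IEEE single
doubles = array("d", coords)  # C double, normally 64-bit IEEE double

single_bytes = len(singles) * singles.itemsize
double_bytes = len(doubles) * doubles.itemsize

print("bytes per element:", singles.itemsize, "vs", doubles.itemsize)
print("buffer size:", single_bytes, "vs", double_bytes, "bytes")
```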

      From Webopedia:

      The term double precision is something of a misnomer because the precision is not really double.
      The word double derives from the fact that a double-precision number uses twice as many bits as a regular floating-point number.
      For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.

      The extra bits increase not only the precision but also the range of magnitudes that can be represented.
      The exact amount by which the precision and range of magnitudes are increased depends on what format the program is using to represent floating-point values.
      Most computers use a standard format known as the IEEE floating-point format.
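A short sketch of what the extra bits buy in practice. It assumes Python's `float` is an IEEE double (true on mainstream platforms) and uses the `struct` module to round a value to a 32-bit single and read it back. The helper name `round_trip_single` is made up for this example.

```python
import struct

def round_trip_single(x: float) -> float:
    """Store x as a 32-bit IEEE single and read it back as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

value = 0.1
as_single = round_trip_single(value)

# Precision: the single keeps fewer correct digits of 0.1 than the double
print(f"double: {value:.20f}")
print(f"single: {as_single:.20f}")

# 2**24 + 1 needs 25 significant bits, one more than a single can hold
print(round_trip_single(16777217.0) == 16777217.0)  # False
print(round_trip_single(16777216.0) == 16777216.0)  # True

# Range: 1e39 is larger than the largest finite single (about 3.4e38)
try:
    struct.pack(">f", 1e39)
except OverflowError as exc:
    print("does not fit in a single:", exc)
```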

      From the IEEE standard for floating point arithmetic

      Single Precision

      The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right.

      • The first bit is the sign bit, S,
      • the next eight bits are the exponent bits, 'E', and
      • the final 23 bits are the fraction 'F':

        S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
        0 1      8 9                    31
        

      The value V represented by the word may be determined as follows:

      • If E=255 and F is nonzero, then V=NaN ("Not a number")
      • If E=255 and F is zero and S is 1, then V=-Infinity
      • If E=255 and F is zero and S is 0, then V=Infinity
      • If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
      • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values, more commonly called denormalized or subnormal numbers.
      • If E=0 and F is zero and S is 1, then V=-0
      • If E=0 and F is zero and S is 0, then V=0
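The rules above can be sketched directly as a decoder. This is a minimal illustration rather than a reference implementation: the function name `decode_single` is made up here, and the result is returned as a Python `float` (a double), which can hold every single precision value exactly.

```python
def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern with the S / E / F rules for single precision."""
    s = (bits >> 31) & 0x1       # sign bit
    e = (bits >> 23) & 0xFF      # 8 exponent bits
    f = bits & 0x7FFFFF          # 23 fraction bits
    sign = -1.0 if s else 1.0

    if e == 255:
        # All-ones exponent: NaN if any fraction bit is set, otherwise infinity
        return float("nan") if f else sign * float("inf")
    if e == 0:
        # 0.F with a fixed exponent of -126; F = 0 yields +0 or -0
        return sign * 2.0 ** -126 * (f / 2 ** 23)
    # 1.F with a biased exponent
    return sign * 2.0 ** (e - 127) * (1 + f / 2 ** 23)


def from_pattern(pattern: str) -> int:
    """Turn a spaced bit string such as '0 10000001 1010...' into an int."""
    return int(pattern.replace(" ", ""), 2)


print(decode_single(from_pattern("0 10000001 10100000000000000000000")))
print(decode_single(from_pattern("1 11111111 00000000000000000000000")))
```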

      In particular,

      0 00000000 00000000000000000000000 = 0
      1 00000000 00000000000000000000000 = -0
      
      0 11111111 00000000000000000000000 = Infinity
      1 11111111 00000000000000000000000 = -Infinity
      
      0 11111111 00000100000000000000000 = NaN
      1 11111111 00100010001001010101010 = NaN
      
      0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
      0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
      1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
      
      0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
      0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
      0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                           0.00000000000000000000001 = 
                                           2**(-149)  (Smallest positive value)
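Going the other way, the bit patterns in the table can be reproduced from the values themselves. The sketch below assumes Python's `struct` module and a made-up helper name, `single_fields`; it packs a value as a big-endian 32-bit single and prints the sign, exponent, and fraction fields in the same layout as the table.

```python
import struct

def single_fields(x: float) -> str:
    """Format x as 'S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF' after rounding to single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    text = f"{bits:032b}"
    return f"{text[0]} {text[1:9]} {text[9:]}"

samples = (0.0, -0.0, float("inf"), float("-inf"),
           2.0, 6.5, -6.5, 2.0 ** -126, 2.0 ** -127, 2.0 ** -149)

for value in samples:
    print(f"{single_fields(value)} = {value!r}")
```

NaN is left out of the samples because many different fraction patterns all mean NaN, so the pattern printed for it would not necessarily match the two shown in the table.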
      

      Double Precision

      The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right.

      • The first bit is the sign bit, S,
      • the next eleven bits are the exponent bits, 'E', and
      • the final 52 bits are the fraction 'F':

        S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
        0 1        11 12                                                63
        

      The value V represented by the word may be determined as follows:

      • If E=2047 and F is nonzero, then V=NaN ("Not a number")
      • If E=2047 and F is zero and S is 1, then V=-Infinity
      • If E=2047 and F is zero and S is 0, then V=Infinity
      • If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
      • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F). These are "unnormalized" values, more commonly called denormalized or subnormal numbers.
      • If E=0 and F is zero and S is 1, then V=-0
      • If E=0 and F is zero and S is 0, then V=0
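The double precision rules follow the same shape with wider fields. As a hedged sketch, the hypothetical `decode_double` below applies them to a 64-bit integer, and `double_bits` (also a made-up name) uses `struct` to obtain the bit pattern of a Python `float`, assuming it is an IEEE double, so the two can be checked against each other.

```python
import struct

def decode_double(bits: int) -> float:
    """Decode a 64-bit pattern with the S / E / F rules for double precision."""
    s = (bits >> 63) & 0x1            # sign bit
    e = (bits >> 52) & 0x7FF          # 11 exponent bits
    f = bits & ((1 << 52) - 1)        # 52 fraction bits
    sign = -1.0 if s else 1.0

    if e == 2047:
        # All-ones exponent: NaN if any fraction bit is set, otherwise infinity
        return float("nan") if f else sign * float("inf")
    if e == 0:
        # 0.F with a fixed exponent of -1022; F = 0 yields +0 or -0
        return sign * 2.0 ** -1022 * (f / 2 ** 52)
    # 1.F with a biased exponent
    return sign * 2.0 ** (e - 1023) * (1 + f / 2 ** 52)


def double_bits(x: float) -> int:
    """Return the 64-bit pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


for value in (1.0, -6.5, 0.1, 5e-324):
    bits = double_bits(value)
    print(f"{bits:064b} -> {decode_double(bits)!r}")
```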

      Reference:
      ANSI/IEEE Standard 754-1985,
      Standard for Binary Floating Point Arithmetic.
