使用softfloat库进行浮点倒数逼近 [英] Floating point reciprocal approximation using softfloat library

查看：176 发布时间：2020/11/8 22:11:46 floating-point

本文介绍了使用softfloat库进行浮点倒数逼近的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用softfloat库( http://www.jhauser.us/arithmetic/SoftFloat .html )来实现单精度除法算法. 我试图了解实现的倒数逼近函数作为softfloat库的一部分.请参见下面的代码.谁能解释他们是如何提出LUT的?它看起来像是LUT和NR近似值的组合，但是详细的解释肯定会有所帮助.

I am using the softfloat library (http://www.jhauser.us/arithmetic/SoftFloat.html) to implement a single precision division algorithm. I am trying to understand the reciprocal approximation function implemented as part of the softfloat library. Please see below the code. Could anyone explain how they came up with the LUT? It looks like a combination of LUT and NR approximations, but a detailed explanation would definitely help.

/*
  Returns an approximation to the reciprocal of the number represented by `a',
  where `a' is interpreted as an unsigned fixed-point number with one integer
  bit and 31 fraction bits.  The `a' input must be "normalized", meaning that
  its most-significant bit (bit 31) must be 1.  Thus, if A is the value of
  the fixed-point interpretation of `a', then 1 <= A < 2.  The returned value
  is interpreted as a pure unsigned fraction, having no integer bits and 32
  fraction bits.  The approximation returned is never greater than the true
  reciprocal 1/A, and it differs from the true reciprocal by at most 2.006 ulp 
  (units in the last place).
*/

uint32_t softfloat_approxRecip32_1( uint32_t a )
{
    int index;
    uint16_t eps;
    static const uint16_t k0s[] = {
      0xFFC4, 0xF0BE, 0xE363, 0xD76F, 0xCCAD, 0xC2F0, 0xBA16, 0xB201,
      0xAA97, 0xA3C6, 0x9D7A, 0x97A6, 0x923C, 0x8D32, 0x887E, 0x8417
    };
    static const uint16_t k1s[] = {
      0xF0F1, 0xD62C, 0xBFA1, 0xAC77, 0x9C0A, 0x8DDB, 0x8185, 0x76BA,
      0x6D3B, 0x64D4, 0x5D5C, 0x56B1, 0x50B6, 0x4B55, 0x4679, 0x4211
    };

    uint16_t r0;
    uint32_t delta0;
    uint_fast32_t r;
    uint32_t sqrDelta0;

    index = a>>27 & 0xF;
    eps = (uint16_t) (a>>11);
    r0 = k0s[index] - ((k1s[index] * (uint_fast32_t) eps)>>20);
    delta0 = ~(uint_fast32_t) ((r0 * (uint_fast64_t) a)>>7);
    r = ((uint_fast32_t) r0<<16) + ((r0 * (uint_fast64_t) delta0)>>24);
    sqrDelta0 = ((uint_fast64_t) delta0 * delta0)>>32;
    r += ((uint32_t) r * (uint_fast64_t) sqrDelta0)>>48;
    return r;

}

推荐答案

初始逼近度r0是通过分段线性逼近计算的，使用从[1，17/16)到[15/16)的十六个间隔， 2)，由1.31定点参数的四个最高有效小数位选择.然后使用广义牛顿迭代对倒数r _new = r _old + r _old *(1-a * r _old)+ r _old *(1-a * r _old)² + ... + r _old *(1-a * r _old)^k [请参见

The initial approximation r0 is computed by piece-wise linear approximation, using sixteen intervals, from [1, 17/16) to [15/16, 2), selected by the four most significant fractional bits of the 1.31 fixed-point argument. The initial estimate is then refined using the generalized Newton iteration for the reciprocal r_new = r_old + r_old * (1 - a * r_old) + r_old * (1 - a * r_old)² + ... + r_old * (1 - a * r_old)^k [see paper by Liddicoat and Flynn]. delta0 is (1 - a * r₀). The first three terms of the expansion are used: r = r₀ + r₀ * delta₀ + r₀ * delta₀². This iteration has cubic convergence, tripling the number of correct bits in every iteration. In this implementation, worst case relative error in r0 is about 9.44e-4, while the worst case relative error in the final result r is about 9.32e-10.

选择定点计算中的比例因子以最大程度地提高中间计算的精度(通过保留尽可能多的位)，并使定点方便地落在单词边界上，如delta0，因此可以省略1.

The scale factors in the fixed-point computation are chosen to maximize the accuracy of intermediate computations (by retaining as many bits as possible), and to have the fixed-point conveniently fall on a word boundary, as in the computation of delta0 in which the 1 therefore can be omitted.

代码要求delta0为正数，因此r0必须始终低估数学结果1/a.因此，每个间隔的线性近似不能为minimax近似.取而代之的是，计算每个间隔的端点的函数值1/a之间的斜率，并将按2 ¹⁶缩放的绝对值存储在k0s中，这意味着数组元素固定为0.16点数.从每个间隔的中点的函数值开始，然后应用斜率来查找每个间隔的左端点的截距.同样，此值也按2 ¹⁶缩放，并存储在k1s中，因此该值也包含0.16个定点数.

The code requires delta0 to be a positive quantity, therefore r0 must always be an underestimation of the mathematical result 1/a. The linear approximation for each interval therefore cannot be minimax approximations. Instead the slope between the function values 1/a of the endpoints of each interval is computed, and the absolute value of that scaled by 2¹⁶ is stored in k0s, meaning the array elements are 0.16 fixed-point numbers. Starting at the function value for the midpoint of each interval, the slope is then applied to find the intercept for the left endpoint of each interval. This value is likewise scaled by 2¹⁶ and stored in k1s which therefore also holds 0.16 fixed-point numbers.

根据我的分析，对于k0s中的条目，似乎在浮点到定点转换中采用了向0的舍入，而对于k0s中的项在浮点到定点转换中采用了向正无穷的舍入. k1s中的条目.以下程序实现了上面概述的算法，并生成与问题代码中所使用的表条目相同的表条目.

Based on my analysis, it seems that rounding towards 0 is employed in the floating-point to fixed-point conversion for entries in k0s while rounding towards positive infinity is employed in the floating-point to fixed-point conversion for entries in k1s. The following program implements the algorithm outlined above and produces table entries identical to those used in the code in the question.

#include <stdlib.h>
#include <stdio.h>

int main (void)
{
    printf ("interval  k0    k1\n");
    for (int i = 0; i < 16; i++) {
        double x0 = 1.0+i/16.0;       // left endpoint of interval
        double x1 = 1.0+(i+1)/16.0;   // right endpoint of interval
        double f0 = 1.0 / x0;
        double f1 = 1.0 / x1;
        double df = f0 - f1;
        double sl = df * 16.0;        // slope across interval
        double mp = (x0 + x1) / 2.0;  // midpoint of interval
        double fm = 1.0 / mp;
        double ic = fm + df / 2.0;    // intercept at start of interval

        printf ("%5d     %04x  %04x\n",
                i, (int)(ic * 65536.0 - 0.9999), (int)(sl * 65536.0 + 0.9999));
    }
    return EXIT_SUCCESS;
}

以上程序的输出应如下:

The output of the above program should be as follows:

interval  k0    k1
    0     ffc4  f0f1
    1     f0be  d62c
    2     e363  bfa1
    3     d76f  ac77
    4     ccad  9c0a
    5     c2f0  8ddb
    6     ba16  8185
    7     b201  76ba
    8     aa97  6d3b
    9     a3c6  64d4
   10     9d7a  5d5c
   11     97a6  56b1
   12     923c  50b6
   13     8d32  4b55
   14     887e  4679
   15     8417  4211

这篇关于使用softfloat库进行浮点倒数逼近的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

使用softfloat库进行浮点倒数逼近 [英] Floating point reciprocal approximation using softfloat library

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用softfloat库进行浮点倒数逼近 [英] Floating point reciprocal approximation using softfloat library

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭