使用softfloat库进行浮点倒数逼近 [英] Floating point reciprocal approximation using softfloat library
问题描述
我正在使用softfloat库( http://www.jhauser.us/arithmetic/SoftFloat .html )来实现单精度除法算法. 我试图了解实现的倒数逼近函数 作为softfloat库的一部分.请参见下面的代码.谁能 解释他们是如何提出LUT的?它看起来像是LUT和NR近似值的组合,但是详细的解释肯定会有所帮助.
I am using the softfloat library (http://www.jhauser.us/arithmetic/SoftFloat.html) to implement a single precision division algorithm. I am trying to understand the reciprocal approximation function implemented as part of the softfloat library. Please see below the code. Could anyone explain how they came up with the LUT? It looks like a combination of LUT and NR approximations, but a detailed explanation would definitely help.
/*
Returns an approximation to the reciprocal of the number represented by `a',
where `a' is interpreted as an unsigned fixed-point number with one integer
bit and 31 fraction bits. The `a' input must be "normalized", meaning that
its most-significant bit (bit 31) must be 1. Thus, if A is the value of
the fixed-point interpretation of `a', then 1 <= A < 2. The returned value
is interpreted as a pure unsigned fraction, having no integer bits and 32
fraction bits. The approximation returned is never greater than the true
reciprocal 1/A, and it differs from the true reciprocal by at most 2.006 ulp
(units in the last place).
*/
uint32_t softfloat_approxRecip32_1( uint32_t a )
{
int index;
uint16_t eps;
static const uint16_t k0s[] = {
0xFFC4, 0xF0BE, 0xE363, 0xD76F, 0xCCAD, 0xC2F0, 0xBA16, 0xB201,
0xAA97, 0xA3C6, 0x9D7A, 0x97A6, 0x923C, 0x8D32, 0x887E, 0x8417
};
static const uint16_t k1s[] = {
0xF0F1, 0xD62C, 0xBFA1, 0xAC77, 0x9C0A, 0x8DDB, 0x8185, 0x76BA,
0x6D3B, 0x64D4, 0x5D5C, 0x56B1, 0x50B6, 0x4B55, 0x4679, 0x4211
};
uint16_t r0;
uint32_t delta0;
uint_fast32_t r;
uint32_t sqrDelta0;
index = a>>27 & 0xF;
eps = (uint16_t) (a>>11);
r0 = k0s[index] - ((k1s[index] * (uint_fast32_t) eps)>>20);
delta0 = ~(uint_fast32_t) ((r0 * (uint_fast64_t) a)>>7);
r = ((uint_fast32_t) r0<<16) + ((r0 * (uint_fast64_t) delta0)>>24);
sqrDelta0 = ((uint_fast64_t) delta0 * delta0)>>32;
r += ((uint32_t) r * (uint_fast64_t) sqrDelta0)>>48;
return r;
}
推荐答案
初始逼近度r0
是通过分段线性逼近计算的,使用从[1,17/16)到[15/16)的十六个间隔, 2),由1.31定点参数的四个最高有效小数位选择.然后使用广义牛顿迭代对倒数r new = r old + r old *(1-a * r old )+ r old *(1-a * r old ) 2 + ... + r old *(1-a * r old ) k [请参见
The initial approximation r0
is computed by piece-wise linear approximation, using sixteen intervals, from [1, 17/16) to [15/16, 2), selected by the four most significant fractional bits of the 1.31 fixed-point argument. The initial estimate is then refined using the generalized Newton iteration for the reciprocal rnew = rold + rold * (1 - a * rold) + rold * (1 - a * rold)2 + ... + rold * (1 - a * rold)k [see paper by Liddicoat and Flynn]. delta0
is (1 - a * r0). The first three terms of the expansion are used: r = r0 + r0 * delta0 + r0 * delta02. This iteration has cubic convergence, tripling the number of correct bits in every iteration. In this implementation, worst case relative error in r0
is about 9.44e-4, while the worst case relative error in the final result r
is about 9.32e-10.
选择定点计算中的比例因子以最大程度地提高中间计算的精度(通过保留尽可能多的位),并使定点方便地落在单词边界上,如delta0
,因此可以省略1.
The scale factors in the fixed-point computation are chosen to maximize the accuracy of intermediate computations (by retaining as many bits as possible), and to have the fixed-point conveniently fall on a word boundary, as in the computation of delta0
in which the 1 therefore can be omitted.
代码要求delta0
为正数,因此r0
必须始终低估数学结果1/a.因此,每个间隔的线性近似不能为minimax近似.取而代之的是,计算每个间隔的端点的函数值1/a之间的斜率,并将按2 16 缩放的绝对值存储在k0s
中,这意味着数组元素固定为0.16点数.从每个间隔的中点的函数值开始,然后应用斜率来查找每个间隔的左端点的截距.同样,此值也按2 16 缩放,并存储在k1s
中,因此该值也包含0.16个定点数.
The code requires delta0
to be a positive quantity, therefore r0
must always be an underestimation of the mathematical result 1/a. The linear approximation for each interval therefore cannot be minimax approximations. Instead the slope between the function values 1/a of the endpoints of each interval is computed, and the absolute value of that scaled by 216 is stored in k0s
, meaning the array elements are 0.16 fixed-point numbers. Starting at the function value for the midpoint of each interval, the slope is then applied to find the intercept for the left endpoint of each interval. This value is likewise scaled by 216 and stored in k1s
which therefore also holds 0.16 fixed-point numbers.
根据我的分析,对于k0s
中的条目,似乎在浮点到定点转换中采用了向0的舍入,而对于k0s
中的项在浮点到定点转换中采用了向正无穷的舍入. k1s
中的条目.以下程序实现了上面概述的算法,并生成与问题代码中所使用的表条目相同的表条目.
Based on my analysis, it seems that rounding towards 0 is employed in the floating-point to fixed-point conversion for entries in k0s
while rounding towards positive infinity is employed in the floating-point to fixed-point conversion for entries in k1s
. The following program implements the algorithm outlined above and produces table entries identical to those used in the code in the question.
#include <stdlib.h>
#include <stdio.h>
int main (void)
{
printf ("interval k0 k1\n");
for (int i = 0; i < 16; i++) {
double x0 = 1.0+i/16.0; // left endpoint of interval
double x1 = 1.0+(i+1)/16.0; // right endpoint of interval
double f0 = 1.0 / x0;
double f1 = 1.0 / x1;
double df = f0 - f1;
double sl = df * 16.0; // slope across interval
double mp = (x0 + x1) / 2.0; // midpoint of interval
double fm = 1.0 / mp;
double ic = fm + df / 2.0; // intercept at start of interval
printf ("%5d %04x %04x\n",
i, (int)(ic * 65536.0 - 0.9999), (int)(sl * 65536.0 + 0.9999));
}
return EXIT_SUCCESS;
}
以上程序的输出应如下:
The output of the above program should be as follows:
interval k0 k1
0 ffc4 f0f1
1 f0be d62c
2 e363 bfa1
3 d76f ac77
4 ccad 9c0a
5 c2f0 8ddb
6 ba16 8185
7 b201 76ba
8 aa97 6d3b
9 a3c6 64d4
10 9d7a 5d5c
11 97a6 56b1
12 923c 50b6
13 8d32 4b55
14 887e 4679
15 8417 4211
这篇关于使用softfloat库进行浮点倒数逼近的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!