带有参数约束的定点功率(pow)功能的有效实现 [英] Efficient implementation of fixed-point power (pow) function with argument constraints

查看:102
本文介绍了带有参数约束的定点功率(pow)功能的有效实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找功能 pow(a,b)的有效实现,其中 a 限于区间(0,1),而 b > = 1 (都是实数,即不一定是整数).

I'm looking for an efficient implementation of the function pow(a, b), where a is limited to the interval (0,1), and b is >= 1 (both are real numbers—i.e., not necessarily integers).

如果有帮助,则 b 并不是一个很高的数字-假设它小于10-20.这将为迭代迭代解决此问题提供可能性,只需进行少量迭代〜= b

If it helps, b is not a high number—let's say it is less than 10-20. That would open up the possibility of solving this problem iteratively, with a small number of iterations ~= b

代码应在32位微控制器上运行,可能没有浮点单元(即使用定点实现).

The code should work on a 32-bit microcontroller, possibly without a floating-point unit (i.e., using a fixed-point implementation).

如何实现针对以下约束进行优化的此类功能?我正在寻找算法本身,因此可以接受伪代码.

How can I implement such a function, optimized for the following constraints? I'm looking for the algorithm itself, so pseudocode is acceptable.

推荐答案

OP指出的问题未指定.需要什么精度,需要什么性能?由于已知 a 是非负数,因此合适的起点是将 b 计算为 exp2(b * log2(a)),以及针对这些功能的定点实现.根据我在嵌入式系统中使用定点算法的经验,当用定点计算代替通用浮点计算时,使用s15.16定点格式通常是可表示范围和精度之间的良好折衷.所以我将在这里采用.

The problem as stated by OP is underspecified. What accuracy is needed, what performance is required? As a is known to be non-negative, a suitable starting point is to compute ab as exp2 (b * log2 (a)), with fixed-point implementations for those functions. Based on my experience with fixed-point arithmetic in embedded systems, using an s15.16 fixed-point format is often a good compromise between representable range and accuracy when generic floating-point computation is replaced with fixed-point computation. So I will adopt that here.

在定点实现 exp2() log2()的主要选择是按位计算,表中的二次插值和多项式逼近.后两者得益于可以产生双倍宽度乘积的硬件乘法器,并且基于表的方法最适合使用几千字节的缓存.OP的微控制器规范表明,用于代码和数据的资源可能会受到限制,并且这是相当低端的硬件.因此,按位方法可能是一个不错的起点,因为它只需要在两个函数之间共享一个很小的表.代码大小也很小,尤其是当编译器针对代码大小进行优化时(大多数编译器为 -Os ).

The principal choices for implementing exp2() and log2() in fixed-point are bitwise computation, quadratic interpolation in a table, and polynomial approximation. The latter two benefit from a hardware multiplier that can produce a double-width product, and table-based approaches work best with a cache of several kilobytes. The OP's specification of a microcontroller suggests that resources for both code and data may be limited, and that this is fairly low-end hardware. So the bit-wise approach may be a good starting point, as it requires only a tiny table shared between the two functions. the code size is small as well, in particular when the compiler is directed to optimize for code size (-Os with most compilers).

这些函数的按位计算也称为伪除法和伪乘法,并且与Henry Briggs(1561 – 1630)计算表对数的方式密切相关.

Bit-wise computations for these functions are also known a pseudo-division and pseudo-multiplication and are closely related to the way by which Henry Briggs (1561 – 1630) computed tables logarithms.

为了计算对数,我们通过反复试验从一系列特殊因素中选择了那些与函数参数相乘时将其归一化的特殊因素.对于选择的每个因子,我们将存储在表中的相应对数值相加.然后,总和对应于函数参数的对数.对于整数部分,我们希望选择2 i ,i = 1、2、4、8 ...,对于派系部分,我们选择1 + 2 -i ,我= 1、2、3、4、5 ...

For the computation of the logarithm, we chose, by trial and error, from among a sequence of particular factors those, that when multiplied with the function argument, normalize it to unity. For every factor chosen, we add up the corresponding logarithm value stored in a table. The sum then corresponds to the logarithm of the function argument. For the integer portion, we would want to chose 2i, i = 1, 2, 4, 8, ... and for the factional part we choose 1+2-i, i = 1, 2, 3, 4, 5 ...

指数算法紧密相关,实际上是对数算法的逆函数.从我们选择的因子的列表对数序列中,我们选择那些从函数参数中减去后最终将其减小为零的对数.从1开始,我们将所有对数在过程中减去的对数相乘.所得乘积与求幂的结果相对应.

The algorithm for the exponential is closely related and quite literally the inverse of the logarithm algorithm. From among the sequence of tabulated logarithms for our chosen factors, we pick those that when subtracted from the function argument ultimately reduce it to zero. Starting with unity, we multiply with all the factors whose logarithms we have subtracted in the process. The resulting product corresponds to the result of the exponentiation.

可以通过将算法应用于函数自变量的倒数来计算大于1的对数(对结果取反),同样,对负函数自变量求幂也需要我们对函数自变量求反,并在结束.因此在这方面,这两种算法也是对称的.

Logarithms of values greater than unity can be computed by applying the algorithm to the reciprocal of the function argument (with negation of the result as appropriate), likewise exponentiation with negative function arguments requires us negate the function argument and compute the reciprocal at the end. So in this aspect the two algorithms are, unsurprisingly, symmetric as well.

以下ISO-C99代码是这两种算法的直接实现.请注意,此代码使用带舍入的定点乘法.增量成本很小,并且确实提高了整个 pow()计算的准确性.

The following ISO-C99 code is a straightforward implementation of these two algorithms. Note that this code uses a fixed-point multiplication with rounding. The incremental cost is minimal and it does enhance the accuracy of the overall pow() computation.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* s15.16 division without rounding */
int32_t fxdiv_s15p16 (int32_t x, int32_t y)
{
    return ((int64_t)x * 65536) / y;
}

/* s15.16 multiplication with rounding */
int32_t fxmul_s15p16 (int32_t x, int32_t y)
{
    int32_t r;
    int64_t t = (int64_t)x * (int64_t)y;
    r = (int32_t)(uint32_t)(((uint64_t)t + (1 << 15)) >> 16);
    return r;
}

/* log2(2**8), ..., log2(2**1), log2(1+2**(-1), ..., log2(1+2**(-16)) */
const uint32_t tab [20] = {0x80000, 0x40000, 0x20000, 0x10000,
                           0x095c1, 0x0526a, 0x02b80, 0x01663,
                           0x00b5d, 0x005b9, 0x002e0, 0x00170, 
                           0x000b8, 0x0005c, 0x0002e, 0x00017, 
                           0x0000b, 0x00006, 0x00003, 0x00001};
const int32_t one_s15p16 = 1 * (1 << 16);
const int32_t neg_fifteen_s15p16 = (-15) * (1 << 16);

int32_t fxlog2_s15p16 (int32_t a)
{
    uint32_t x, y;
    int32_t t, r;

    x = (a > one_s15p16) ? fxdiv_s15p16 (one_s15p16, a) : a;
    y = 0;
    /* process integer bits */
    if ((t = x << 8) < one_s15p16) { x = t; y += tab [0]; }
    if ((t = x << 4) < one_s15p16) { x = t; y += tab [1]; }
    if ((t = x << 2) < one_s15p16) { x = t; y += tab [2]; }
    if ((t = x << 1) < one_s15p16) { x = t; y += tab [3]; }
    /* process fraction bits */
    for (int shift = 1; shift <= 16; shift++) {
        if ((t = x + (x >> shift)) < one_s15p16) { x = t; y += tab[3 + shift]; }
    }
    r = (a > one_s15p16) ? y : (0 - y);
    return r;
}

int32_t fxexp2_s15p16 (int32_t a) 
{
    uint32_t x, y;
    int32_t t, r;

    if (a <= neg_fifteen_s15p16) return 0; // underflow

    x = (a < 0) ? (-a) : (a);
    y = one_s15p16;
    /* process integer bits */
    if ((t = x - tab [0]) >= 0) { x = t; y = y << 8; }
    if ((t = x - tab [1]) >= 0) { x = t; y = y << 4; }
    if ((t = x - tab [2]) >= 0) { x = t; y = y << 2; }
    if ((t = x - tab [3]) >= 0) { x = t; y = y << 1; }
    /* process fractional bits */
    for (int shift = 1; shift <= 16; shift++) {
        if ((t = x - tab [3 + shift]) >= 0) { x = t; y = y + (y >> shift); }
    }
    r = (a < 0) ? fxdiv_s15p16 (one_s15p16, y) : y;
    return r;
}

/* compute a**b for a >= 0 */
int32_t fxpow_s15p16 (int32_t a, int32_t b)
{
    return fxexp2_s15p16 (fxmul_s15p16 (b, fxlog2_s15p16 (a)));
}

double s15p16_to_double (int32_t a)
{
    return a / 65536.0;
}

int32_t double_to_s15p16 (double a)
{
    return ((int32_t)(a * 65536.0 + ((a < 0) ? (-0.5) : 0.5)));
}

int main (void) 
{
    int32_t a = double_to_s15p16 (0.125);
    do {
        int32_t b = double_to_s15p16 (-5);
        do {
            double fa = s15p16_to_double (a);
            double fb = s15p16_to_double (b);
            double reff = pow (fa, fb);
            int32_t res = fxpow_s15p16 (a, b);
            int32_t ref = double_to_s15p16 (reff);
            double fres = s15p16_to_double (res);
            double fref = s15p16_to_double (ref);
            printf ("a=%08x (%15.8e) b=%08x (% 15.8e) res=%08x (% 15.8e) ref=%08x (%15.8e)\n", 
                    a, fa, b, fb, res, fres, ref, fref);
            b += double_to_s15p16 (0.5);
        } while (b <= double_to_s15p16 (6));
        printf ("\n");
        a += double_to_s15p16 (0.125);
    } while (a <= double_to_s15p16 (1.0));
    return EXIT_SUCCESS;
}

对于具有合理数量缓存和快速整数乘积的处理器,我们可以在表中使用二次插值,从而对 log2() exp2()进行非常准确和快速的计算>.为此,我们使用伪浮点表示形式,其中参数 x = 2 i (1+ f ),其中[0,1)中的 f .小数 f 使用32位的完整字长进行归一化.对于对数,我们将落入[0,1)的log2(1+ f )制成表格.对于指数,我们将exp2( f )-1制成表格,该值也属于[0,1).显然,我们需要在表插值的结果上加1.

For processors with a reasonable amount cache and a fast integer multiply, we can use quadratic interpolation in tables for very accurate and fast computation on log2() and exp2(). To do so, we make use of a pseudo-floating-point representation where the argument x = 2i(1+f), with f in [0,1). The fraction f is normalized using the full word size of 32 bits. For the logarithm, we tabulate log2(1+f) which falls into [0, 1). For the exponential we tabulate exp2(f)-1 which also falls into [0,1). Obviously we need to add 1 to the result of the table interpolation.

对于二次插值,我们总共需要三个表条目,通过它们我们可以拟合抛物线,其系数 a b 可以随时计算.函数参数和最接近的节点之间的区别为 dx ,然后我们可以添加a( dx ) 2 + b( dx)插入相应的表项以进行插值.需要三个连续的表格条目来适应抛物线,这也意味着我们需要将主要插值间隔之外的两个附加值制成表格.可以通过在末尾添加一个舍入常数来部分补偿由于使用定点算法进行中间计算而导致的不准确性,这需要通过实验确定.

For a quadratic interpolation we need a total of three table entries through which we fit a parabola, whose coefficients a and b can be computed on the fly. The difference between the function argument and the closest node as dx, we can then add a(dx)2+b(dx) to the corresponding table entry for interpolation. The need for three consecutive table entries to fit the parabola also means we need to tabulate two additional values beyond our primary interpolation interval. The inaccuracy due to the use of fixed-point arithmetic for intermediate computation can partially be compensated by adding a rounding constant at the end, which needs to be determined experimentally.

    /* for i = 0 to 65: log2 (1 + i/64) * 2**31 */
    uint32_t logTab [66] =
    {
        0x00000000, 0x02dcf2d1, 0x05aeb4dd, 0x08759c50, 
        0x0b31fb7d, 0x0de42120, 0x108c588d, 0x132ae9e2, 
        0x15c01a3a, 0x184c2bd0, 0x1acf5e2e, 0x1d49ee4c, 
        0x1fbc16b9, 0x22260fb6, 0x24880f56, 0x26e2499d, 
        0x2934f098, 0x2b803474, 0x2dc4439b, 0x30014ac6, 
        0x32377512, 0x3466ec15, 0x368fd7ee, 0x38b25f5a, 
        0x3acea7c0, 0x3ce4d544, 0x3ef50ad2, 0x40ff6a2e, 
        0x43041403, 0x450327eb, 0x46fcc47a, 0x48f10751, 
        0x4ae00d1d, 0x4cc9f1ab, 0x4eaecfeb, 0x508ec1fa, 
        0x5269e12f, 0x5440461c, 0x5612089a, 0x57df3fd0, 
        0x59a80239, 0x5b6c65aa, 0x5d2c7f59, 0x5ee863e5, 
        0x60a02757, 0x6253dd2c, 0x64039858, 0x65af6b4b, 
        0x675767f5, 0x68fb9fce, 0x6a9c23d6, 0x6c39049b, 
        0x6dd2523d, 0x6f681c73, 0x70fa728c, 0x72896373, 
        0x7414fdb5, 0x759d4f81, 0x772266ad, 0x78a450b8, 
        0x7a231ace, 0x7b9ed1c7, 0x7d17822f, 0x7e8d3846, 
        0x80000000, 0x816fe50b
    };
    
    /* for i = 0 to 129: exp2 ((i/128) - 1) * 2**31 */
    uint32_t expTab [130] =
    {
        0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
        0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
        0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
        0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
        0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
        0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
        0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
        0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
        0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
        0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
        0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
        0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
        0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
        0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
        0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
        0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
        0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
        0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
        0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
        0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
        0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
        0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
        0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
        0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
        0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
        0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
        0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
        0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
        0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
        0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
        0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
        0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
        0x80000000, 0x8163daa0,
    };
    
    uint8_t clz_tab[32] = {
        31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
        23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0};
    
    /* count leading zeros; this is a machine instruction on many architectures */
    int32_t clz (uint32_t a)
    {
        a |= a >> 16;
        a |= a >> 8;
        a |= a >> 4;
        a |= a >> 2;
        a |= a >> 1;
        return clz_tab [0x07c4acdd * a >> 27];
    }
    
    int32_t fxlog2_s15p16 (int32_t arg)
    {
        int32_t lz, f1, f2, dx, a, b, approx;
        uint32_t t, idx, x = arg;
        lz = clz (x);
        /* shift off integer bits and normalize fraction 0 <= f <= 1 */
        t = x << (lz + 1);
        /* index table of values log2 (1 + f) using 6 msbs of fraction */
        idx = (unsigned)t >> (32 - 6);
        /* difference between argument and smallest sampling point */
        dx = t - (idx << (32 - 6));
        /* fit parabola through closest three sampling points; find coeffs a, b */
        f1 = (logTab[idx+1] - logTab[idx]);
        f2 = (logTab[idx+2] - logTab[idx]);
        a = f2 - (f1 << 1);
        b = (f1 << 1) - a;
        /* find function value for argument by computing ((a*dx+b)*dx) */
        approx = (int32_t)((((int64_t)a)*dx) >> (32 - 6)) + b;
        approx = (int32_t)((((int64_t)approx)*dx) >> (32 - 6 + 1));
        /* compute fraction result; add experimentally determined rounding constant */
        approx = logTab[idx] + approx + 0x410d;
        /* combine integer and fractional parts of result */
        approx = ((15 - lz) << 16) + (((uint32_t)approx) >> 15);
        return approx;
    }
    
    int32_t fxexp2_s15p16 (int32_t x)
    {
        int32_t f1, f2, dx, a, b, approx, idx, i, f;
    
        /* extract integer portion; 2**i is realized as a shift at the end */
        i = (x >> 16);
        /* extract fraction f so we can compute 2**(f)-1, 0 <= f < 1 */
        f = x & 0xffff;
        /* index table of values exp2 (f) - 1 using 7 msbs of fraction */
        idx = (uint32_t)f >> (16 - 7);
        /* difference between argument and next smaller sampling point */
        dx = f - (idx << (16 - 7));
        /* fit parabola through closest three sampling point; find coeffs a,b */
        f1 = (expTab[idx+1] - expTab[idx]);
        f2 = (expTab[idx+2] - expTab[idx]);
        a = f2 - (f1 << 1);
        b = (f1 << 1) - a;
        /* find function value for argument by computing ((a*dx+b)*dx) */
        approx = (int32_t)((((int64_t)a)*dx) >> (16 - 7)) + b;
        approx = (int32_t)((((int64_t)approx)*dx) >> (16 - 7 + 1));
        /* combine integer and fractional parts of result; add 1.0 and experimentally determined rounding constant */
        approx = (((expTab[idx] + (unsigned)approx + 0x80000012U) >> (14 - i)) + 1) >> 1;
        /* Handle underflow to 0 */
        approx = ((i < -16) ? 0 : approx);
        return approx;
    }

最后,在数据存储受限但提供非常快速的整数乘法器的平台上,可以使用minimax类型的多项式逼近,即最小化最大误差的多项式逼近.此代码类似于基于表的方法,不同之处在于,通过多项式而不是表来计算[0,1)中f的两个核心近似值log2(1 + f)和exp2(f)-1.为了获得最大的准确性和代码效率,我们希望将系数表示为无符号纯分数,即使用u0.32定点格式.在log2()的情况下,前导系数为〜= 1.44,因此不适合此格式.但是,将这个系数分为1和0.44两部分很容易解决.对于定点计算,通常不需要通过Horner方案进行评估,如

Finally, on platforms for which data storage is limited but which provide a very fast integer multiplier, one can use a polynomial approximation of the minimax kind, i.e. one which minimizes the maximum error. The code for this will be similar to the table-based method, except that the two core approximations, log2(1+f) and exp2(f)-1 for f in [0,1), are computed via polynomials instead of tables. For maximum accuracy and code efficiency, we would want to represent the coefficients as unsigned pure fractions, that is, by using a u0.32 fixed-point format. In the case of log2(), the leading coefficient is ~= 1.44, so will not fit into this format. However, this is easily solved by splitting this coefficient into two parts, 1 and 0.44. For fixed-point computation evaluation via Horner scheme is typically not necessary, and best use of a pipelined multiplier plus an increase in instruction-level parallelism can be achieved with an Estrin-like scheme, as pointed out in

Claude-Pierre Jeannerod,Jingyan Jourdan-Lu,用于VLIW整数处理器的同时浮点正弦和余弦".在2012年7月的 23 rd IEEE国际特定应用系统,体系结构和处理器国际会议中,第69-76页(

Claude-Pierre Jeannerod, Jingyan Jourdan-Lu, "Simultaneous floating-point sine and cosine for VLIW integer processors". In 23rd IEEE International Conference on Application-Specific Systems, Architectures and Processors, July 2012, pp. 69-76 (preprint online)

    /* a single instruction in many 32-bit architectures */
    uint32_t umul32hi (uint32_t a, uint32_t b)
    {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }
    
    uint8_t clz_tab[32] = {
            31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
            23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0};
    
    /* count leading zeros; this is a machine instruction on many architectures */
    int32_t clz (uint32_t a)
    {
        a |= a >> 16;
        a |= a >> 8;
        a |= a >> 4;
        a |= a >> 2;
        a |= a >> 1;
        return clz_tab [0x07c4acdd * a >> 27];
    }
    
    /* compute log2() with s15.16 fixed-point argument and result */
    int32_t fxlog2_s15p16 (int32_t arg)
    {
        const uint32_t a0 = (uint32_t)((1.44269476063 - 1)* (1LL << 32) + 0.5);
        const uint32_t a1 = (uint32_t)(7.2131008654833e-1 * (1LL << 32) + 0.5);
        const uint32_t a2 = (uint32_t)(4.8006370104849e-1 * (1LL << 32) + 0.5);
        const uint32_t a3 = (uint32_t)(3.5339481476694e-1 * (1LL << 32) + 0.5);
        const uint32_t a4 = (uint32_t)(2.5600972794928e-1 * (1LL << 32) + 0.5);
        const uint32_t a5 = (uint32_t)(1.5535182948224e-1 * (1LL << 32) + 0.5);
        const uint32_t a6 = (uint32_t)(6.3607925549150e-2 * (1LL << 32) + 0.5);
        const uint32_t a7 = (uint32_t)(1.2319647939876e-2 * (1LL << 32) + 0.5);
        int32_t lz;
        uint32_t approx, h, m, l, z, y, x = arg;
        lz = clz (x);
        /* shift off integer bits and normalize fraction 0 <= f <= 1 */
        x = x << (lz + 1);
        y = umul32hi (x, x); // f**2
        z = umul32hi (y, y); // f**4
        /* evaluate minimax polynomial to compute log2(1+f) */
        h = a0 - umul32hi (a1, x);
        m = umul32hi (a2 - umul32hi (a3, x), y);
        l = umul32hi (a4 - umul32hi (a5, x) + umul32hi(a6 - umul32hi(a7, x), y), z);
        approx = x + umul32hi (x, h + m + l);
        /* combine integer and fractional parts of result; round result */
        approx = ((15 - lz) << 16) + ((((approx) >> 15) + 1) >> 1);
        return approx;
    }

    /* compute exp2() with s15.16 fixed-point argument and result */
    int32_t fxexp2_s15p16 (int32_t arg)
    {
        const uint32_t a0 = (uint32_t)(6.9314718107e-1 * (1LL << 32) + 0.5);
        const uint32_t a1 = (uint32_t)(2.4022648809e-1 * (1LL << 32) + 0.5);
        const uint32_t a2 = (uint32_t)(5.5504413787e-2 * (1LL << 32) + 0.5);
        const uint32_t a3 = (uint32_t)(9.6162736882e-3 * (1LL << 32) + 0.5);
        const uint32_t a4 = (uint32_t)(1.3386828359e-3 * (1LL << 32) + 0.5);
        const uint32_t a5 = (uint32_t)(1.4629773796e-4 * (1LL << 32) + 0.5);
        const uint32_t a6 = (uint32_t)(2.0663021132e-5 * (1LL << 32) + 0.5);
        int32_t i;
        uint32_t approx, h, m, l, s, q, f, x = arg;
    
        /* extract integer portion; 2**i is realized as a shift at the end */
        i = ((x >> 16) ^ 0x8000) - 0x8000;
        /* extract and normalize fraction f to compute 2**(f)-1, 0 <= f < 1 */
        f = x << 16;
        /* evaluate minimax polynomial to compute exp2(f)-1 */
        s = umul32hi (f, f); // f**2
        q = umul32hi (s, s); // f**4
        h = a0 + umul32hi (a1, f);
        m = umul32hi (a2 + umul32hi (a3, f), s);
        l = umul32hi (a4 + umul32hi (a5, f) + umul32hi (a6, s), q);
        approx = umul32hi (f, h + m + l);
        /* combine integer and fractional parts of result; round result */
        approx = ((approx >> (15 - i)) + (0x80000000 >> (14 - i)) + 1) >> 1;
        /* handle underflow to 0 */
        approx = ((i < -16) ? 0 : approx);
        return approx;
    }

这篇关于带有参数约束的定点功率(pow)功能的有效实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆