使用 SSE 最快实现自然指数函数 [英] Fastest Implementation of the Natural Exponential Function Using SSE

查看：33 发布时间：2021/12/20 15:56:04 c optimization vectorization sse simd

本文介绍了使用 SSE 最快实现自然指数函数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在寻找在 SSE 元素上运行的自然指数函数的近似值.即 - __m128 exp( __m128 x ).

I'm looking for an approximation of the natural exponential function operating on SSE element. Namely - __m128 exp( __m128 x ).

我有一个快速但似乎准确性很低的实现:

I have an implementation which is quick but seems to be very low in accuracy:

static inline __m128 FastExpSse(__m128 x)
{
    __m128 a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128  m87 = _mm_set1_ps(-87);
    // fast exponential function, x should be in [-87, 87]
    __m128 mask = _mm_cmpge_ps(x, m87);

    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}

有人能有一个更准确但同样快(或更快)的实现吗?

Could anybody have an implementation with better accuracy yet as fast (Or faster)?

如果它是用 C 风格编写的，我会很高兴.

I'd be happy if it is written in C Style.

谢谢.

推荐答案

下面的 C 代码是我在之前对类似问题的回答.

The C code below is a translation into SSE intrinsics of an algorithm I used in a previous answer to a similar question.

基本思想是将标准指数函数的计算转化为2的幂的计算:expf(x)=exp2f(x/logf(2.0f))=exp2f(x*1.44269504).我们将t = x * 1.44269504拆分成一个整数i和一个分数f，使得t = i + f> 和 0 <= f <= 1.我们现在可以用多项式近似计算 2^f，然后通过将 i 添加到单项的指数字段，将结果缩放 2ⁱ精度浮点结果.


The basic idea is to transform the computation of the standard exponential function into computation of a power of 2: expf (x) = exp2f (x / logf (2.0f)) = exp2f (x * 1.44269504). We split t = x * 1.44269504 into an integer i and a fraction f, such that t = i + f and 0 <= f <= 1. We can now compute 2^f with a polynomial approximation, then scale the result by 2ⁱ by adding i to the exponent field of the single-precision  floating-point result.
SSE 实现存在的一个问题是我们想要计算 i = floorf (t)，但是没有快速的方法来计算 floor()功能.然而，我们观察到，对于正数，floor(x) == trunc(x)，对于负数，floor(x) == trunc(x) - 1code>，除非 x 是负整数.但是，由于核心近似可以处理 1.0f 的 f 值，因此对负参数使用近似是无害的.SSE 提供了将单精度浮点操作数转换为带截断的整数的指令，因此该解决方案是有效的.
One problem that exists with an SSE implementation is that we want to compute i = floorf (t), but there is no fast way to compute the floor() function. However, we observe that for positive numbers, floor(x) == trunc(x), and that for negative numbers, floor(x) == trunc(x) - 1, except when x is a negative integer. However, since the core approximation can handle an f value of 1.0f, using the approximation for negative arguments is harmless. SSE provides an instruction to convert single-precision floating point operands to integers with truncation, so this solution is efficient.
Peter Cordes 指出 SSE4.1 支持快速地板功能 _mm_floor_ps()，因此下面还显示了使用 SSE4.1 的变体.当启用 SSE 4.1 代码生成时，并非所有工具链都会自动预定义宏 __SSE4_1__，但 gcc 会.
Peter Cordes points out that SSE4.1 supports a fast floor function _mm_floor_ps(), so a variant using SSE4.1 is also shown below. Not all toolchains automatically predefine the macro __SSE4_1__ when SSE 4.1 code generation is enabled, but gcc does. 
Compiler Explorer (Godbolt) 显示 gcc 7.2 将以下代码编译为十六条指令，用于普通 SSE 和十二条说明，适用于 SSE 4.1.
Compiler Explorer (Godbolt) shows that gcc 7.2 compiles the code below into sixteen instructions for plain SSE and twelve instructions for SSE 4.1.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <emmintrin.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

/* max. rel. error = 1.72863156e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
    __m128 t, f, e, p, r;
    __m128i i, j;
    __m128 l2e = _mm_set1_ps (1.442695041f);  /* log2(e) */
    __m128 c0  = _mm_set1_ps (0.3371894346f);
    __m128 c1  = _mm_set1_ps (0.657636276f);
    __m128 c2  = _mm_set1_ps (1.00172476f);

    /* exp(x) = 2^i * 2^f; i = floor (log2(e) * x), 0 <= f <= 1 */   
    t = _mm_mul_ps (x, l2e);             /* t = log2(e) * x */
#ifdef __SSE4_1__
    e = _mm_floor_ps (t);                /* floor(t) */
    i = _mm_cvtps_epi32 (e);             /* (int)floor(t) */
#else /* __SSE4_1__*/
    i = _mm_cvttps_epi32 (t);            /* i = (int)t */
    j = _mm_srli_epi32 (_mm_castps_si128 (x), 31); /* signbit(t) */
    i = _mm_sub_epi32 (i, j);            /* (int)t - signbit(t) */
    e = _mm_cvtepi32_ps (i);             /* floor(t) ~= (int)t - signbit(t) */
#endif /* __SSE4_1__*/
    f = _mm_sub_ps (t, e);               /* f = t - floor(t) */
    p = c0;                              /* c0 */
    p = _mm_mul_ps (p, f);               /* c0 * f */
    p = _mm_add_ps (p, c1);              /* c0 * f + c1 */
    p = _mm_mul_ps (p, f);               /* (c0 * f + c1) * f */
    p = _mm_add_ps (p, c2);              /* p = (c0 * f + c1) * f + c2 ~= 2^f */
    j = _mm_slli_epi32 (i, 23);          /* i << 23 */
    r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
    return r;
}

int main (void)
{
    union {
        float f[4];
        unsigned int i[4];
    } arg, res;
    double relerr, maxrelerr = 0.0;
    int i, j;
    __m128 x, y;

    float start[2] = {-0.0f, 0.0f};
    float finish[2] = {-87.33654f, 88.72283f};

    for (i = 0; i < 2; i++) {

        arg.f[0] = start[i];
        arg.i[1] = arg.i[0] + 1;
        arg.i[2] = arg.i[0] + 2;
        arg.i[3] = arg.i[0] + 3;
        do {
            memcpy (&x, &arg, sizeof(x));
            y = fast_exp_sse (x);
            memcpy (&res, &y, sizeof(y));
            for (j = 0; j < 4; j++) {
                double ref = exp ((double)arg.f[j]);
                relerr = fabs ((res.f[j] - ref) / ref);
                if (relerr > maxrelerr) {
                    printf ("arg=% 15.8e  res=%15.8e  ref=%15.8e  err=%15.8e
", 
                            arg.f[j], res.f[j], ref, relerr);
                    maxrelerr = relerr;
                }
            }   
            arg.i[0] += 4;
            arg.i[1] += 4;
            arg.i[2] += 4;
            arg.i[3] += 4;
        } while (fabsf (arg.f[3]) < fabsf (finish[i]));
    }
    printf ("maximum relative errror = %15.8e
", maxrelerr);
    return EXIT_SUCCESS;
}

fast_sse_exp() 的另一种设计以舍入到最近模式提取调整参数 x/log(2) 的整数部分，使用井 -添加魔术"转换常数 1.5 * 2²³ 以强制在正确的位位置舍入，然后再次减去相同数字的已知技术.这就要求加法时生效的SSE舍入模式是舍入到最近或偶数"，这是默认的.wim在评论中指出，有些编译器可能会优化掉转换常量cvt 使用激进优化时作为冗余，干扰此代码序列的功能，因此建议检查生成的机器代码.计算 2^f 的近似区间现在以零为中心，因为 -0.5 <= f <= 0.5，需要不同的核心近似.

An alternative design for fast_sse_exp() extracts the integer portion of the adjusted argument x / log(2) in round-to-nearest mode, using the well-known technique of adding the "magic" conversion constant 1.5 * 2²³ to force rounding in the correct bit position, then subtracting out the same number again. This requires that the SSE rounding mode in effect during the addition is "round to nearest or even", which is the default. wim pointed out in comments that some compilers may optimize out the addition and subtraction of the conversion constant cvt as redundant when aggressive optimization is used, interfering with the functionality of this code sequence, so it is recommended to inspect the machine code generated. The approximation interval for computation of 2^f is now centered around zero, since -0.5 <= f <= 0.5, requiring a different core approximation.
/* max. rel. error <= 1.72860465e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
    __m128 t, f, p, r;
    __m128i i, j;

    const __m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */
    const __m128 cvt = _mm_set1_ps (12582912.0f);  /* 1.5 * (1 << 23) */
    const __m128 c0 =  _mm_set1_ps (0.238428936f);
    const __m128 c1 =  _mm_set1_ps (0.703448006f);
    const __m128 c2 =  _mm_set1_ps (1.000443142f);

    /* exp(x) = 2^i * 2^f; i = rint (log2(e) * x), -0.5 <= f <= 0.5 */
    t = _mm_mul_ps (x, l2e);             /* t = log2(e) * x */
    r = _mm_sub_ps (_mm_add_ps (t, cvt), cvt); /* r = rint (t) */
    f = _mm_sub_ps (t, r);               /* f = t - rint (t) */
    i = _mm_cvtps_epi32 (t);             /* i = (int)t */
    p = c0;                              /* c0 */
    p = _mm_mul_ps (p, f);               /* c0 * f */
    p = _mm_add_ps (p, c1);              /* c0 * f + c1 */
    p = _mm_mul_ps (p, f);               /* (c0 * f + c1) * f */
    p = _mm_add_ps (p, c2);              /* p = (c0 * f + c1) * f + c2 ~= exp2(f) */
    j = _mm_slli_epi32 (i, 23);          /* i << 23 */
    r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
    return r;
}

问题中代码的算法似乎取自 Nicol N. Schraudolph 的作品，它巧妙地利用了 IEEE-754 二进制浮点格式的半对数性质:
The algorithm for the code in the question appears to be taken from the work of Nicol N. Schraudolph, which cleverly exploits the semi-logarithmic nature of IEEE-754 binary floating-point formats:
N.N. 施劳道夫.指数函数的快速、紧凑的近似." 神经计算，11(4)，1999 年 5 月，第 853-862 页.
N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862.
删除参数钳制代码后，它减少到只有三个 SSE 指令.神奇的"校正常数 486411 对于最小化整个输入域的最大相对误差并不是最佳的.基于简单的二分搜索，值 298765 似乎更好，将 FastExpSse() 的最大相对误差降低到 3.56e-2，而最大相对误差为 1.73e-3 表示 fast_exp_sse().
After removal of the argument clamping code, it reduces to just three SSE instructions. The "magical" correction constant 486411 is not optimal for minimizing maximum relative error over the entire input domain. Based on simple binary search, the value 298765 seems to be superior, reducing maximum relative error for FastExpSse() to 3.56e-2 vs. maximum relative error of 1.73e-3 for fast_exp_sse().
/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}

Schraudolph 的算法基本上使用线性逼近 2^f ~= 1.0 + f for f in [0,1]，其精度可以通过添加二次项来改进.Schraudolph 方法的巧妙之处在于计算 2ⁱ * 2^f 而不显式地将整数部分 i = floor(x * 1.44269504) 与分数.我认为无法将该技巧扩展到二次近似，但当然可以将 Schraudolph 的 floor() 计算与上面使用的二次近似结合起来:
Schraudolph's algorithm basically uses the linear approximation 2^f ~= 1.0 + f for f in [0,1], and its accuracy could be improved by adding a quadratic term. The clever part of Schraudolph's approach is computing 2ⁱ * 2^f without explicitly separating the integer portion i = floor(x * 1.44269504) from the fraction. I see no way to extend that trick to a quadratic approximation, but one can certainly combine the floor() computation from Schraudolph with the quadratic approximation used above:
/* max. rel. error <= 1.72886892e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
    __m128 f, p, r;
    __m128i t, j;
    const __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    const __m128i m = _mm_set1_epi32 (0xff800000); /* mask for integer bits */
    const __m128 ttm23 = _mm_set1_ps (1.1920929e-7f); /* exp2(-23) */
    const __m128 c0 = _mm_set1_ps (0.3371894346f);
    const __m128 c1 = _mm_set1_ps (0.657636276f);
    const __m128 c2 = _mm_set1_ps (1.00172476f);

    t = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
    j = _mm_and_si128 (t, m);            /* j = (int)(floor (x/log(2))) << 23 */
    t = _mm_sub_epi32 (t, j);
    f = _mm_mul_ps (ttm23, _mm_cvtepi32_ps (t)); /* f = (x/log(2)) - floor (x/log(2)) */
    p = c0;                              /* c0 */
    p = _mm_mul_ps (p, f);               /* c0 * f */
    p = _mm_add_ps (p, c1);              /* c0 * f + c1 */
    p = _mm_mul_ps (p, f);               /* (c0 * f + c1) * f */
    p = _mm_add_ps (p, c2);              /* p = (c0 * f + c1) * f + c2 ~= 2^f */
    r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
    return r;
}


                        这篇关于使用 SSE 最快实现自然指数函数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

使用 SSE 最快实现自然指数函数 [英] Fastest Implementation of the Natural Exponential Function Using SSE

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用 SSE 最快实现自然指数函数 [英] Fastest Implementation of the Natural Exponential Function Using SSE

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭