互补错误函数的矢量化实现erfcf() [英] Vectorizable implementation of complementary error function erfcf()

查看:275
本文介绍了互补错误函数的矢量化实现erfcf()的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

互补错误函数erfc是一个与标准正态分布密切相关的特殊函数。在统计学和自然科学(如扩散问题)中,经常使用这种分布的尾巴,因此错误函数erf 的使用是不合适的。 / b>
$ b 补充性错误函数在ISO C99标准数学库中可用作函数 erfcf erfc erfcl ;这些随后被采用到ISO C ++中。因此源代码可以很容易地在该库的开源实现中找到,例如在

如何构建单精度函数的有效矢量化实现 erfcf()?以 ulp 衡量的准确度应该与glibc的标量实现大致相同,最大误差为3.12575 ulps(通过穷举测试确定)。由于所有主要的处理器架构(CPU和GPU)在这个时候都提供了融合乘加(FMA)的可用性,虽然处理浮点状态标志和 errno 可以忽略,但是应该根据ISO C的IEEE 754绑定来处理非规范,无穷和NaN。

解决方案

在研究了各种方法之后,似乎最适合的方法是以下文章中提出的算法:

M. Shepherd和JG Laframboise,Chebyshev(1 + 2×em)exp( x 2 )的近似值,在0≤ x <∞。计算数学,第36卷,第153号,1981年1月,第249-253页(
在线复制


本文是创建一个近似于(1 + 2 x )exp( x 2 )erfc( x ) ,我们可以通过简单地除以(1 + 2 x )和erfc(x)来计算erfcx( x X 2 )。函数的紧密有界范围,函数值大致在[1,1.3],而且它的一般平坦性很好地适用于多项式近似。这种方法的数值特性通过缩小近似间隔得到了进一步的改善:原始参数 x q =( x - K)/ ( x + K),其中K是一个适当选择的常数,然后计算 p 是一个多项式。

因为erfc - x = 2 - erfc x ,所以我们只需要考虑区间[0,∞ ]通过这个变换映射到区间[-1,1]。对于IEEE-754单精度,对于 x 10.0546875, erfcf()消失(变为零),所以只需要考虑 x ∈[0,10.0546875)。这个范围内K的最优值是多少?我知道没有数学分析可以提供答案,根据实验我们建议K = 3.75。

可以很容易地确定,对于单精度计算,9次方的最小多项式近似对于在该总体附近的K的各种值是足够的。利用Remez算法系统地产生这样的近似值,其中K在1.5和4之间变化,步长为1 / 16时,K = {2,2.625,3.3125}观察到最低的近似误差,其中K = 2是最有利的选择,因为它适用于非常精确的计算( x K)/( x + K),如此问题



值K = 2和 x 的输入域将暗示有必要使用变体4 q 的近似函数的非常浅的斜率>> -0.5,这会导致参数 q 中的任何错误减少大约十倍。



由于计算<$除了初始近似之外,c $ c> erfc()还需要后处理步骤,很显然,为了达到足够精确的最终结果,这两个计算的准确性必须很高。必须使用纠错技术。一个观察到(1 + 2×em)exp( x )的多项式近似中最显着的系数, erfc( x )的形式是(1 + s),其中 s < 0.5。这意味着我们可以通过分解1来更精确地表示前导系数,并且仅在多项式中使用 s 。因此,不是计算一个多项式p(q),然后乘以倒数r = r / 1 =核心近似为p( q )+ 1,并使用 p 来计算 fma(p,r,r)

通过计算来自倒数的 r 的初始商q可以提高除法的准确性,计算剩余的 e = p +1 - q *(1 + 2 x FMA,然后使用 e 应用修正 q = q +( e ),再次使用FMA。

幂指数具有错误放大特性,因此计算 sup> 必须小心执行。 FMA的可用性允许将 x 2 作为双精度的计算 - float s high < /子>:■<子>低。 e x 是它自己的导数,所以可以计算e s high :s low s high + e high * s low 。这个计算可以和前面中间结果的乘积相结合,从而得到 r = r * e s high + r * e s high * s low 通过使用FMA,可以确保尽可能准确地计算最显着的术语 * e s high 。 b
$ b

结合上面的步骤和一些简单的选择来处理异常情况和负面的参数,可以得到以下C代码:

  float my_expf(float); 
$ b / *
*基于:MM Shepherd和JG Laframboise,Chebyshev在$ 0中的近似
*(1 + 2x)exp(x ^ 2)erfc x = x *第153号,1981年1月,第249-253页。
* /
float my_erfcf(float x)
{
float a,d,e,m,p,q,r,s,t;

a = fabsf(x);

/ *准确计算q =(a-2)/(a + 2)。 [0,10.0546875] - > [-1,0.66818] * /
m = a - 2.0f;
p = a + 2.0f;
r = 1.0f / p;
q = m * r;
t = fmaf(q + 1.0f,-2.0f,a);
e = fmaf(q,-a,t);
q = fmaf(r,e,q);
$ b $ *在[-1,0.66818] * / $ b中,近似(1 + 2 * a)* exp(a * a)* erfc(a) $ bp = -0x1.a48024p-12f; // -4.01020574e-4
p = fmaf(p,q,-0x1.42a172p-10f); // -1.23073824e-3
p = fmaf(p,q,0x1.585784p-10f); // 1.31355994e-3
p = fmaf(p,q,0x1.1ade24p-07f); // 8.63243826e-3
p = fmaf(p,q,-0x1.081b72p-07f); // -8.05991236e-3
p = fmaf(p,q,-0x1.bc0b94p-05f); // -5.42047396e-2
p = fmaf(p,q,0x1.4ffc40p-03f); // 1.64055347e-1
p = fmaf(p,q,-0x1.540840p-03f); // -1.66031361e-1
p = fmaf(p,q,-0x1.7bf612p-04f); // -9.27639678e-2
p = fmaf(p,q,0x1.1ba03ap-02f); (1 + p)除以(1 + 2 * a)==>(2.76978403e-1

) exp(a * a)* erfc(a)* /
t = a + a;
d = t + 1.0f;
r = 1.0f / d;
q = fmaf(p,r,r); // q =(p + 1)/(1 + 2 * a)
e =(p-q)+ fmaf(q,-t,1.0f) //(p + 1)-q *(1 + 2 * a)
r = fmaf(e,r,q);

/ *乘以exp(-a * a)==> erfc(a)* /
s = a * a;
e = my_expf(-s);
t = fmaf(a,-a,s);
r = fmaf(r,e,r * e * t);
$ b $ *如果(!(a <= 0x1.fffffep127f))r = x + x,则将NaN参数处理为erfc()* /
;如果(a> 10.0546875f)r = 0.0f,
$ b $ *对大参数的钳位结果* /
;
$ b $ *如果(x <0.0f)r = 2.0f - r,则将负参数处理为erfc()* /


return r;
}

/ *计算指数基数e。最大ulp错误= 0.87161 * /
float my_expf(float a)
{
float c,f,r;
int i;

// exp(a)= exp(i + f); i = rint(a / log(2))
c = 0x1.800000p + 23f; // 1.25829120e + 7
r = fmaf(0x1.715476p + 0f,a,c) - c; // 1.44269502e + 0
f = fmaf(r,-0x1.62e400p-01f,a); // -6.93145752e-1 // log_2_hi
f = fmaf(r,-0x1.7f7d1cp-20f,f); // -1.42860677e-6 // log_2_lo
i =(int)r;
//近似值r = exp(f)间隔[-log(2)/ 2,+ log(2)/ 2]
r = 0x1.6a98dap-10f; // 1.38319808e-3
r = fmaf(r,f,0x1.1272cap-07f); // 8.37550033e-3
r = fmaf(r,f,0x1.555a20p-05f); // 4.16689515e-2
r = fmaf(r,f,0x1.55542ep-03f); // 1.66664466e-1
r = fmaf(r,f,0x1.fffff6p-02f); // 4.99999851e-1
r = fmaf(r,f,0x1.000000p + 00f); // 1.00000000e + 0
r = fmaf(r,f,0x1.000000p + 00f); // 1.00000000e + 0
// exp(a)= 2 ** i * exp(f);
r = ldexpf(r,i);
//处理特殊情况
if(!(fabsf(a)< 104.0f)){
r = a + a; //处理NaN
if(a <0.0f)r = 0.0f;
if(a> 0.0f)r = 1e38f * 1e38f; // + INF
}
return r;



$ b我使用了自己的 expf()在上面的代码中将我的工作与不同计算平台上的 expf()实现区别开来。但是,最大错误接近0.5 ulp的 expf()的任何实现都应该可以正常工作。如上所示,即使用 my_expf()时, my_erfcf()的最大错误为2.65712 ulps。提供了一个向量化 expf()的可用性,上面的代码应该没有问题的向量化。我用英特尔编译器13.1.3.198做了一个快速检查。我在循环中调用了 my_erfcf(),添加了 #include 通过调用 expf()调用 my_expf(),然后使用这些命令行开关进行编译:

  / Qstd = c99 / O3 / QxCORE-AVX2 / fp:precise / Qfma / Qimf-precision:high:expf / Qvec_report = 2 

英特尔编译器报告说,循环是矢量化的,我通过检查反汇编的二进制代码由于 my_erfcf()只使用倒数而不是完整的分割,所以可以使用快速倒数实现,只要它们提供了几乎正确的舍入结果。对于在硬件中提供快速单精度互逆逼近的处理器,可以通过将其与具有三次收敛的哈雷(Halley)迭代耦合来轻易实现。 x86处理器的这种方法的一个(标量)例子是:

$ $ p code $ / *计算1.0f / a几乎正确舍入。立方体收敛的哈雷迭代* /
float fast_recipf(float a)
{
__m128 t;
float e,r;
t = _mm_set_ss(a);
t = _mm_rcp_ss(t);
_mm_store_ss(& r,t);
e = fmaf(r,-a,1.0f);
e = fmaf(e,e,e);
r = fmaf(e,r,r);
return r;
}


The complementary error function, erfc, is a special functions closely related to the standard normal distribution. It is frequently used in statistics and the natural sciences (e.g. diffusion problems) where the "tails" of this distribution need to be considered, and use of the error function, erf, is therefore not suitable.

The complementary error function was made available in the ISO C99 standard math library as the functions erfcf, erfc, and erfcl; these were subsequently adopted into ISO C++ as well. Thus source code can readily be found in open-source implementations of that library, for example in glibc.

However, many existing implementations are scalar in nature, while modern processor hardware is SIMD-oriented (either explicitly, as in x86 CPU, or implicitly, as in GPUs). For performance reasons, a vectorizable implementation is therefore highly desirable. This means branches need to be avoided, except as part of select assignment. Likewise, extensive use of tables is not indicated, as parallelized lookup is often inefficient.

How would one go about constructing an efficient vectorizable implementation of the single-precision function erfcf()? The accuracy, as measured in ulp, should be roughly the same as glibc's scalar implementation, which has a maximum error of 3.12575 ulps (determined by exhaustive testing). The availability of fused multiply-add (FMA) can be assumed, as all major processor architectures (CPUs and GPUs) offer it at this time. While handling of floating-point status flags and errno can be ignored, denormals, infinities, and NaNs should be handled in accordance with the IEEE 754 bindings for ISO C.

解决方案

After looking into various approaches, the one that seems most suitable is the algorithm proposed in the following paper:

M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of (1 + 2 x) exp(x2) erfc x in 0 ≤ x < ∞." Mathematics of Computation, Volume 36, No. 153, January 1981, pp. 249-253 (online copy)

The basic idea of the paper is to create an approximation to (1 + 2 x) exp(x2) erfc(x), from which we can compute erfcx(x) by simply dividing by (1 + 2 x), and erfc(x) by then multiplying with exp(-x2). The tightly bounded range of the function, with function values roughly in [1, 1.3], and its general "flatness" lend itself well to polynomial approximation. Numerical properties of this approach are further improved by narrowing the approximation interval: the original argument x is transformed by q = (x - K) / (x + K), where K is a suitably chosen constant, followed by computing p (q), where p is a polynomial.

Since erfc -x = 2 - erfc x, we only need to consider the interval [0, ∞] which is mapped to the interval [-1, 1] by this transformation. For IEEE-754 single-precision, erfcf() vanishes (becomes zero) for x > 10.0546875, so one needs to consider only x ∈ [0, 10.0546875). What is the "optimal' value of K for this range? I know of no mathematical analysis that would provide the answer, the paper suggests K = 3.75 based on experiments.

One can readily establish that for single-precision computation, a minimax polynomial approximation of degree 9 is sufficient for various values of K in that general vicinity. Systematically generating such approximations with the Remez algorithm, with K varying between 1.5 and 4 in steps of 1/16, lowest approximation error is observed for K = {2, 2.625, 3.3125}. Of these, K = 2 is the most advantageous choice, since it lends itself to very accurate computation of (x - K) / (x + K), as shown in this question.

The value K = 2 and the input domain for x would suggest that it is necessary to use variant 4 from my answer, however once can demonstrate experimentally that the less expensive variant 5 achieves the same accuracy here, which is likely due to the very shallow slope of the approximated function for q > -0.5, which causes any error in the argument q to be reduced by roughly a factor of ten.

Since computation of erfc() requires post-processing steps in addition to the initial approximation, it is clear that the accuracy of both of these computations must be high in order to achieve a sufficiently accurate final result. Error correcting techniques must be used.

One observes that the most significant coefficient in the polynomial approximation of (1 + 2 x) exp(x2) erfc(x) is of the form (1 + s), where s < 0.5. This means we can represent the leading coefficient more accurately by splitting off 1, and only using s in the polynomial. So instead of computing a polynomial p(q), then multiplying by the reciprocal r = 1 / (1 + 2 x), it is mathematically equivalent but numerically advantageous to compute the core approximation as p(q) + 1, and use p to compute fma (p, r, r).

The accuracy of the division can be enhanced by computing an initial quotient q from the reciprocal r, compute the residual e = p+1 - q * (1 + 2 x) with the help of an FMA, then use e to apply the correction q = q + (e * r), again using an FMA.

Exponentiation has error magnification properties, therefore computation of e-x2 must be performed carefully. The availability of FMA trivially allows the computation of -x2 as a double-float shigh:slow. ex is its own derivative, so one can compute eshigh:slow as eshigh + eshigh * slow. This computation can be combined with the multiplication of the previous intermediate result r to yield r = r * eshigh + r * eshigh * slow. By use of FMA, one ensures that the most significant term r * eshigh is computed as accurately as possible.

Combining the steps above with a few simple selections to handle exceptional cases and negative arguments, one arrives at the following C code:

float my_expf (float);

/*  
 * Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of 
 * (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
 * No. 153, January 1981, pp. 249-253.  
 */  
float my_erfcf (float x)
{
    float a, d, e, m, p, q, r, s, t;

    a = fabsf (x); 

    /* Compute q = (a-2)/(a+2) accurately. [0, 10.0546875] -> [-1, 0.66818] */
    m = a - 2.0f;
    p = a + 2.0f;
    r = 1.0f / p;
    q = m * r;
    t = fmaf (q + 1.0f, -2.0f, a); 
    e = fmaf (q, -a, t); 
    q = fmaf (r, e, q); 

    /* Approximate (1+2*a)*exp(a*a)*erfc(a) as p(q)+1 for q in [-1, 0.66818] */
    p =             -0x1.a48024p-12f;  // -4.01020574e-4
    p = fmaf (p, q, -0x1.42a172p-10f); // -1.23073824e-3
    p = fmaf (p, q,  0x1.585784p-10f); //  1.31355994e-3
    p = fmaf (p, q,  0x1.1ade24p-07f); //  8.63243826e-3
    p = fmaf (p, q, -0x1.081b72p-07f); // -8.05991236e-3
    p = fmaf (p, q, -0x1.bc0b94p-05f); // -5.42047396e-2
    p = fmaf (p, q,  0x1.4ffc40p-03f); //  1.64055347e-1
    p = fmaf (p, q, -0x1.540840p-03f); // -1.66031361e-1
    p = fmaf (p, q, -0x1.7bf612p-04f); // -9.27639678e-2
    p = fmaf (p, q,  0x1.1ba03ap-02f); //  2.76978403e-1

    /* Divide (1+p) by (1+2*a) ==> exp(a*a)*erfc(a) */
    t = a + a;
    d = t + 1.0f;
    r = 1.0f / d;
    q = fmaf (p, r, r); // q = (p+1)/(1+2*a)
    e = (p - q) + fmaf (q, -t, 1.0f); // (p+1) - q*(1+2*a)
    r = fmaf (e, r, q);

    /* Multiply by exp(-a*a) ==> erfc(a) */
    s = a * a; 
    e = my_expf (-s);  
    t = fmaf (a, -a, s);
    r = fmaf (r, e, r * e * t);

    /* Handle NaN arguments to erfc() */
    if (!(a <= 0x1.fffffep127f)) r = x + x;

    /* Clamp result for large arguments */
    if (a > 10.0546875f) r = 0.0f;

    /* Handle negative arguments to erfc() */
    if (x < 0.0f) r = 2.0f - r; 

    return r;
}

/* Compute exponential base e. Maximum ulp error = 0.87161 */
float my_expf (float a)
{
    float c, f, r;
    int i;

    // exp(a) = exp(i + f); i = rint (a / log(2)) 
    c = 0x1.800000p+23f; // 1.25829120e+7
    r = fmaf (0x1.715476p+0f, a, c) - c; // 1.44269502e+0
    f = fmaf (r, -0x1.62e400p-01f, a); // -6.93145752e-1 // log_2_hi 
    f = fmaf (r, -0x1.7f7d1cp-20f, f); // -1.42860677e-6 // log_2_lo
    i = (int)r;
    // approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
    r =             0x1.6a98dap-10f;  // 1.38319808e-3
    r = fmaf (r, f, 0x1.1272cap-07f); // 8.37550033e-3
    r = fmaf (r, f, 0x1.555a20p-05f); // 4.16689515e-2
    r = fmaf (r, f, 0x1.55542ep-03f); // 1.66664466e-1
    r = fmaf (r, f, 0x1.fffff6p-02f); // 4.99999851e-1
    r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
    r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
    // exp(a) = 2**i * exp(f);
    r = ldexpf (r, i);
    // handle special cases
    if (!(fabsf (a) < 104.0f)) {
        r = a + a; // handle NaNs
        if (a < 0.0f) r = 0.0f;
        if (a > 0.0f) r = 1e38f * 1e38f; // + INF
    }
    return r;
}

I used my own implementation of expf() in the above code to isolate my work from differences in the expf() implementations on different compute platforms. But any implementation of expf() whose maximum error is close to 0.5 ulp should work well. As shown above, that is, when using my_expf(), my_erfcf() has a maximum error of 2.65712 ulps. Provided availability of a vectorizable expf(), the code above should vectorize without problem. I did a quick check with the Intel compiler 13.1.3.198. I put a call to my_erfcf() in a loop, added #include <mathimf.h>, replaced the call to my_expf() with a call to expf(), then compiled using these command line switches:

/Qstd=c99 /O3 /QxCORE-AVX2 /fp:precise /Qfma /Qimf-precision:high:expf /Qvec_report=2

The Intel compiler reported that the loop had been vectorized, which I double checked by inspection of the disassembled binary code.

Since my_erfcf() only uses reciprocals rather than full divisions, it is amenable to the use of fast reciprocal implementations, provided they deliver almost correctly-rounded results. For processors that provide a fast single-precision reciprocal approximation in hardware, this can easily be achieved by coupling this with a Halley iteration with cubic convergence. A (scalar) example of this approach for x86 processors is:

/* Compute 1.0f / a almost correctly rounded. Halley iteration with cubic convergence */
float fast_recipf (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}

这篇关于互补错误函数的矢量化实现erfcf()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆