Fast SSE low precision exponential using double precision operations

Question

I am looking for a fast, low precision (~1e-3) SSE exponential function.

I came across a great answer based on the work of Nicol N. Schraudolph: N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp. 853-862.

Now I need a "double precision" version: __m128d FastExpSSE (__m128d x). I don't control the input and output precision, which happens to be double, and the two conversions (double -> float, then float -> double) are eating 50% of the CPU resources.

What changes would be needed?

I naively tried this:

__m128i double_to_uint64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}

__m128d FastExpSseDouble(__m128d x) {

    #define S 52
    #define C (1llu << S) / log(2)

    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);

    auto y = double_to_uint64(_mm_mul_pd(a, x));

    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}

Of course this returns garbage, as I don't know what I'm doing...

The 50% figure is a very rough estimate. It comes from comparing the speedup over std::exp when processing a vector of single precision numbers (great) with the speedup when processing a list of double precision numbers (not so great).

Here is the code I used:

// gives the result in place
void FastExpSseVector(std::vector<double> & v) { //vector with several millions elements

    const auto I = v.size();

    const auto N = (I / 4) * 4;

    for (int n = 0; n < N; n += 4) {

        float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) };

        __m128 x;
        x = _mm_load_ps(a);

        auto r = FastExpSse(x);

        _mm_store_ps(a, r);

        v[n]     = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }

    for (int n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }

}
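As an aside, the staging through a float array above performs the conversions one scalar at a time. A minimal sketch of doing the same conversions with SSE2 intrinsics is shown here; the helper names pack_pd_to_ps and unpack_ps_to_pd are hypothetical, and whether this actually reduces the conversion cost would have to be measured:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: convert four doubles (two __m128d) into one __m128.
inline __m128 pack_pd_to_ps(__m128d lo, __m128d hi)
{
    // _mm_cvtpd_ps writes two floats to the low half and zeros the high half;
    // _mm_movelh_ps then joins the two low halves into one register.
    return _mm_movelh_ps(_mm_cvtpd_ps(lo), _mm_cvtpd_ps(hi));
}

// Hypothetical helper: convert four floats back into two __m128d.
inline void unpack_ps_to_pd(__m128 v, __m128d &lo, __m128d &hi)
{
    lo = _mm_cvtps_pd(v);                    // floats 0 and 1
    hi = _mm_cvtps_pd(_mm_movehl_ps(v, v));  // floats 2 and 3
}
```

Inside the loop, the input would then be loaded with two _mm_loadu_pd calls (at &v[n] and &v[n + 2]), packed, passed to FastExpSse, unpacked, and stored with two _mm_storeu_pd calls.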

And here is what I would do if I had this "double precision" version:

void FastExpSseVectorDouble(std::vector<double> & v) {

    const auto I = v.size();

    const auto N = (I / 2) * 2;

    for (int n = 0; n < N; n += 2) {
        __m128d x;
        x = _mm_load_pd(&v[n]);
        auto r = FastExpSseDouble(x);

        _mm_store_pd(&v[n], r);
    }

    for (int n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}

Recommended answer

Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error; I'm too lazy to do that:

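#include <emmintrin.h> // SSE2 intrinsics
#include <cmath>       // std::log

__m128d fastexp(const __m128d &x)
{
    // x / ln(2) plus an offset that puts the value in [2048, 4096), so that the
    // integer part of (scaled - 2048) sits in the top 11 mantissa bits
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0)) ), _mm_set1_pd(3*1024.0-1.05));

    // shifting left by 11 moves those mantissa bits into the exponent field
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}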

This only gets about 2.5% relative precision; for better precision you may need to add a second term.

Also, for values which overflow or underflow this will produce unspecified values. You can avoid this by clamping the scaled value to a fixed range.
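A minimal sketch of such clamping follows. The bounds 2048.0 and 4095.0 and the name fastexp_clamped are my own choices, not part of the answer: they keep scaled inside [2048, 4096), where the shift trick is valid, and by my reading of the bit layout the lower bound maps to 0 and the upper bound to +infinity. NaN inputs are not handled specially.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cmath>        // std::log

// Same approximation as fastexp above, with scaled clamped before the shift.
inline __m128d fastexp_clamped(__m128d x)
{
    __m128d scaled = _mm_add_pd(
        _mm_mul_pd(x, _mm_set1_pd(1.0 / std::log(2.0))),
        _mm_set1_pd(3 * 1024.0 - 1.05));

    // Assumed bounds: 2048.0 shifts to all-zero bits (0.0), and 4095.0
    // shifts to an all-ones exponent with zero mantissa (+infinity).
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));
    scaled = _mm_min_pd(scaled, _mm_set1_pd(4095.0));

    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
```

If saturating to the largest finite double is preferred over infinity, the upper bound would need to be slightly below 4095.0.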
