# 使用 SSE 最快实现自然指数函数 [英] Fastest Implementation of the Natural Exponential Function Using SSE

### 问题描述

I'm looking for an approximation of the natural exponential function operating on SSE element. Namely - `__m128 exp( __m128 x )`.

I have an implementation which is quick but seems to be very low in accuracy:

``````static inline __m128 FastExpSse(__m128 x)
{
__m128 a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128  m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]

__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
}
``````

Could anybody have an implementation with better accuracy yet as fast (Or faster)?

I'd be happy if it is written in C Style.

### 推荐答案

The C code below is a translation into SSE intrinsics of an algorithm I used in a previous answer to a similar question.

``` The basic idea is to transform the computation of the standard exponential function into computation of a power of 2: expf (x) = exp2f (x / logf (2.0f)) = exp2f (x * 1.44269504). We split t = x * 1.44269504 into an integer i and a fraction f, such that t = i + f and 0 <= f <= 1. We can now compute 2f with a polynomial approximation, then scale the result by 2i by adding i to the exponent field of the single-precision floating-point result. SSE 实现存在的一个问题是我们想要计算 i = floorf (t)，但是没有快速的方法来计算 floor()功能.然而，我们观察到，对于正数，floor(x) == trunc(x)，对于负数，floor(x) == trunc(x) - 1code>，除非 x 是负整数.但是，由于核心近似可以处理 1.0f 的 f 值，因此对负参数使用近似是无害的.SSE 提供了将单精度浮点操作数转换为带截断的整数的指令，因此该解决方案是有效的. One problem that exists with an SSE implementation is that we want to compute i = floorf (t), but there is no fast way to compute the floor() function. However, we observe that for positive numbers, floor(x) == trunc(x), and that for negative numbers, floor(x) == trunc(x) - 1, except when x is a negative integer. However, since the core approximation can handle an f value of 1.0f, using the approximation for negative arguments is harmless. SSE provides an instruction to convert single-precision floating point operands to integers with truncation, so this solution is efficient. Peter Cordes 指出 SSE4.1 支持快速地板功能 _mm_floor_ps()，因此下面还显示了使用 SSE4.1 的变体.当启用 SSE 4.1 代码生成时，并非所有工具链都会自动预定义宏 __SSE4_1__，但 gcc 会. Peter Cordes points out that SSE4.1 supports a fast floor function _mm_floor_ps(), so a variant using SSE4.1 is also shown below. Not all toolchains automatically predefine the macro __SSE4_1__ when SSE 4.1 code generation is enabled, but gcc does. Compiler Explorer (Godbolt) 显示 gcc 7.2 将以下代码编译为十六条指令，用于普通 SSE 和十二条说明，适用于 SSE 4.1. Compiler Explorer (Godbolt) shows that gcc 7.2 compiles the code below into sixteen instructions for plain SSE and twelve instructions for SSE 4.1. #include <stdio.h> #include <stdlib.h> #include <string.h> #include <math.h> #include <emmintrin.h> #ifdef __SSE4_1__ #include <smmintrin.h> #endif /* max. rel. error = 1.72863156e-3 on [-87.33654, 88.72283] */ __m128 fast_exp_sse (__m128 x) { __m128 t, f, e, p, r; __m128i i, j; __m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */ __m128 c0 = _mm_set1_ps (0.3371894346f); __m128 c1 = _mm_set1_ps (0.657636276f); __m128 c2 = _mm_set1_ps (1.00172476f); /* exp(x) = 2^i * 2^f; i = floor (log2(e) * x), 0 <= f <= 1 */ t = _mm_mul_ps (x, l2e); /* t = log2(e) * x */ #ifdef __SSE4_1__ e = _mm_floor_ps (t); /* floor(t) */ i = _mm_cvtps_epi32 (e); /* (int)floor(t) */ #else /* __SSE4_1__*/ i = _mm_cvttps_epi32 (t); /* i = (int)t */ j = _mm_srli_epi32 (_mm_castps_si128 (x), 31); /* signbit(t) */ i = _mm_sub_epi32 (i, j); /* (int)t - signbit(t) */ e = _mm_cvtepi32_ps (i); /* floor(t) ~= (int)t - signbit(t) */ #endif /* __SSE4_1__*/ f = _mm_sub_ps (t, e); /* f = t - floor(t) */ p = c0; /* c0 */ p = _mm_mul_ps (p, f); /* c0 * f */ p = _mm_add_ps (p, c1); /* c0 * f + c1 */ p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */ p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= 2^f */ j = _mm_slli_epi32 (i, 23); /* i << 23 */ r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/ return r; } int main (void) { union { float f[4]; unsigned int i[4]; } arg, res; double relerr, maxrelerr = 0.0; int i, j; __m128 x, y; float start[2] = {-0.0f, 0.0f}; float finish[2] = {-87.33654f, 88.72283f}; for (i = 0; i < 2; i++) { arg.f[0] = start[i]; arg.i[1] = arg.i[0] + 1; arg.i[2] = arg.i[0] + 2; arg.i[3] = arg.i[0] + 3; do { memcpy (&x, &arg, sizeof(x)); y = fast_exp_sse (x); memcpy (&res, &y, sizeof(y)); for (j = 0; j < 4; j++) { double ref = exp ((double)arg.f[j]); relerr = fabs ((res.f[j] - ref) / ref); if (relerr > maxrelerr) { printf ("arg=% 15.8e res=%15.8e ref=%15.8e err=%15.8e ", arg.f[j], res.f[j], ref, relerr); maxrelerr = relerr; } } arg.i[0] += 4; arg.i[1] += 4; arg.i[2] += 4; arg.i[3] += 4; } while (fabsf (arg.f[3]) < fabsf (finish[i])); } printf ("maximum relative errror = %15.8e ", maxrelerr); return EXIT_SUCCESS; } fast_sse_exp() 的另一种设计以舍入到最近模式提取调整参数 x/log(2) 的整数部分，使用井 -添加魔术"转换常数 1.5 * 223 以强制在正确的位位置舍入，然后再次减去相同数字的已知技术.这就要求加法时生效的SSE舍入模式是舍入到最近或偶数"，这是默认的.wim在评论中指出，有些编译器可能会优化掉转换常量cvt 使用激进优化时作为冗余，干扰此代码序列的功能，因此建议检查生成的机器代码.计算 2f 的近似区间现在以零为中心，因为 -0.5 <= f <= 0.5，需要不同的核心近似. An alternative design for fast_sse_exp() extracts the integer portion of the adjusted argument x / log(2) in round-to-nearest mode, using the well-known technique of adding the "magic" conversion constant 1.5 * 223 to force rounding in the correct bit position, then subtracting out the same number again. This requires that the SSE rounding mode in effect during the addition is "round to nearest or even", which is the default. wim pointed out in comments that some compilers may optimize out the addition and subtraction of the conversion constant cvt as redundant when aggressive optimization is used, interfering with the functionality of this code sequence, so it is recommended to inspect the machine code generated. The approximation interval for computation of 2f is now centered around zero, since -0.5 <= f <= 0.5, requiring a different core approximation. /* max. rel. error <= 1.72860465e-3 on [-87.33654, 88.72283] */ __m128 fast_exp_sse (__m128 x) { __m128 t, f, p, r; __m128i i, j; const __m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */ const __m128 cvt = _mm_set1_ps (12582912.0f); /* 1.5 * (1 << 23) */ const __m128 c0 = _mm_set1_ps (0.238428936f); const __m128 c1 = _mm_set1_ps (0.703448006f); const __m128 c2 = _mm_set1_ps (1.000443142f); /* exp(x) = 2^i * 2^f; i = rint (log2(e) * x), -0.5 <= f <= 0.5 */ t = _mm_mul_ps (x, l2e); /* t = log2(e) * x */ r = _mm_sub_ps (_mm_add_ps (t, cvt), cvt); /* r = rint (t) */ f = _mm_sub_ps (t, r); /* f = t - rint (t) */ i = _mm_cvtps_epi32 (t); /* i = (int)t */ p = c0; /* c0 */ p = _mm_mul_ps (p, f); /* c0 * f */ p = _mm_add_ps (p, c1); /* c0 * f + c1 */ p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */ p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= exp2(f) */ j = _mm_slli_epi32 (i, 23); /* i << 23 */ r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/ return r; } 问题中代码的算法似乎取自 Nicol N. Schraudolph 的作品，它巧妙地利用了 IEEE-754 二进制浮点格式的半对数性质: The algorithm for the code in the question appears to be taken from the work of Nicol N. Schraudolph, which cleverly exploits the semi-logarithmic nature of IEEE-754 binary floating-point formats: N.N. 施劳道夫.指数函数的快速、紧凑的近似." 神经计算，11(4)，1999 年 5 月，第 853-862 页. N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862. 删除参数钳制代码后，它减少到只有三个 SSE 指令.神奇的"校正常数 486411 对于最小化整个输入域的最大相对误差并不是最佳的.基于简单的二分搜索，值 298765 似乎更好，将 FastExpSse() 的最大相对误差降低到 3.56e-2，而最大相对误差为 1.73e-3 表示 fast_exp_sse(). After removal of the argument clamping code, it reduces to just three SSE instructions. The "magical" correction constant 486411 is not optimal for minimizing maximum relative error over the entire input domain. Based on simple binary search, the value 298765 seems to be superior, reducing maximum relative error for FastExpSse() to 3.56e-2 vs. maximum relative error of 1.73e-3 for fast_exp_sse(). /* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */ __m128 FastExpSse (__m128 x) { __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */ __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765); __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b); return _mm_castsi128_ps (t); } Schraudolph 的算法基本上使用线性逼近 2f ~= 1.0 + f for f in [0,1]，其精度可以通过添加二次项来改进.Schraudolph 方法的巧妙之处在于计算 2i * 2f 而不显式地将整数部分 i = floor(x * 1.44269504) 与分数.我认为无法将该技巧扩展到二次近似，但当然可以将 Schraudolph 的 floor() 计算与上面使用的二次近似结合起来: Schraudolph's algorithm basically uses the linear approximation 2f ~= 1.0 + f for f in [0,1], and its accuracy could be improved by adding a quadratic term. The clever part of Schraudolph's approach is computing 2i * 2f without explicitly separating the integer portion i = floor(x * 1.44269504) from the fraction. I see no way to extend that trick to a quadratic approximation, but one can certainly combine the floor() computation from Schraudolph with the quadratic approximation used above: /* max. rel. error <= 1.72886892e-3 on [-87.33654, 88.72283] */ __m128 fast_exp_sse (__m128 x) { __m128 f, p, r; __m128i t, j; const __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */ const __m128i m = _mm_set1_epi32 (0xff800000); /* mask for integer bits */ const __m128 ttm23 = _mm_set1_ps (1.1920929e-7f); /* exp2(-23) */ const __m128 c0 = _mm_set1_ps (0.3371894346f); const __m128 c1 = _mm_set1_ps (0.657636276f); const __m128 c2 = _mm_set1_ps (1.00172476f); t = _mm_cvtps_epi32 (_mm_mul_ps (a, x)); j = _mm_and_si128 (t, m); /* j = (int)(floor (x/log(2))) << 23 */ t = _mm_sub_epi32 (t, j); f = _mm_mul_ps (ttm23, _mm_cvtepi32_ps (t)); /* f = (x/log(2)) - floor (x/log(2)) */ p = c0; /* c0 */ p = _mm_mul_ps (p, f); /* c0 * f */ p = _mm_add_ps (p, c1); /* c0 * f + c1 */ p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */ p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= 2^f */ r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/ return r; } 这篇关于使用 SSE 最快实现自然指数函数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！ ```
``` 查看全文 ```
``` ```
``` ```
``` (adsbygoogle = window.adsbygoogle || []).push({}); 相关文章 使用SSE最快实现自然指数函数; 使用 AVX 最快实现指数函数; 使用AVX最快实现指数函数; Python指数函数的源代码?; 指数函数-绘制最小值; 指数函数的计算精度; 指数函数的大O符号; Ç - 写作的指数函数，而无需使用战俘; DEXP或EXP用于fortran中的指数函数？; Z3 可以处理正弦函数和指数函数吗; GNU C++标准库使用哪种算法来计算指数函数？; 在GNU C ++标准库中使用哪种算法计算指数函数?; 转换指数函数和取整的数字在BASH; 使用双精度运算实现快速的SSE低精度指数; 如何使用d3.js轴功能绘制指数函数y = ab ^ x; 使用scipy curve_fit通过两个数据点拟合指数函数; GLSL双精度角度，触发和指数函数解决方法; 可以在不使用exp内置函数的情况下执行指数函数的MATLAB代码; 使用 sse 内在函数对 (A)RGB32 图像进行最快 50% 缩放; 使用 SSE 计算绝对值的最快方法; 使用 rxjs 实现指数退避; 使用rxjs实现指数补偿; R-具有2个参数的指数函数的Optimx-无法在初始参数值处评估函数; 如何使用指数回退实现HttpRequestRetryHandler?; 使用SSE内在函数进行矩阵计算; (adsbygoogle = window.adsbygoogle || []).push({}); ```
``` 其他开发最新文章 拒绝显示一个框架，因为它将'X-Frame-Options'设置为'sameorigin'; 什么是＆QUOT; AW＆QUOT;在部分标志属性是什么意思？; 在运行npm install命令时获取'npm WARN弃用'警告; cmake无法找到openssl; 从Spark的scala中的* .tar.gz压缩文件中读取HDF5文件; Twitter :: Error :: Forbidden - 无法验证您的凭据; 我什么时候需要一个fb：app_id或者fb：admins？; 将.db文件导入R; npm通知创建一个lockfile作为package-lock.json。你应该提交这个文件; 拒绝执行内联脚本，因为它违反了以下内容安全策略指令：“script-src'self'”; 热门教程 Java教程 Apache ANT 教程 Kali Linux教程 JavaScript教程 JavaFx教程 MFC 教程 Apache HTTP客户端教程 Microsoft Visio 教程 热门工具 Java 在线工具 C(GCC) 在线工具 PHP 在线工具 C# 在线工具 Python 在线工具 MySQL 在线工具 VB.NET 在线工具 Lua 在线工具 Oracle 在线工具 C++(GCC) 在线工具 Go 在线工具 Fortran 在线工具 ```
``` var eskeys = '使用,sse,最快,实现,自然,指数函数'; var cat = 'other-dev'; ```
``` 登录 关闭 扫码关注1秒登录 发送“验证码”获取 | 15天全站免登陆 友情链接： IT屋 Chrome插件 谷歌浏览器插件 IT屋 ©2016-2022 琼ICP备2021000895号-1 站点地图 站点标签 SiteMap <免责申明> 本站内容来源互联网,如果侵犯您的权益请联系我们删除. var _hmt = _hmt || []; (function() { var hm = document.createElement("script"); hm.src = "https://hm.baidu.com/hm.js?0c3a090f7b3c4ad458ac1296cb5cc779"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(hm, s); })(); (function () { var bp = document.createElement('script'); var curProtocol = window.location.protocol.split(':')[0]; if (curProtocol === 'https') { bp.src = 'https://zz.bdstatic.com/linksubmit/push.js'; } else { bp.src = 'http://push.zhanzhang.baidu.com/push.js'; } var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(bp, s); })(); ```