Understanding why an ASM fsqrt implementation is faster than the standard sqrt function


Question

I have been playing around with basic math function implementations in C++ for academic purposes. Today, I benchmarked the following code for square root:

inline float sqrt_new(float n)
{
    __asm {
        fld n
            fsqrt
    }
}


I was surprised to see that it is consistently faster than the standard sqrt function (it takes around 85% of the execution time of the standard function).


I don't quite get why and would love to better understand it. Below I show the full code I am using to profile (in Visual Studio 2015, compiling in Release mode and with all optimizations turned on):

#include <iostream>
#include <random>
#include <chrono>
#define M 1000000

float ranfloats[M];

using namespace std;

inline float sqrt_new(float n)
{
    __asm {
        fld n
            fsqrt
    }
}

int main()
{
    default_random_engine randomGenerator(time(0));
    uniform_real_distribution<float> diceroll(0.0f , 1.0f);

chrono::high_resolution_clock::time_point start1, start2;
chrono::high_resolution_clock::time_point end1, end2;
float sqrt1 = 0;
float sqrt2 = 0;


for (int i = 0; i<M; i++) ranfloats[i] = diceroll(randomGenerator);


start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);
end1 = std::chrono::high_resolution_clock::now();

start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);
end2 = std::chrono::high_resolution_clock::now();

auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
auto time2  = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();


cout << "Time elapsed for SQRT1: " << time1 << " milliseconds" << endl;
cout << "Time elapsed for SQRT2: " << time2 << " milliseconds" << endl;

cout << "Average of Time for SQRT2 / Time for SQRT1: " << time2 / time1  << endl;
cout << "Equal to standard sqrt? " << (sqrt1 == sqrt2) << endl;



system("pause");
return 0;
}


I am editing the question to include disassembly codes of both loops that calculate square roots as they came at Visual Studio 2015.

First, the disassembly of for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);:

00091194 0F 5A C0             cvtps2pd    xmm0,xmm0  
00091197 E8 F2 18 00 00       call        __libm_sse2_sqrt_precise (092A8Eh)  
0009119C F2 0F 5A C0          cvtsd2ss    xmm0,xmm0  
000911A0 83 C6 04             add         esi,4  
000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]  
000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0  
000911AF 81 FE 90 5C 46 00    cmp         esi,offset __dyn_tls_dtor_callback (0465C90h)  
000911B5 7C D9                jl          main+190h (091190h) 

Next, the disassembly of for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);:

00091290 F3 0F 10 00          movss       xmm0,dword ptr [eax]  
00091294 F3 0F 11 44 24 6C    movss       dword ptr [esp+6Ch],xmm0  
0009129A D9 44 24 6C          fld         dword ptr [esp+6Ch]  
0009129E D9 FA                fsqrt  
000912A0 D9 5C 24 6C          fstp        dword ptr [esp+6Ch]  
000912A4 F3 0F 10 44 24 6C    movss       xmm0,dword ptr [esp+6Ch]  
000912AA 83 C0 04             add         eax,4  
000912AD F3 0F 58 44 24 54    addss       xmm0,dword ptr [esp+54h]  
000912B3 F3 0F 11 44 24 54    movss       dword ptr [esp+54h],xmm0  
000912B9 ..                   (the remaining bytes of the loop were shown as ?? by the disassembler and are omitted)

Answer


Both your loops come out pretty horrible, with many bottlenecks other than the sqrt function call or the FSQRT instruction. And at least 2x slower than optimal scalar SQRTSS (single-precision) code could do. And that's maybe 8x slower than what a decent SSE2 vectorized loop could achieve. Even without reordering any math operations, you could beat SQRTSS throughput.

Many of the reasons from https://gcc.gnu.org/wiki/DontUseInlineAsm apply to your example. The compiler won't be able to propagate constants through your function, and it won't know that the result is always non-negative (if it isn't NaN). It also won't be able to optimize it into a fabs() if you later square the number.


Also highly important, you defeat auto-vectorization with SSE2 SQRTPS (_mm_sqrt_ps()). A "no-error-checking" scalar sqrt() function using intrinsics also suffers from that problem. IDK if there's any way to get optimal results without /fp:fast, but I doubt it. (Other than writing a whole loop in assembly, or vectorizing the whole loop yourself with intrinsics).


It's pretty impressive that your Haswell CPU manages to run the function-call loop as fast as it does, although the inline-asm loop may not even be saturating FSQRT throughput either.


For some reason, your library function call is calling double sqrt(double), not the C++ overload float sqrt(float). This leads to a conversion to double and back to float. Probably you need to #include <cmath> to get the overloads, or you could call sqrtf(). gcc and clang on Linux call sqrtf() with your current code (without converting to double and back), but maybe their <random> header happens to include <cmath>, and MSVC's doesn't. Or maybe there's something else going on.
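As a small hedged illustration (the function name here is mine, not from the answer): once <cmath> is included, std::sqrt resolves to the float overload and the float -> double -> float round trip goes away:

```cpp
#include <cmath>

// Sketch (illustrative naming): with <cmath> included, std::sqrt(float)
// resolves to the float overload, so no conversion to double and back.
// sqrtf() is an explicit alternative that never needs overload resolution.
float sum_sqrts(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += std::sqrt(a[i]);   // float overload; equivalently sqrtf(a[i])
    return s;
}
```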


The library function-call loop keeps the sum in memory (instead of a register). Apparently the calling convention used by the 32-bit version of __libm_sse2_sqrt_precise doesn't preserve any XMM registers. The Windows x64 ABI does preserve XMM6-XMM15, but wikipedia says this is new and the 32-bit ABI didn't do that. I assume if there were any call-preserved XMM registers, MSVC's optimizer would take advantage of them.


Anyway, besides the throughput bottleneck of calling sqrt on each independent scalar float, the loop-carried dependency on sqrt1 is a latency bottleneck that includes a store-forwarding round trip:

000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]  
000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0  


Out of order execution lets rest of the code for each iteration overlap, so you just bottleneck on throughput, but no matter how efficient the library sqrt function is, this latency bottleneck limits the loop to one iteration per 6 + 3 = 9 cycles. (Haswell ADDSS latency = 3, store-forwarding latency for XMM load/store = 6 cycles. 1 cycle more than store-forwarding for integer registers. See Agner Fog's instruction tables.)


SQRTSD has a throughput of one per 8-14 cycles, so the loop-carried dependency is not the limiting bottleneck on Haswell.

The inline-asm version has a store/reload round trip for the sqrt result, but it's not part of the loop-carried dependency chain. MSVC inline-asm syntax makes it hard to avoid store-forwarding round trips to get data into / out of inline asm. But worse, you produce the result on the x87 stack, and the compiler wants to do SSE math in XMM registers.


And then MSVC shoots itself in the foot for no reason, keeping the sum in memory instead of in an XMM register. It looks inside inline-asm statements to see which registers they affect, so IDK why it doesn't see that your inline-asm statement doesn't clobber any XMM regs.


So MSVC does a much worse job than necessary here:

00091290     movss       xmm0,dword ptr [eax]       # load from the array
00091294     movss       dword ptr [esp+6Ch],xmm0   # store to the stack
0009129A     fld         dword ptr [esp+6Ch]        # x87 load from stack
0009129E     fsqrt  
000912A0     fstp        dword ptr [esp+6Ch]        # x87 store to the stack
000912A4     movss       xmm0,dword ptr [esp+6Ch]   # SSE load from the stack (of sqrt(array[i]))
000912AA     add         eax,4  
000912AD     addss       xmm0,dword ptr [esp+54h]   # SSE load+add of the sum
000912B3     movss       dword ptr [esp+54h],xmm0   # SSE store of the sum


So it has the same loop-carried dependency chain (ADDSS + store-forwarding) as the function-call loop. Haswell FSQRT has one per 8-17 cycle throughput, so probably it's still the bottleneck. (All the stores/reloads involving the array value are independent for each iteration, and out-of-order execution can overlap many iterations to hide that latency chain. However, they will clog up the load/store execution units and sometimes delay the critical-path loads/stores by an extra cycle. This is called a resource conflict.)


Without /fp:fast, the sqrtf() library function has to set errno if the result is NaN. This is why it can't inline to just a SQRTSS.
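A minimal sketch of that side effect (helper name is mine; assumes math_errhandling includes MATH_ERRNO, as it does by default on common implementations):

```cpp
#include <cmath>
#include <cerrno>

// Sketch: sqrt of a negative input returns NaN, and the library may also
// set errno = EDOM. That potential errno write is the observable side
// effect that keeps the call from collapsing to a bare SQRTSS without
// /fp:fast (or -fno-math-errno on gcc/clang).
bool sqrt_of_negative_is_nan() {
    errno = 0;
    volatile float x = -1.0f;   // volatile: keep the call from being constant-folded
    float r = std::sqrt(x);
    return std::isnan(r);       // errno may now be EDOM as well
}
```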


If you did want to implement a no-checks scalar sqrt function yourself, you'd do it with Intel intrinsics syntax:

#include <immintrin.h>

// DON'T USE THIS, it defeats auto-vectorization
static inline
float sqrt_scalar(float x) {
    __m128 xvec = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_sqrt_ss(xvec));
}


This compiles to a near-optimal scalar loop with gcc and clang (without -ffast-math). See it on the Godbolt compiler explorer:

# gcc6.2 -O3  for the sqrt_new loop using _mm_sqrt_ss.  good scalar code, but don't optimize further.
.L10:
    movss   xmm0, DWORD PTR [r12]
    add     r12, 4
    sqrtss  xmm0, xmm0
    addss   xmm1, xmm0
    cmp     r12, rbx
    jne     .L10


This loop should bottleneck only on SQRTSS throughput (one per 7 clocks on Haswell, notably faster than SQRTSD or FSQRT), and with no resource conflicts. However, it's still garbage compared to what you could do even without re-ordering the FP adds (since FP add/mul aren't truly associative): a smart compiler (or programmer using intrinsics) would use SQRTPS to get 4 results with the same throughput as 1 result from SQRTSS. Unpack the vector of SQRT results to 4 scalars, and then you can keep exactly the same order of operations with identical rounding of intermediate results. I'm disappointed that clang and gcc didn't do this.
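A sketch of what that could look like with intrinsics (my code, not from the answer; assumes SSE2 and that n is a multiple of 4):

```cpp
#include <immintrin.h>

// Sketch: SQRTPS computes 4 square roots for roughly the same throughput
// cost as one SQRTSS. Storing the vector and adding the 4 scalars in the
// original order keeps the running sum bit-identical to the scalar loop.
// Assumes n is a multiple of 4; names are illustrative.
float sum_sqrts_vec(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i += 4) {
        __m128 r = _mm_sqrt_ps(_mm_loadu_ps(a + i));  // 4 sqrts at once
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, r);
        s += tmp[0];   // same add order as the scalar loop,
        s += tmp[1];   // so intermediate rounding is identical
        s += tmp[2];
        s += tmp[3];
    }
    return s;
}
```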


However, gcc and clang do manage to actually avoid calling a library function. clang3.9 (with just -O3) uses SQRTSS without even checking for NaN. I assume that's legal, and not a compiler bug. Maybe it sees that the code doesn't use errno?


gcc6.2 on the other hand speculatively inlines sqrtf(), with a SQRTSS and a check on the input to see if it needs to call the library function.

# gcc6.2's sqrt() loop, without -ffast-math.
# speculative inlining of SQRTSS with a check + fallback
# spills/reloads a lot of stuff to memory even when it skips the call :(

# xmm1 = 0.0  (gcc -fverbose-asm says it's holding sqrt2, which is zero-initialized, so I guess gcc decides to reuse that zero)
.L9:
    movss   xmm0, DWORD PTR [rbx]
    sqrtss  xmm5, xmm0
    ucomiss xmm1, xmm0                  # compare input against 0.0
    movss   DWORD PTR [rsp+8], xmm5
    jbe     .L8                         # if(0.0 <= SQRTSS input || unordered(0.0, input)) { skip the function call; }
    movss   DWORD PTR [rsp+12], xmm1    # silly gcc, this store isn't needed.  ucomiss doesn't modify xmm1
    call    sqrtf                       # called for negative inputs, but not for NaN.
    movss   xmm1, DWORD PTR [rsp+12]
.L8:
    movss   xmm4, DWORD PTR [rsp+4]     # silly gcc always stores/reloads both, instead of putting the stores/reloads inside the block that the jbe skips
    addss   xmm4, DWORD PTR [rsp+8]
    add     rbx, 4
    movss   DWORD PTR [rsp+4], xmm4
    cmp     rbp, rbx
    jne     .L9

gcc unfortunately shoots itself in the foot here, the same way MSVC does with inline-asm: there's a store-forwarding round trip as a loop-carried dependency. All the spills/reloads could be inside the block skipped by the JBE. Maybe gcc thinks negative inputs will be common.


Even worse, if you do use /fp:fast or -ffast-math, even a clever compiler like clang doesn't manage to rewrite your _mm_sqrt_ss into a SQRTPS. Clang is normally pretty good at not just mapping intrinsics to instructions 1:1, and will come up with more optimal shuffles and blends if you miss an opportunity to combine things.


So with fast FP math enabled, using _mm_sqrt_ss is a big loss. clang compiles the sqrt() library function call version into RSQRTPS + a newton-raphson iteration.


Also note that your microbenchmark code isn't sensitive to the latency of your sqrt_new() implementation, only the throughput. Latency often matters in real FP code, not just throughput. But in other cases, like doing the same thing independently to many array elements, latency doesn't matter, because out-of-order execution can hide it well enough by having instructions in flight from many loop iterations.
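To make that distinction concrete, here is a hedged sketch (function names are mine): the first loop is throughput-bound like the question's benchmark; the second chains each sqrt on the previous result, so it runs at sqrt *latency* instead:

```cpp
#include <cmath>

// Throughput-bound: each sqrt is independent, so out-of-order execution
// overlaps many iterations (this is what the question's benchmark measures).
float independent_sqrts(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += std::sqrt(a[i]);
    return s;
}

// Latency-bound: each sqrt's input depends on the previous result, so
// iterations cannot overlap. x = sqrt(x + 1) converges to the golden
// ratio (~1.618034), which makes the result easy to sanity-check.
float chained_sqrts(float x, int n) {
    for (int i = 0; i < n; ++i) x = std::sqrt(x + 1.0f);
    return x;
}
```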

As I mentioned earlier, latency from the extra store/reload round trip your data takes on its way into/out of MSVC-style inline asm is a serious problem here. When MSVC inlines the function, the fld n doesn't come directly from the array.


BTW, Skylake has SQRTPS/SS throughput of one per 3 cycles, but still 12 cycle latency. SQRTPD/SD throughput = one per 4-6 cycles, latency = 15-16 cycles. So FP square root is more pipelined on Skylake than on Haswell. This magnifies the difference between benchmarking FP sqrt latency vs. throughput.

