Latency vs throughput in Intel intrinsics


Question

I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).

For example, let's consider:

_mm_cmpestrc

This has a latency of 11, and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get a continuous per-cycle output after 11 cycles? Since this would require 11 instructions to be running at a time, and since I have a throughput of 7, do I run out of "execution units"?

I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.

Solution

For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (His Optimizing C++ and Optimizing Assembly guides are also excellent.) See also other links in the tag wiki, especially Intel's optimization manual.

For examples of analyzing short sequences of code, see

Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.

You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic, so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both of its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c latency, one per 1c throughput).

Since _mm_add_ps has a latency of 4 cycles on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism. (For example, manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (compile with -funroll-loops -ffast-math) will often use the same register, but loop overhead wasn't the problem.)
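
As a concrete illustration, here is a minimal sketch of the kind of manual unrolling described above (the function name and the assumption that n is a multiple of 64 floats are just for illustration; compile with AVX enabled, e.g. gcc -O3 -mavx):

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum a float array with 8 independent __m256 accumulators so that
     * 8 vector FP adds can be in flight at once, hiding the 4-cycle
     * ADDPS latency instead of serializing on one accumulator. */
    float sum_array(const float *a, size_t n)   /* assumes n % 64 == 0 */
    {
        __m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps(), sum3 = _mm256_setzero_ps();
        __m256 sum4 = _mm256_setzero_ps(), sum5 = _mm256_setzero_ps();
        __m256 sum6 = _mm256_setzero_ps(), sum7 = _mm256_setzero_ps();

        for (size_t i = 0; i < n; i += 64) {
            sum0 = _mm256_add_ps(sum0, _mm256_loadu_ps(a + i));
            sum1 = _mm256_add_ps(sum1, _mm256_loadu_ps(a + i + 8));
            sum2 = _mm256_add_ps(sum2, _mm256_loadu_ps(a + i + 16));
            sum3 = _mm256_add_ps(sum3, _mm256_loadu_ps(a + i + 24));
            sum4 = _mm256_add_ps(sum4, _mm256_loadu_ps(a + i + 32));
            sum5 = _mm256_add_ps(sum5, _mm256_loadu_ps(a + i + 40));
            sum6 = _mm256_add_ps(sum6, _mm256_loadu_ps(a + i + 48));
            sum7 = _mm256_add_ps(sum7, _mm256_loadu_ps(a + i + 56));
        }

        /* Combine the independent accumulators only at the end. */
        __m256 s = _mm256_add_ps(_mm256_add_ps(sum0, sum1),
                                 _mm256_add_ps(sum2, sum3));
        s = _mm256_add_ps(s, _mm256_add_ps(_mm256_add_ps(sum4, sum5),
                                           _mm256_add_ps(sum6, sum7)));

        /* Horizontal sum of the final 8-element vector. */
        __m128 r = _mm_add_ps(_mm256_castps256_ps128(s),
                              _mm256_extractf128_ps(s, 1));
        r = _mm_add_ps(r, _mm_movehl_ps(r, r));
        r = _mm_add_ss(r, _mm_shuffle_ps(r, r, 1));
        return _mm_cvtss_f32(r);
    }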


Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops (especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the front-end's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case front-end throughput tests for different loop sizes.) Since load/store instructions use different execution ports than ALU instructions, front-end uop throughput can be the bottleneck when data is hot in L1 cache.
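
To connect this back to the _mm_cmpestrc example from the question, here is a hedged sketch of the kind of loop being discussed (the function name, the way the search set is passed, and the assumption that len is a multiple of 16 are made up for illustration; compile with -msse4.2):

    #include <nmmintrin.h>   /* SSE4.2 string intrinsics */
    #include <stddef.h>

    /* Scan a buffer 16 bytes at a time for any byte that appears in `set`
     * (setlen valid bytes).  The pcmpestri this compiles to is itself
     * several uops (8 on Skylake, per the numbers above), so as long as
     * the "no match" branch predicts well, iterations overlap and the
     * loop is bound by throughput and front-end uop bandwidth, not by
     * the instruction's latency. */
    ptrdiff_t find_any(const char *buf, size_t len, __m128i set, int setlen)
    {
        for (size_t i = 0; i < len; i += 16) {   /* assumes len % 16 == 0 */
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            if (_mm_cmpestrc(set, setlen, chunk, 16,
                             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY))
                return (ptrdiff_t)i;             /* match is in this chunk */
        }
        return -1;
    }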

And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result (i.e. a destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
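
As a small hypothetical example (not from the original answer) of where such a register copy is forced:

    #include <emmintrin.h>   /* SSE2 */

    /* Both results need the original `a`.  With SSE encodings the add is
     * destructive (paddd xmm0, xmm1 overwrites xmm0), so the compiler has
     * to emit an extra MOVDQA to keep a copy of `a` alive for the second
     * use.  With AVX's non-destructive 3-operand form (vpaddd xmm2, xmm0,
     * xmm1) no copy is needed. */
    void two_results(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
    {
        *lo = _mm_add_epi32(a, b);   /* consumes a */
        *hi = _mm_sub_epi32(a, b);   /* still needs the original a */
    }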


"I think I have a decent understanding of the difference between latency and throughput"

Your guesses don't seem to make sense, so you're definitely missing something.

CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock).

• (Reciprocal) throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.

• Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.

  If the next iteration of a loop operates independently of the previous one, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between the two iterations and keep itself busy, bottlenecking only on throughput.
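
The two sketches below contrast those cases (hypothetical helper functions, assuming the 4-cycle _mm_add_ps latency on Skylake mentioned above):

    #include <immintrin.h>
    #include <stddef.h>

    /* Latency-bound: each add depends on the previous one (a loop-carried
     * dependency chain), so with a single accumulator this runs at about
     * one _mm_add_ps per 4 cycles, no matter how high the throughput is. */
    __m128 sum_serial(const __m128 *v, size_t n)
    {
        __m128 sum = _mm_setzero_ps();
        for (size_t i = 0; i < n; i++)
            sum = _mm_add_ps(sum, v[i]);   /* next add must wait for this result */
        return sum;
    }

    /* Throughput-bound: each iteration is independent of the previous one,
     * so out-of-order execution overlaps iterations and the adds can start
     * at the full throughput of the execution units (assuming loads hit in
     * L1 cache and the front end keeps up). */
    void add_arrays(const __m128 *a, const __m128 *b, __m128 *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = _mm_add_ps(a[i], b[i]);   /* no dependency on out[i-1] */
    }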



