Latency vs throughput in Intel intrinsics


Question

I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).

For example, let's consider:

_mm_cmpestrc

This has a latency of 11 and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get continuous per-cycle output after 11 cycles? Since this would require 11 instructions to be in flight at a time, and since I have a throughput of 7, do I run out of "execution units"?
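For concreteness, a loop of the kind being asked about might look like this (a hypothetical sketch, not code from the question; compile with e.g. gcc -O2 -msse4.2):

    #include <immintrin.h>  /* SSE4.2: _mm_cmpestrc */

    /* Scan 16-byte chunks of buf for any byte that appears in the
       16-byte set `set` (setlen valid bytes).  _mm_cmpestrc returns
       the carry flag of PCMPESTRI: nonzero if any match was found. */
    int contains_any(const char *buf, int n, __m128i set, int setlen)
    {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            if (_mm_cmpestrc(set, setlen, chunk, 16,
                             _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY))
                return 1;
        }
        return 0;
    }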

I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.

Answer

For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel's optimization manual.

See also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details about using instruction-cost numbers.

And What is the efficient way to count set bits at a position or lower? for an example of analyzing short sequences of asm in terms of front-end uops, back-end ports, and latency.

Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.
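For instance (an illustrative sketch, not from the original answer; the port assignments are from Agner Fog's Skylake tables and worth double-checking): _mm_shuffle_epi8 and _mm_unpacklo_epi8 both compile to single uops that can only run on the shuffle unit on port 5, so mixing them limits you to one shuffle per clock in total, while _mm_add_epi32 can run on port 0, 1, or 5 and overlap with either.

    #include <immintrin.h>

    /* Hypothetical demo function: a and b compete for port 5;
       c can execute on port 0 or 1 in parallel with them. */
    __m128i port_demo(__m128i x, __m128i mask, __m128i y, __m128i z)
    {
        __m128i a = _mm_shuffle_epi8(x, mask);  /* PSHUFB: port 5 only */
        __m128i b = _mm_unpacklo_epi8(y, z);    /* PUNPCKLBW: port 5 only */
        __m128i c = _mm_add_epi32(x, y);        /* PADDD: ports 0/1/5 */
        return _mm_add_epi32(_mm_add_epi32(a, b), c);
    }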

You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c lat, one per 1c tput).

Since _mm_add_ps has a latency of 4 cycles on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism. (e.g. manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (compile with -funroll-loops -ffast-math) will often use the same register, but loop overhead wasn't the problem).
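A minimal sketch of that unrolling (my illustration, assuming AVX and an array length that is a multiple of 64 floats; the horizontal-sum cleanup at the end is one common pattern, not the only one):

    #include <immintrin.h>
    #include <stddef.h>

    float sum_array(const float *a, size_t n)
    {
        /* 8 independent accumulators: enough to hide 4-cycle add latency
           with 2 adds started per clock on Skylake (4 * 2 = 8). */
        __m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps(), sum3 = _mm256_setzero_ps();
        __m256 sum4 = _mm256_setzero_ps(), sum5 = _mm256_setzero_ps();
        __m256 sum6 = _mm256_setzero_ps(), sum7 = _mm256_setzero_ps();

        for (size_t i = 0; i < n; i += 64) {  /* 8 vectors of 8 floats */
            sum0 = _mm256_add_ps(sum0, _mm256_loadu_ps(a + i));
            sum1 = _mm256_add_ps(sum1, _mm256_loadu_ps(a + i + 8));
            sum2 = _mm256_add_ps(sum2, _mm256_loadu_ps(a + i + 16));
            sum3 = _mm256_add_ps(sum3, _mm256_loadu_ps(a + i + 24));
            sum4 = _mm256_add_ps(sum4, _mm256_loadu_ps(a + i + 32));
            sum5 = _mm256_add_ps(sum5, _mm256_loadu_ps(a + i + 40));
            sum6 = _mm256_add_ps(sum6, _mm256_loadu_ps(a + i + 48));
            sum7 = _mm256_add_ps(sum7, _mm256_loadu_ps(a + i + 56));
        }

        /* Combine the independent accumulators only at the end. */
        sum0 = _mm256_add_ps(_mm256_add_ps(sum0, sum1),
                             _mm256_add_ps(sum2, sum3));
        sum4 = _mm256_add_ps(_mm256_add_ps(sum4, sum5),
                             _mm256_add_ps(sum6, sum7));
        sum0 = _mm256_add_ps(sum0, sum4);

        /* Horizontal sum of the final vector. */
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(sum0),
                               _mm256_extractf128_ps(sum0, 1));
        lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
        lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 1));
        return _mm_cvtss_f32(lo);
    }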

Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops. (Especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the frontend's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case frontend throughput tests for different loop sizes.) Since load/store instructions use different execution ports than ALU instructions, this can be the bottleneck when data is hot in L1 cache.
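As a rough worked example of using those numbers (the 8-uop count and the 4-per-clock issue width are from above; the loop-overhead estimate is my assumption for illustration):

    pcmpestri                      8 fused-domain uops (Skylake)
    load + pointer increment
      + macro-fused cmp/jcc       ~3 fused-domain uops (assumed)
    total                        ~11 uops per iteration
    front-end issue width          4 uops per clock
    => front-end floor: 11 / 4 ≈ 2.75 cycles per iteration,
       even if no single execution port is saturated.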

And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result. (i.e. destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
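For example (an illustrative sketch of typical code-gen, not output from any particular compiler): computing c = _mm_add_epi32(a, b) while keeping a alive for later use costs an extra copy without AVX, because the two-operand PADDD overwrites its first source:

    ; without AVX: two-operand, destructive destination
    movdqa  xmm2, xmm0          ; extra uop just to preserve a (xmm0)
    paddd   xmm2, xmm1          ; xmm2 = a + b

    ; with AVX: three-operand, non-destructive
    vpaddd  xmm2, xmm0, xmm1    ; xmm2 = a + b, xmm0 unchanged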

I think I have a decent understanding of the difference between latency and throughput

Your guesses don't seem to make sense, so you're definitely missing something.

CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock).

  • (reciprocal) Throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.

  • Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.

If the next iteration of a loop operates independently from the previous, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between two iterations and keep itself busy, bottlenecking only on throughput.
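To make the contrast concrete (a minimal sketch; the 4-cycle figure is the Skylake ADDPS latency quoted above): with a single accumulator, every add must wait for the previous one, so the loop is latency-bound no matter how many ports could run the adds.

    #include <immintrin.h>
    #include <stddef.h>

    /* Latency-bound: sum feeds back into the next add, forming a
       loop-carried dependency chain.  On Skylake this sustains at most
       one vector add per 4 cycles.  Assumes n is a multiple of 8. */
    __m256 sum_one_acc(const float *a, size_t n)
    {
        __m256 sum = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8)
            sum = _mm256_add_ps(sum, _mm256_loadu_ps(a + i));
        return sum;
    }

The multi-accumulator version sketched earlier breaks that chain, letting adds from different iterations be in flight at once, so the loop runs up against the throughput limit instead.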

