What setup does REP do?


Question

Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement":

The performance characteristics of using REP string can be attributed to two components: startup overhead and data transfer throughput.

[...]

For REP string of larger granularity data transfer, as ECX value increases, the startup overhead of REP String exhibit step-wise increase:

  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
  • Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. REP string latency will vary if one of the 16-byte data transfers spans across a cache line boundary:

  • Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles,
  • Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.

Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.

(emphasis mine)

There is no further mention of such startup cost. What is it? What does it do, and why does it always take more time?

Answer

Note that only rep movs and rep stos are fast. repe/ne cmps and scas on current CPUs only loop 1 element at a time. (https://agner.org/optimize/ has some perf numbers, like 2 cycles per RCX count for repe cmpsb). They still have some microcode startup overhead, though.

The rep movs microcode has several strategies to choose from. If the src and dest don't overlap closely, the microcoded loop can transfer in larger 64-bit chunks. (This is the so-called "fast strings" feature introduced with P6 and occasionally re-tuned for later CPUs that support wider loads/stores.) But if dest is only one byte from src, rep movs has to produce the exact same result you'd get from that many separate movs instructions.

So the microcode has to check for overlap, and probably for alignment (of src and dest separately, or relative alignment). It probably also chooses something based on small/medium/large counter values.
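As an illustration, the kind of overlap check the microcode must make can be sketched in C. This is an assumption about the rule, not Intel's actual microcode logic (which isn't public), and the function name is mine: a forward copy in wide chunks only matches byte-by-byte movsb semantics when the destination doesn't land strictly inside the span a wide chunk is about to read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch (not the real microcode rule): a forward copy moving `width`
 * bytes at a time gives the same result as byte-by-byte movsb only if
 * dst does not fall strictly inside [src, src + width); otherwise a
 * wide load would read bytes an earlier narrow store should already
 * have overwritten. */
bool wide_forward_copy_is_safe(const void *dst, const void *src,
                               size_t width)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    return !(d > s && d - s < width);
}
```

Note that dst exactly `width` bytes past src is still safe: both the byte copy and the chunked copy then propagate the same repeating pattern.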

According to Andy Glew's comments on an answer to Why are complicated memcpy/memset superior?, conditional branches in microcode aren't subject to branch-prediction. So there's a significant penalty in startup cycles if the default not-taken path isn't the one actually taken, even for a loop that uses the same rep movs with the same alignment and size.

He supervised the initial rep string implementation in P6, so he should know. :)

REP MOVS uses a cache protocol feature that is not available to regular code. Basically like SSE streaming stores, but in a manner that is compatible with normal memory ordering rules, etc. // The "large overhead for choosing and setting up the right method" is mainly due to the lack of microcode branch prediction. I have long wished that I had implemented REP MOVS using a hardware state machine rather than microcode, which could have completely eliminated the overhead.

By the way, I have long said that one of the things that hardware can do better/faster than software is complex multiway branches.

Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996, which I supervised. The P6 fast strings took REP MOVSB and larger, and implemented them with 64-bit microcode loads and stores and a no-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in IvB.

The big weakness of doing fast strings in microcode was (a) microcode branch mispredictions, and (b) the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In retrospect, I should have written a self-tuning infrastructure, to get reasonably good microcode on every generation. But that would not have helped use new, wider, loads and stores, when they became available. // The Linux kernel seems to have such an autotuning infrastructure, that is run on boot. // Overall, however, I advocate hardware state machines that can smoothly transition between modes, without incurring branch mispredictions. // It is debatable whether good microcode branch prediction would obviate this.

Based on this, my best guess at a specific answer is: the fast-path through the microcode (as many branches as possible actually take the default not-taken path) is the 15-cycle startup case, for intermediate lengths.
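Reading the quoted figures literally makes the step-wise startup behaviour concrete. The toy model below is my own arithmetic for REP MOVSD only (it ignores the cache-split case and REP MOVSB, and real numbers vary by generation):

```c
/* Toy latency model for REP MOVSD built from the quoted figures:
 *   short  (ECX <= 12):  ~20 cycles flat
 *   medium (13..75):     ~15 cycles + 1 cycle per dword iteration
 *   fast   (ECX >= 76):  ~40-cycle startup + 4 cycles per 64 bytes
 *                        (split-free case)
 * Illustrative only, not a real CPU model. */
unsigned rep_movsd_model_cycles(unsigned ecx)
{
    if (ecx <= 12)
        return 20;
    if (ecx < 76)
        return 15 + ecx;
    unsigned bytes = ecx * 4;               /* dword elements */
    return 40 + 4 * ((bytes + 63) / 64);    /* round up to 64B lines */
}
```

At the ECX = 76 boundary the fast path (~60 cycles for 304 bytes) already beats extrapolating the medium path (~91 cycles), which is consistent with the manual switching strategies there.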

Since Intel doesn't publish the full details, black-box measurements of cycle counts for various sizes and alignments are the best we can do. Fortunately, that's all we need to make good choices. Intel's manual, and http://agner.org/optimize/, have good info on how to use rep movs.

Fun fact: without ERMSB (new in IvB): rep movsb is optimized for small-ish copies. It takes longer to start up than rep movsd or rep movsq for large (more than a couple hundred bytes, I think) copies, and even after that may not achieve the same throughput.

The optimal sequence for large aligned copies without ERMSB and without SSE/AVX (e.g. in kernel code) may be rep movsq and then clean-up with something like an unaligned mov that copies the last 8 bytes of the buffer, possibly overlapping with the last aligned chunk of what rep movsq did. (basically use glibc's small-copy memcpy strategy). But if the size might be smaller than 8 bytes, you need to branch unless it's safe to copy more bytes than needed. Or rep movsb is an option for cleanup if small code-size matters more than performance. (rep will copy 0 bytes if RCX = 0).
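The overlapping-tail trick can be sketched in portable C, with memcpy of whole 8-byte chunks standing in for rep movsq (the helper name is mine, and it assumes non-overlapping buffers, like memcpy):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n >= 8 bytes: whole 8-byte chunks first (what rep movsq would
 * do with RCX = n/8), then one unaligned 8-byte move that ends exactly
 * at dst + n, possibly overlapping the last full chunk.  No branching
 * on n % 8 is needed. */
void copy_overlap_tail(void *dst, const void *src, size_t n)
{
    assert(n >= 8);
    memcpy(dst, src, (n / 8) * 8);            /* the rep movsq part */
    uint64_t tail;                            /* final, maybe-overlapping 8B */
    memcpy(&tail, (const char *)src + n - 8, sizeof tail);
    memcpy((char *)dst + n - 8, &tail, sizeof tail);
}
```

For n = 27, say, this moves 24 bytes in qwords and then bytes 19..26 in one extra move, re-copying bytes 19..23 harmlessly.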

A SIMD vector loop is often at least slightly faster than rep movsb even on CPUs with Enhanced Rep Move/Stos B. Especially if alignment isn't guaranteed. (Enhanced REP MOVSB for memcpy, and see also Intel's optimization manual. Links in the x86 tag wiki)

Further details: I think there's some discussion somewhere on SO about testing how rep movsb affects out-of-order exec of surrounding instructions, how soon uops from later instructions can get into the pipeline. I think we found some info in an Intel patent that shed some light on the mechanism.

Microcode can use a kind of predicated load and store uop that lets it issue a bunch of uops without initially knowing the value of RCX. If it turns out RCX was a small value, some of those uops choose not to do anything.

I've done some testing of rep movsb on Skylake. It seems consistent with that initial-burst mechanism: below a certain threshold of size like 96 bytes or something, IIRC performance was nearly constant for any size. (With small aligned buffers hot in L1d cache). I had rep movs in a loop with an independent imul dependency chain, testing that it can overlap execution.

But then there was a significant dropoff beyond that size, presumably when the microcode sequencer finds out that it needs to emit more copy uops. So I think when the rep movsb microcoded-uop reaches the front of the IDQ, it gets the microcode sequencer to emit enough load + store uops for some fixed size, and a check to see if that was sufficient or if more are needed.

This is all from memory, I didn't re-test while updating this answer. If this doesn't match reality for anyone else, let me know and I'll check again.

