How can the rep stosb instruction execute faster than the equivalent loop?


Question

How can the instruction rep stosb execute faster than this code?

    Clear: mov byte [edi],AL       ; Write the value in AL to memory
           inc edi                 ; Bump EDI to next byte in the buffer
           dec ecx                 ; Decrement ECX by one position
           jnz Clear               ; And loop again until ECX is 0

Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?

Answer

In modern CPUs, rep stosb's and rep movsb's microcoded implementation actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.

(Note this only applies to stos and movs, not repe cmpsb or repne scasb. Those are still slow, unfortunately: at best about 2 cycles per byte compared on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.

See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)

rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See Intel's and AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets, and it's best to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).

When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.

See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.

See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)

A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD Bulldozer-family core probably can't even manage one store per clock. (It bottlenecks on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition were a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)

One major feature of so-called Fast String operations (rep movs and rep stos on Intel P6 and SnB-family CPUs) is that they avoid the read-for-ownership (RFO) cache-coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)

IDK how good AMD's implementation is.

(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)

See the optimization resources (esp. Agner Fog's guides) linked from the x86 tag wiki.

Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out-of-order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast-string op, so any reordering is only visible between stores made by the string op, not between it and other stores; i.e. you still don't need sfence after rep movs.

So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.

For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.

Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.

See the code of an optimized memset/memcpy implementation (e.g. Agner Fog's library) for more info on what is done in practice.
