What setup does REP do?


Problem Description

Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement":

The performance characteristics of using REP string can be attributed to two components: startup overhead and data transfer throughput.

[...]

For REP string of larger granularity data transfer, as ECX value increases, the startup overhead of REP String exhibit step-wise increase:

  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles,
  • Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:

  • Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles,
  • Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.

Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.

(emphasis mine)

There is no further mention of this startup cost. What is it? What does it do, and why does it always take more time?

Recommended Answer

The rep movs microcode has several strategies to choose from. If the src and dest don't overlap closely, the microcoded loop can transfer in 64b chunks (or even larger on IvB and later with ERMSB). But if dest is only one byte from src, rep movs has to produce the exact same result you'd get from that many separate movs instructions.
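
To make the overlap point concrete, here is a minimal sketch (GNU-style inline asm, assuming an x86-64 GCC/Clang toolchain; the helper name is just for the example) that drives REP MOVSB directly. With dest one byte past src, the byte-by-byte semantics smear the first byte across the whole range, exactly what a chain of individual MOVSB instructions would produce:

    #include <stddef.h>
    #include <stdio.h>

    /* Copy n bytes forward with REP MOVSB (DF=0 is guaranteed by the ABI).
       RDI = dest, RSI = src, RCX = count. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

    int main(void)
    {
        char buf[16] = "Xabcdefg";
        /* dest overlaps src by one byte: the result must match 7 separate
           byte moves, so buf[0] gets replicated across the whole range. */
        copy_rep_movsb(buf + 1, buf, 7);
        printf("%s\n", buf);   /* prints XXXXXXXX */
        return 0;
    }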

So the microcode has to check for overlap, and probably for alignment (of src and dest separately, or relative alignment). It probably also chooses something based on small/medium/large counter values.
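
Purely as an illustration of that kind of dispatch (the real microcode checks, thresholds, and strategy names are not published; apart from the ECX <= 12 and ECX >= 76 figures quoted above, everything here is invented), a hypothetical chooser might look roughly like this:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch only -- not Intel's microcode. */
    enum strategy { ELEMENT_LOOP, WIDE_16B_CHUNKS };

    static enum strategy choose_strategy(uintptr_t dst, uintptr_t src,
                                          size_t count, size_t elem_size)
    {
        size_t bytes = count * elem_size;

        /* Overlap: if dest lands inside the source window, only an
           element-by-element copy gives the architecturally required result. */
        if (dst > src && dst - src < bytes)
            return ELEMENT_LOOP;

        if (count <= 12)        /* "short string" region */
            return ELEMENT_LOOP;

        if (count >= 76)        /* "fast string" region: move 16-byte chunks;
                                   src/dst alignment decides the sub-path */
            return WIDE_16B_CHUNKS;

        return ELEMENT_LOOP;    /* intermediate lengths */
    }

    int main(void)
    {
        printf("%d %d\n",
               (int)choose_strategy(0x1001, 0x1000, 100, 1),   /* overlapping copy */
               (int)choose_strategy(0x2000, 0x1000, 100, 1));  /* long, no overlap */
        return 0;
    }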

According to Andy Glew's comments on an answer to "Why are complicated memcpy/memset superior?" (http://stackoverflow.com/questions/8858778/why-are-complicated-memcpy-memset-superior), conditional branches in microcode aren't subject to branch prediction. So there's a significant penalty in startup cycles if the default not-taken path isn't the one actually taken, even for a loop that uses the same rep movs with the same alignment and size.

He supervised the initial rep string implementation in P6, so he should know. :)

REP MOVS uses a cache protocol feature that is not available to regular code. Basically like SSE streaming stores, but in a manner that is compatible with normal memory ordering rules, etc. // The "large overhead for choosing and setting up the right method" is mainly due to the lack of microcode branch prediction. I have long wished that I had implemented REP MOVS using a hardware state machine rather than microcode, which could have completely eliminated the overhead.

By the way, I have long said that one of the things that hardware can do better/faster than software is complex multiway branches.

Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996, which I supervised. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a no-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in iVB.

The big weakness of doing fast strings in microcode was (a) microcode branch mispredictions, and (b) the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In retrospect, I should have written a self-tuning infrastructure, to get reasonably good microcode on every generation. But that would not have helped use new, wider, loads and stores, when they became available. // The Linux kernel seems to have such an autotuning infrastructure, that is run on boot. // Overall, however, I advocate hardware state machines that can smoothly transition between modes, without incurring branch mispredictions. // It is debatable whether good microcode branch prediction would obviate this.

Based on this, my best guess at a specific answer is: the fast-path through the microcode (as many branches as possible actually take the default not-taken path) is the 15-cycle startup case, for intermediate lengths.
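
To see what those figures imply, here is the manual's model restated as a rough back-of-the-envelope calculation (just the numbers quoted above, not measurements; real costs differ across microarchitectures, and the helper names are mine):

    #include <stddef.h>

    /* Intermediate lengths: ~15 cycle startup + ~1 cycle per element moved. */
    static double est_cycles_intermediate(size_t elements)
    {
        return 15.0 + (double)elements;
    }

    /* Fast strings (ECX >= 76): ~40 cycle startup + 4 cycles per 64 bytes when
       no 16-byte chunk crosses a cache line, ~35 + 6 per 64 bytes when it does. */
    static double est_cycles_fast(size_t bytes, int cache_split)
    {
        return cache_split ? 35.0 + 6.0 * (double)bytes / 64.0
                           : 40.0 + 4.0 * (double)bytes / 64.0;
    }

Under this model a split-free 1024-byte fast-string copy works out to roughly 40 + 4 * 16 = 104 cycles, and any mispredicted branch inside the microcode shows up as extra cycles in the startup term.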

Since Intel doesn't publish the full details, black-box measurements of cycle counts for various sizes and alignments are the best we can do. Fortunately, that's all we need to make good choices. Intel's manual, and http://agner.org/optimize/, have good info on how to use rep movs.
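
As a starting point for such black-box measurements, a sketch along these lines could work (assumptions: x86-64 with GCC or Clang and GNU inline asm; RDTSC counts reference ticks rather than core cycles, so for careful numbers you would pin the frequency or use core-cycle performance counters, and vary source/destination alignment deliberately; the helper name is mine):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <x86intrin.h>      /* __rdtsc, __rdtscp */

    /* Time REP MOVSB itself rather than whatever strategy libc memcpy picks. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(n) : : "memory");
    }

    int main(void)
    {
        enum { MAX = 64 * 1024, REPS = 1000 };
        static char src[MAX], dst[MAX];
        memset(src, 1, MAX);
        memset(dst, 2, MAX);

        size_t sizes[] = { 8, 12, 16, 64, 76, 128, 256, 1024, 4096, 65536 };
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
            size_t n = sizes[i];
            uint64_t best = UINT64_MAX;
            unsigned aux;
            for (int r = 0; r < REPS; r++) {
                uint64_t t0 = __rdtsc();
                copy_rep_movsb(dst, src, n);
                uint64_t t1 = __rdtscp(&aux);   /* orders after the copy */
                if (t1 - t0 < best)
                    best = t1 - t0;
            }
            printf("%6zu bytes: ~%llu TSC ticks (best of %d runs)\n",
                   n, (unsigned long long)best, (int)REPS);
        }
        return 0;
    }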
