Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?


Question

loop decrements ecx, and jumps if the result is non-zero. It's slow, but couldn't Intel have cheaply made it fast? A single dec-and-branch uop is already possible (the only difference being that the existing one sets flags).

loop on various microarchitectures, from Agner Fog's tables:


  • K8/K10: 7 m-ops

  • Bulldozer-family: 1 m-op (same cost as a macro-fused compare-and-branch, or jecxz)

  • P4: 4 uops (same as jecxz)

  • Silvermont: 7 uops

  • Via Nano3000: 2 uops

Couldn't the decoders just decode it to at worst the same as lea ecx, [rcx-1] / jecxz? That would be 3 uops.

Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).
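
To make the flag difference concrete, here is a minimal C sketch of the architectural effect of the two sequences. It is only a model: the `cpu_state` struct and the function names are hypothetical, only ZF is tracked, and it says nothing about uop counts or timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal model of the state involved (only ecx and ZF). */
typedef struct {
    uint32_t ecx;
    bool zf;
} cpu_state;

/* loop: decrement ecx and branch if the result is non-zero.
   No flag is written. Returns true when the branch is taken. */
bool step_loop(cpu_state *s) {
    s->ecx -= 1;
    return s->ecx != 0;
}

/* dec ecx / jnz: same counter update and branch condition,
   but dec writes ZF (and other flags not modelled here). */
bool step_dec_jnz(cpu_state *s) {
    s->ecx -= 1;
    s->zf = (s->ecx == 0);
    return !s->zf;
}

/* Number of times a loop body runs when closed by `loop`. */
uint32_t iterations_with_loop(uint32_t start) {
    assert(start != 0);  /* ecx == 0 would wrap and iterate 2^32 times */
    cpu_state s = { start, false };
    uint32_t n = 0;
    do {
        n++;             /* loop body */
    } while (step_loop(&s));
    return n;
}
```

Both step functions leave ecx in the same state and take the branch under the same condition; the only visible difference in this model is whether `zf` changes.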

I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.


  • Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

  • What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.

If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
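
For readers who haven't written one, the kind of adc loop meant here can be sketched in C as below. This is an illustrative model, not code from the question: `add_limbs` is a hypothetical name, and the `carry` variable stands in for CF, which an asm version carries from one adc to the next across the loop-counter update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb numbers, least-significant limb first.
   Writes the n-limb sum to dst and returns the final carry (0 or 1). */
uint32_t add_limbs(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
    assert(n > 0);
    uint32_t carry = 0;  /* plays the role of CF */
    for (size_t i = 0; i < n; i++) {
        /* What one adc computes: a[i] + b[i] + CF. */
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        dst[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
        /* In asm, the counter update happens here, between two adc
           instructions, so it must leave CF intact. dec does, but it
           writes the other flags, which is where the partial-flag
           problem comes from; loop writes no flags at all. */
    }
    return carry;
}
```

For example, adding the two-limb values {0xFFFFFFFF, 1} and {1, 0} propagates a carry out of the low limb into the high one.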

It wouldn't stop me from getting annoyed at all the questions with bad 16-bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.

Recommended answer

Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.

It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.

Good posts with different theories in that thread:

Robert

LOOP became slow on some of the earliest machines (circa 486) when significant pipelining started to happen, and running any but the simplest instruction down the pipeline efficiently was technologically impractical. So LOOP was slow for a number of generations. So nobody used it. So when it became possible to speed it up, there was no real incentive to do so, since nobody was actually using it.

Anton Ertl

IIRC LOOP was used in some software for timing loops; there was (important) software that did not work on CPUs where LOOP was too fast (this was in the early 90s or so). So CPU makers learned to make LOOP slow.


(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)

@Paul A. Clayton (occasional SO poster and CPU architecture guy):

I agree that a shorter sequence should be possible, but I was trying to think of a bloated sequence that might make sense if minimal microarchitectural adjustments were permitted.

Summary: the designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.

If a useless, compatibility-only instruction is handed to the microcode developers, they might reasonably not be able or willing to suggest minor changes to the internal microarchitecture to improve such an instruction. Not only would they rather use their "change suggestion capital" more productively but the suggestion of a change for a useless case would reduce the credibility of other suggestions.

...

The architects behind Nano may have found avoiding the special casing of LOOP simplified their design in terms of area or power. Or they may have had incentives from embedded users to provide a fast implementation (for code density benefits). Those are just WILD guesses.

If optimization of LOOP fell out of other optimizations (like fusion of compare and branch), it might be easier to tweak LOOP into a fast path instruction than to handle it in microcode even if the performance of LOOP was unimportant.

I suspect that such decisions are based on specific details of the implementation. Information about such details does not seem to be generally available and interpreting such information would be beyond the skill level of most people. (I am not a hardware designer--and have never played one on television or stayed at a Holiday Inn Express. :-)


The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.

But still, there are so many small things: setcc could have been changed to 32 bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg.) Shifts could have unconditionally written flags, even with a zero shift count (removing the input data dependency on eflags for variable-count shifts, which matters for out-of-order execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands have the address dependent on the upper bits of the index (bit string, not just bit within a machine word). Those instructions are extremely useful for atomic bit-field stuff, and are slower than they need to be.
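
The bt complaint is easier to see written out. Below is a rough C model of the 32-bit forms, under stated assumptions: `bt_mem32` and `bt_reg32` are hypothetical names, and the index is treated as non-negative for simplicity (the real instruction interprets a register index as signed, so it can also reach below the base address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory-operand form with a register index: the operand is a bit string.
   The upper bits of the index pick which dword is loaded, so the load
   address depends on the index, not just on the base. */
int bt_mem32(const uint32_t *base, uint32_t bit_index) {
    assert(base != NULL);
    const uint32_t *word = base + (bit_index >> 5);  /* index / 32 selects the dword */
    return (int)((*word >> (bit_index & 31)) & 1u);  /* index % 32 selects the bit */
}

/* Register-operand form: the index is taken modulo the operand size,
   so no address computation is involved. */
int bt_reg32(uint32_t value, uint32_t bit_index) {
    return (int)((value >> (bit_index & 31)) & 1u);
}
```

With the same index of 34, the memory form reads bit 2 of the second dword of the array, while the register form just tests bit 2 of the value it was given.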
