C for loop indexing: is forward-indexing faster in new CPUs?


Question

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:

On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i = x - 1; i >= 0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i = 0; i < x; i++) {...}).
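
For concreteness, the two shapes look like this when written out as compilable C (a minimal sketch; the array, its length, and the loop body are placeholders):

    /* Forward: ascending indices, ascending memory addresses. */
    void touch_forward(int *a, int x)
    {
        for (int i = 0; i < x; i++)
            a[i] = 0;
    }

    /* Backward: descending indices; the loop condition tests against zero. */
    void touch_backward(int *a, int x)
    {
        for (int i = x - 1; i >= 0; i--)
            a[i] = 0;
    }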

My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?

(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)

Answer

You're really asking about prefetching, not about loop control logic.

In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer (http://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus/1950995#1950995) for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.
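
To make the old argument concrete, here is the countdown idiom in C (a sketch, not code from the mailing list). The point is that the decrement itself updates the processor's status flags, so on x86 the loop branch can test those flags directly instead of issuing a separate compare against x:

    /* Counting toward zero: the `dec`/`sub` that updates i also sets the
       flags the loop branch needs, so no extra `cmp i, x` is required.
       The forward version needs that compare on every iteration. */
    long sum_backward(const long *a, int x)
    {
        long sum = 0;
        for (int i = x - 1; i >= 0; i--)
            sum += a[i];
        return sum;
    }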

Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
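
A classic way to see this effect, independent of loop direction, is to sum a 2D array in row-major and then column-major order (a hypothetical sketch; the sizes are arbitrary). Both functions do identical arithmetic, but the second strides across rows and keeps evicting cache lines before they're fully used:

    #include <stddef.h>

    enum { ROWS = 1024, COLS = 1024 };  /* illustrative sizes */

    /* Row-major: consecutive addresses, so every byte of each fetched
       cache line gets used. */
    long sum_row_major(const long a[ROWS][COLS])
    {
        long sum = 0;
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = 0; j < COLS; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major: each access jumps COLS * sizeof(long) bytes, so
       nearly every element pulls in a fresh cache line. */
    long sum_col_major(const long a[ROWS][COLS])
    {
        long sum = 0;
        for (size_t j = 0; j < COLS; j++)
            for (size_t i = 0; i < ROWS; i++)
                sum += a[i][j];
        return sum;
    }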

You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.
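
Hardware prefetchers do all of this transparently, but as a rough illustration of the same idea in software: GCC and Clang expose a manual hint, __builtin_prefetch, that requests data before you need it. This is only a sketch, and the lookahead distance of 16 elements is an arbitrary assumption, not a recommendation:

    #include <stddef.h>

    long sum_with_hint(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Args: address, 0 = read access, 1 = moderate temporal
               locality. The right lookahead is machine-specific. */
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);
            sum += a[i];
        }
        return sum;
    }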

Take a look at Intel's guide to optimizing for hardware prefetchers (http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers/). There are four prefetchers listed; two for NetBurst chips:

  1. NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
  2. NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.

and two for Core:

  1. Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc. (a short sketch follows this list).
  2. Core also has an ACL prefetcher like NetBurst.
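
As a concrete example of the kind of strided pattern Core can detect (a sketch; the stride of 4 is arbitrary):

    #include <stddef.h>

    /* Touch every 4th element: a constant-stride stream that a
       stride-detecting prefetcher can recognize and run ahead of. */
    long sum_every_fourth(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i += 4)
            sum += a[i];
        return sum;
    }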

If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do OK making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for more sophisticated array traversals.

These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers (http://www.cs.princeton.edu/~thhung/pub/sc09.pdf). Some guys at Google are on that paper; performance is something that is of great concern to them.

Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
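
As a sketch of what loop tiling looks like (the tile size is a hypothetical tuning parameter, not a rule): a tiled matrix transpose works on one small block of the source and destination at a time, so the cache lines loaded for a block get reused before they're evicted, instead of streaming a whole row against a whole column:

    #include <stddef.h>

    enum { TILE = 64 };  /* hypothetical; tune so the working set fits in cache */

    /* Transpose the n x n matrix src into dst one TILE x TILE block
       at a time. */
    void transpose_tiled(size_t n, const double *src, double *dst)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++)
                        dst[j * n + i] = src[i * n + j];
            }
    }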

I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.
