C for loop indexing: is forward-indexing faster in new CPUs?


Question

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:

On CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i = x - 1; i >= 0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i = 0; i < x; i++) {...}).
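
For concreteness, here are the two loop shapes as compilable functions; the array argument, its length, and the summation body are placeholders I've added for illustration:

    /* Backward: the exit test compares i against zero. */
    long sum_backward(const int *a, int x)
    {
        long sum = 0;
        for (int i = x - 1; i >= 0; i--)
            sum += a[i];
        return sum;
    }

    /* Forward: the exit test compares i against x, and memory is
     * touched in ascending address order. */
    long sum_forward(const int *a, int x)
    {
        long sum = 0;
        for (int i = 0; i < x; i++)
            sum += a[i];
        return sum;
    }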

My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?

(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)

Answer

You're really asking about prefetching, not about loop control logic.

In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.

Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
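
As a minimal illustration (mine, not part of the original answer): both functions below compute the same sum over a row-major 2D array, but the second walks down columns, so consecutive accesses land on different cache lines and the loop pays for far more memory traffic:

    #define ROWS 1024
    #define COLS 1024

    /* Row-major traversal: visits a[i][0], a[i][1], ... in the order
     * the data sits in memory, so each fetched cache line is fully
     * used before the loop moves on. */
    long sum_rows_first(const int a[ROWS][COLS])
    {
        long sum = 0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major traversal: consecutive accesses are
     * COLS * sizeof(int) bytes apart, so nearly every access touches
     * a different cache line. */
    long sum_cols_first(const int a[ROWS][COLS])
    {
        long sum = 0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[i][j];
        return sum;
    }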

You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.

Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:

  1. NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
  2. NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.

and two for Core chips:

  1. Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc. (see the sketch after this list).
  2. Core also has an ACL prefetcher like NetBurst.
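
As a sketch of the strided pattern mentioned above (the stride of 4 is an arbitrary choice for illustration):

    /* Constant-stride traversal: consecutive accesses are always 4
     * elements apart. A stride-detecting prefetcher (like Core's) can
     * recognize the pattern and fetch ahead; a prefetcher that only
     * handles contiguous streams cannot. */
    long sum_every_fourth(const int *a, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i += 4)
            sum += a[i];
        return sum;
    }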

If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do OK making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for more sophisticated array traversals.

These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.

Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
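
Here is a rough sketch of loop tiling using a matrix transpose, a classic case where the naive traversal must walk one of the two arrays against its memory layout; the matrix size and tile dimension are arbitrary tunables I've chosen for illustration:

    #define N    2048
    #define TILE 64   /* tunable: a TILE x TILE block should fit in L1 */

    /* Naive transpose: dst is written down a column, so every write
     * touches a different cache line, and those lines are evicted
     * before their neighboring elements are needed again. */
    void transpose_naive(float *dst, const float *src)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                dst[j * N + i] = src[i * N + j];
    }

    /* Tiled transpose: processes the matrix in TILE x TILE blocks so
     * the source and destination lines of a block stay cache-resident
     * for the whole block. Assumes N is a multiple of TILE. */
    void transpose_tiled(float *dst, const float *src)
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        dst[j * N + i] = src[i * N + j];
    }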

I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.

