How does one write code that best utilizes the CPU cache to improve performance?


Question

This could sound like a subjective question, but what I am looking for are specific instances that you may have encountered related to this.


  1. How do you make code cache effective/cache friendly (more cache hits, as few cache misses as possible)? From both perspectives, data cache and program cache (instruction cache), i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache effective?

Are there any particular data structures one must use or avoid, or a particular way of accessing the members of a structure, etc., to make code cache effective?

Are there any program constructs (if, for, switch, break, goto, ...) or code-flow patterns (a for inside an if, an if inside a for, etc.) one should follow or avoid in this matter?

I am looking forward to hearing individual experiences related to making cache-efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.

The variety will help in understanding the topic more deeply.

Answer

The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and, as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).

Techniques for avoiding memory-fetch latency are typically the first thing to consider, and sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.

Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to the cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of them fail to use 100% of the fetched cache lines before the lines are evicted.

Improving cache line utilization helps in three respects:


  • It tends to fit more useful data into the cache, essentially increasing the effective cache size.

  • It tends to fit more useful data into the same cache line, increasing the likelihood that requested data can be found in the cache.

  • It reduces the memory bandwidth requirements, as there will be fewer fetches.

Popular techniques are:


  • Use smaller data types
  • Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
  • Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up
  • Make sure all adjacent data is actually used in the hot loops; otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data
  • Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures

We should also note that there are other ways to hide memory latency than using caches.

Modern CPUs often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.

By regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustains one latency hit (memory-level parallelism).

To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.

Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking, all strive to avoid those extra memory fetches.

While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop-carried data dependencies to ensure that you don't affect the semantics of the program.

These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.

