How does one write code that best utilizes the CPU cache to improve performance?


Question

This may sound like a subjective question, but what I am looking for are specific instances you may have encountered related to this.

  1. How do you make code cache-effective/cache-friendly (more cache hits, as few cache misses as possible)? From both perspectives, the data cache and the program cache (instruction cache): what things in one's code, related to data structures and code constructs, should one take care of to make it cache-effective?

Are there any particular data structures one must use or avoid, or a particular way of accessing the members of such a structure, etc., to make code cache-effective?

Are there any program constructs (if, for, switch, break, goto, ...) or code flow (a for inside an if, an if inside a for, etc.) one should follow or avoid in this matter?

I am looking forward to hearing individual experiences related to making cache-efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.

The variety will help in understanding the topic more deeply.

Answer

The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).

Techniques for avoiding memory-fetch latency are typically the first thing to consider, and they sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.

Improving spatial locality means ensuring that each cache line is used in full once it has been mapped into the cache. When we have looked at various standard benchmarks, we have seen that a surprisingly large fraction of them fail to use 100% of the fetched cache lines before those lines are evicted.
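As a minimal sketch of the idea (not part of the original answer), compare the two traversal orders below: row-major order touches each fetched cache line in full, while column-major order may use only one element per fetched line.

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static int grid[ROWS][COLS];

/* Row-major traversal: consecutive accesses fall in the same cache
 * line, so every byte of each fetched line is used before moving on. */
long sum_row_major(void) {
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Column-major traversal: successive accesses are COLS * sizeof(int)
 * bytes apart, so each access may pull in a fresh cache line of which
 * only one element is used -- poor spatial locality. */
long sum_col_major(void) {
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

Both functions compute the same sum; the first simply does so with far fewer cache-line fetches for a C-style (row-major) array.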

Improving cache line utilization helps in three respects:

  • It tends to fit more useful data into the cache, essentially increasing the effective cache size.
  • It tends to fit more useful data into the same cache line, increasing the likelihood that requested data is found in the cache.
  • It reduces the memory bandwidth requirements, as there will be fewer fetches.

Common techniques are:

  • Use smaller data types
  • Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
  • Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up
  • Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking data structures up into hot and cold components, so that the hot loops use hot data
  • Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures
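The second point above can be sketched as follows. The struct and field names are illustrative; exact sizes depend on the platform ABI, so the comments assume a typical 8-byte alignment for double.

```c
#include <stdint.h>

/* Members in "natural" order: the compiler inserts padding after each
 * small field to align the next larger one. */
struct Padded {
    uint8_t  flag;    /* 1 byte + padding up to double alignment */
    double   value;
    uint16_t id;      /* 2 bytes + padding up to double alignment */
    double   weight;
};                    /* typically 32 bytes on x86-64 */

/* Same members sorted by decreasing size: the holes collapse into one
 * small tail, so more structs fit per cache line. */
struct Packed {
    double   value;
    double   weight;
    uint16_t id;
    uint8_t  flag;    /* small tail padding only */
};                    /* typically 24 bytes on x86-64 */
```

Fewer bytes per object means more objects per fetched cache line, which is exactly the "more useful data in the same line" effect listed above.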

We should also note that there are other ways to hide memory latency than using caches.

Modern CPUs often have one or more hardware prefetchers. They train on misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
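For instance, GCC and Clang expose software prefetch via the `__builtin_prefetch` builtin. The sketch below (illustrative, not from the original answer) applies it to an indirect access pattern that a hardware prefetcher typically cannot predict; the hint is advisory only, and the look-ahead distance of 8 is an arbitrary example value you would tune.

```c
#include <stddef.h>

/* Sums values reached through an index array -- an irregular access
 * pattern. The __builtin_prefetch hint requests the value needed a few
 * iterations ahead; args: address, 0 = read, locality hint 0..3. */
long sum_indirect(const int *values, const size_t *idx, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&values[idx[i + 8]], 0, 1);
        total += values[idx[i]];
    }
    return total;
}
```

Whether this helps in practice depends on the miss latency and how much work each iteration does; measure before and after.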

By regrouping instructions so that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches, so that the application sustains only one latency hit (memory-level parallelism).

To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.

Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking, all strive to avoid those extra memory fetches.
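A minimal sketch of loop fusion (illustrative names and operations, not from the original answer): the unfused version streams the array through the cache twice, while the fused version loads and stores each element once.

```c
#include <stddef.h>

/* Unfused: arr is streamed through the cache twice; for a large n, the
 * early elements are long evicted by the time the second loop runs. */
void scale_then_offset(double *arr, size_t n, double k, double b) {
    for (size_t i = 0; i < n; i++) arr[i] *= k;
    for (size_t i = 0; i < n; i++) arr[i] += b;
}

/* Fused: both updates happen while the element is still at hand, so
 * each cache line is fetched (and written back) only once. */
void scale_offset_fused(double *arr, size_t n, double k, double b) {
    for (size_t i = 0; i < n; i++)
        arr[i] = arr[i] * k + b;
}
```

Both produce identical results here; fusion is only safe when the merged loops have no interfering dependencies, which is exactly the caveat discussed next.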

While there are some rules of thumb for this rewriting exercise, you typically have to carefully consider loop-carried data dependencies to ensure that you don't affect the semantics of the program.
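For tiling/blocking, the textbook instance is blocked matrix multiplication. The sketch below is illustrative (sizes and tile edge are arbitrary example values): by working in small tiles, each tile of B is reused many times while it is still cached, instead of being re-fetched for every row of A.

```c
#include <stddef.h>

#define N 64
#define BLOCK 16  /* tile edge; tuned so the working tiles fit in cache */

/* Blocked matrix multiply, C += A * B, processed in BLOCK x BLOCK
 * tiles. The innermost loop runs along rows, keeping the accesses to
 * B and C sequential (good spatial locality) while the tiling gives
 * each tile of B temporal reuse across BLOCK rows of A. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK; i++)
                    for (size_t k = kk; k < kk + BLOCK; k++) {
                        double a = A[i][k];
                        for (size_t j = jj; j < jj + BLOCK; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

The arithmetic is identical to the naive triple loop, only reordered; that reordering is legal here because each C[i][j] accumulation is independent, which is the dependency check the paragraph above warns about.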

These things are what really pays off in the multicore world, where you otherwise typically won't see much throughput improvement after adding the second thread.
