What is a "cache-friendly" code?


Question

What is the difference between "cache unfriendly code" and the "cache friendly" code?

How can I make sure I write cache-efficient code?

Answer

Preliminaries

On modern computers, only the lowest-level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive, and most computer cores have fewer than a few dozen of them. At the other end of the memory spectrum (DRAM), memory is very cheap (literally millions of times cheaper) but takes hundreds of cycles after a request to deliver the data. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in order of decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in the L1 cache, it looks in the L2 cache. If not there, then the L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy: cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap but very slow.)

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cf. the links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or beyond), and this will only get worse over time. Increasing the processor frequency is no longer a relevant route to better performance: the problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of the die on caches and up to 99% on storing and moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

  • Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
  • If you are into videos, I strongly recommend having a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
  • Slides about memory optimization by Christer Ericson (director of technology @ Sony)
  • LWN.net's article "What every programmer should know about memory"

A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

  1. Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
  2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.
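
Both kinds of locality can be seen in a toy example. The following is a minimal C++ sketch (illustrative only, not part of the original answer): both functions touch every element of the same array, but the stride-1 loop consumes every value of each cache line it fetches, while the strided loop wastes most of each fetched line and therefore misses far more often.

#include <cstddef>

// Sum sequentially: consecutive elements share cache lines (spatial locality).
double sum_stride_1(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Sum in k strided passes: same total work, but each access lands on a
// different cache line, so far fewer of the fetched bytes are actually used.
double sum_stride_k(const double* data, std::size_t n, std::size_t k) {
    double sum = 0.0;
    for (std::size_t start = 0; start < k; ++start)
        for (std::size_t i = start; i < n; i += k)
            sum += data[i];
    return sum;
}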

Use appropriate C++ containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.
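
A minimal sketch of this difference (my example, not from the answer): traversing both containers does the same arithmetic, but the vector scan streams through contiguous cache lines while the list scan chases pointers to scattered heap nodes.

#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: a linear scan prefetches well and reuses cache lines.
long long sum_vector(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Node-based storage: each iterator increment dereferences a pointer to a
// node that may live anywhere on the heap, so every step can be a cache miss.
long long sum_list(const std::list<int>& l) {
    return std::accumulate(l.begin(), l.end(), 0LL);
}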

A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (see for example ATLAS).
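
As a sketch of what cache blocking looks like (illustrative; the block size is an assumed, tunable value), here is a tiled matrix transpose: the naive version writes the destination with stride-n jumps and evicts cache lines long before they are reused, while the tiled version finishes a small block whose lines fit in cache before moving on.

#include <cstddef>
#include <vector>

// Transpose an n x n row-major matrix in B x B tiles. B = 64 is an assumed
// default; the right choice depends on the cache sizes of the machine.
void transpose_blocked(const std::vector<double>& src, std::vector<double>& dst,
                       std::size_t n, std::size_t B = 64) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            // Work entirely inside one tile before touching the next one.
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}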

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (e.g. Fortran, MATLAB) versus row-major (e.g. C, C++) ordering for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can hold 2 matrix elements, and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing the column index first in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing the row index first in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.
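
The same point, written as loops (a sketch; m is assumed to be a flat row-major n x n array): the only difference between the two functions is the loop order, yet the first walks memory sequentially while the second jumps by n elements on every access.

#include <cstddef>

// Row-major friendly: the inner loop walks consecutive addresses.
double sum_rows_first(const double* m, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += m[i * n + j];
    return sum;
}

// Row-major hostile: the inner loop strides by n doubles, touching a new
// cache line on (almost) every access for large n.
double sum_columns_first(const double* m, std::size_t n) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum += m[i * n + j];
    return sum;
}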

Avoid unpredictable branches

Modern architectures feature pipelines, and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
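
A condensed sketch of the effect discussed in that question (illustrative; actual code generation varies by compiler): on random input the branch below is mispredicted roughly half the time, while the branchless variant gives the compiler a chance to emit a conditional move instead.

#include <cstddef>
#include <cstdint>

// Data-dependent branch: unpredictable if data[] is random, cheap if sorted.
std::int64_t sum_big_branchy(const std::uint8_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}

// Branchless form: the ternary typically compiles to a conditional move,
// leaving nothing for the branch predictor to get wrong.
std::int64_t sum_big_branchless(const std::uint8_t* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += (data[i] >= 128) ? data[i] : 0;
    return sum;
}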

Avoid virtual functions

In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
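
To make the mechanism concrete, a hypothetical sketch of the pattern in question: a hot loop dispatching through base-class pointers. Every call is indirect (through the object's vtable pointer), and when many different concrete types alternate, the call target keeps changing, which hurts both prediction and the instruction cache.

#include <memory>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;  // resolved through the vtable at run time
};

struct Circle : Shape {
    double r = 1.0;
    double area() const override { return 3.14159265358979 * r * r; }
};

// One indirect call per element; with a mixed bag of concrete types the call
// target is hard to predict and the relevant code/vtables may not stay cached.
double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes)
        sum += s->area();
    return sum;
}

One common mitigation is to group objects of the same concrete type together (e.g. one container per type) so that consecutive calls hit the same target.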

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when threads on different processors write to logically independent data that happens to reside in the same cache line. Each write invalidates the line in the other processor's cache, so the cache line -- which contains data the other processor still needs -- is evicted and reloaded again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?
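
A classic sketch of false sharing and its fix (the 64-byte line size is an assumption; it is common on x86 but not universal): two threads increment two logically independent counters. Packed together, the counters share one cache line that ping-pongs between cores; padding each counter to its own line removes the contention without changing the logic.

#include <atomic>
#include <thread>

struct PackedCounters {        // a and b share one cache line: false sharing
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedCounters {        // each counter gets its own 64-byte line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Each thread touches only its own counter, yet with PackedCounters the two
// cores still invalidate each other's cached copy of the shared line.
template <class Counters>
void hammer(Counters& c) {
    std::thread t1([&c] { for (int i = 0; i < 1000000; ++i) c.a.fetch_add(1); });
    std::thread t2([&c] { for (int i = 0; i < 1000000; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}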

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in a currently resident page) that require disk access.
