What is "cache-friendly" code?


Problem description

Could someone give an example of "cache-unfriendly" code and the "cache-friendly" version of that code?



How can I make sure I write cache-efficient code?

Solution

Preliminaries



On modern computers, only the lowest-level memory structures (the registers) can move data around in a single clock cycle. However, registers are very expensive, and most computer cores have fewer than a few dozen of them (a few hundred to maybe a thousand bytes in total). At the other end of the memory spectrum (DRAM), memory is very cheap (literally millions of times cheaper) but takes hundreds of cycles after a request to deliver the data. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3, in decreasing speed and cost. The idea is that most of the executing code will hit a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in the L1 cache, it looks in the L2 cache; if not there, the L3 cache; and if not there, main memory. Each of these "misses" is expensive in time.



(The analogy is: cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap, but very slow.)



Caching is one of the main methods to reduce the impact of latency (cf. the Herb Sutter talk linked at the start). To paraphrase Herb Sutter (cf. the links below): increasing bandwidth is easy, but we can't buy our way out of latency.



Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse ...), which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for an HDD). In comparison, reading data from the (highest-level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or beyond), and this will only get worse over time. Increasing processor frequency is currently no longer enough to increase performance; the problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of the die on caches and up to 99% on storing/moving data!



There is quite a lot to be said on the subject; there are many good references on caches, memory hierarchies and proper programming.





Main concepts for cache-friendly code



A very important aspect of cache-friendly code is the principle of locality, the goal of which is to place related data close together in memory to allow efficient caching. In terms of the CPU cache, it is important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance for optimizing caching:
1. Temporal locality: when a given memory location is accessed, it is likely that the same location will be accessed again in the near future. Ideally, this information will still be cached at that point.

2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than was specifically asked for, because the program will very often need that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important. (A short sketch of both kinds of locality follows this list.)
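
To make the two kinds of locality concrete, here is a minimal sketch (the function names and the two-pass/fused framing are my own illustration, not part of the original answer). Fusing the two loops reuses each element while it is still in cache:

#include <vector>
#include <cstddef>

// Two passes: by the time the second loop runs, the early elements of
// 'data' have long been evicted from the cache for large inputs.
double sums_two_passes(const std::vector<double>& data) {
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) sum += data[i];

    double sum_sq = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) sum_sq += data[i] * data[i];

    return sum + sum_sq;
}

// One fused pass: data[i] is reused immediately while it is still cached
// (temporal locality), and consecutive elements share cache lines
// (spatial locality).
double sums_fused(const std::vector<double>& data) {
    double sum = 0.0, sum_sq = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum    += data[i];
        sum_sq += data[i] * data[i];
    }
    return sum + sum_sq;
}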

Use appropriate containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. The elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing the elements of a std::list, which stores its content all over the place. This is due to spatial locality.
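
A minimal sketch of this container difference (the timing scaffolding and sizes are illustrative choices of mine, not from the original answer): summing a std::vector walks consecutive addresses, while summing a std::list chases pointers to nodes that may be scattered across the heap.

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const int n = 1'000'000;
    std::vector<int> v(n, 1);   // contiguous storage: cache-friendly
    std::list<int>   l(n, 1);   // node-based storage: cache-unfriendly

    auto time_sum = [](const auto& container) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = std::accumulate(container.begin(), container.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "sum = " << sum << ", took "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us\n";
    };

    time_sum(v);   // sequential, prefetch-friendly access
    time_sum(l);   // each node is a separate allocation, scattered in memory
}

On typical hardware the vector traversal is several times faster, precisely because of spatial locality and hardware prefetching.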



A very nice illustration of this is given by Bjarne Stroustrup in this YouTube clip (thanks to @Mohammad Ali Baydoun for the link!).



Don't neglect the cache in data structure and algorithm design



Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (see, for example, ATLAS).
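
A rough sketch of cache blocking, under assumptions of my own (square row-major matrices in flat buffers, an arbitrary block size of 64, and a zero-initialized output); the blocked version reuses each tile of A and B many times while it is still in cache:

#include <algorithm>
#include <cstddef>
#include <vector>

// Naive triple loop: for large n, B is traversed column-wise, so almost
// every access to B touches a different cache line.
void matmul_naive(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

// Blocked (tiled) version: work on BS x BS sub-blocks so that the tiles of
// A, B and C fit in cache and are reused before being evicted.
// Assumes C is zero-initialized, since it accumulates into C.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t n, std::size_t BS = 64) {
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}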



Know and exploit the implicit structure of data



Another simple example, which many people in the field sometimes forget, is column-major ordering (e.g. Fortran, MATLAB) versus row-major ordering (e.g. C, C++) for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering it would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.



When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this results in fewer memory accesses (because the next few values needed for subsequent computations are already in that cache line).



For simplicity, assume the cache comprises a single cache line which can hold 2 matrix elements, and that when a given element is fetched from memory, the next one is fetched too. Say we want to take the sum over all elements of the example 2x2 matrix above (let's call it M):



Exploiting the ordering (e.g. changing the column index first in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing the row index first in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses



In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.
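
The same idea in code, for a larger row-major matrix stored in a flat buffer (the flat-buffer layout and the element type are my own illustrative choices): the two functions differ only in loop order, yet the first walks memory sequentially while the second strides by a full row on every access.

#include <cstddef>
#include <vector>

// M is an n x n matrix stored row-major in a flat buffer: M(i, j) == M[i * n + j].

// Cache-friendly: the inner loop walks consecutive addresses,
// so each fetched cache line is fully used.
double sum_row_by_row(const std::vector<double>& M, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += M[i * n + j];
    return sum;
}

// Cache-unfriendly: the inner loop strides by n elements, so for large n
// almost every access lands on a different cache line.
double sum_column_by_column(const std::vector<double>& M, std::size_t n) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum += M[i * n + j];
    return sum;
}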



Avoid unpredictable branches

Modern architectures feature pipelines, and compilers are becoming very good at reordering code to minimize the delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This indirectly leads to more cache misses.



This is explained very well here (thanks to @0x90 for the link): Why is it faster to process a sorted array than an unsorted array?
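
As a hedged sketch of the kind of branch that question discusses (the threshold and data layout are arbitrary choices of mine): the branchy loop is fast on sorted input and slow on random input because the branch becomes unpredictable, while the branchless variant replaces the branch with arithmetic and behaves the same either way.

#include <cstdint>
#include <vector>

// Branchy: the 'if' is easy to predict on sorted data, hard on random data.
long long sum_large_branchy(const std::vector<int>& data) {
    long long sum = 0;
    for (int x : data)
        if (x >= 128)
            sum += x;
    return sum;
}

// Branchless: a mask replaces the unpredictable branch, so the pipeline
// and prefetcher are never left waiting for the branch to resolve.
long long sum_large_branchless(const std::vector<int>& data) {
    long long sum = 0;
    for (int x : data) {
        const long long mask = -static_cast<long long>(x >= 128);  // 0 or all ones
        sum += mask & x;
    }
    return sum;
}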



Avoid virtual functions



In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For a reference on this issue, check out: What is the performance cost of having a virtual method in a C++ class?
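
For illustration only (a simplified example of my own, not taken from the original answer): a virtual call in a hot loop goes through the vtable and blocks inlining, whereas iterating over a contiguous vector of a known concrete type avoids both the indirection and the pointer chasing.

#include <memory>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;   // dispatched through the vtable
};

struct Circle : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

// Virtual dispatch: each call loads the vtable pointer of a (possibly
// heap-scattered) object; with many different derived types mixed together,
// this adds indirect branches and extra cache traffic.
double total_area_virtual(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double total = 0.0;
    for (const auto& s : shapes) total += s->area();
    return total;
}

// Concrete, contiguous storage: no vtable lookup, calls can be inlined,
// and the objects themselves sit next to each other in memory.
double total_area_direct(const std::vector<Circle>& circles) {
    double total = 0.0;
    for (const Circle& c : circles) total += c.area();
    return total;
}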



Common problems



A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when different processors attempt to use data in different memory regions that end up stored in the same cache line. This causes the cache line -- which also contains data another processor can use -- to be invalidated and overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation.
See also (thanks to @Matt for the link): How and when to align to cache line size?
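
A hedged sketch of false sharing (the thread and iteration counts are arbitrary choices of mine): two threads update two different counters, but if both counters live in one cache line, that line ping-pongs between the cores; aligning each counter to its own cache line removes the interference.

#include <atomic>
#include <thread>

// Both counters typically end up in the same cache line: every increment by
// one thread invalidates the line in the other core's cache (false sharing).
struct SharedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter gets its own cache line (64 bytes is a common line size),
// so the two threads no longer interfere with each other.
struct PaddedCounters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main() {
    SharedCounters shared;   // typically noticeably slower
    PaddedCounters padded;   // typically close to ideal scaling
    hammer(shared);
    hammer(padded);
}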



An extreme symptom of poor caching in RAM (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page), which require disk access.


