Why is linear read-shuffled write not faster than shuffled read-linear write?


Problem description


I'm currently trying to get a better understanding of memory/cache related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data whereas in the latter case it can just ship them out and forget about them.

With that in mind, I did the following quick-and-dirty test: I wrote a script that creates an array of N random floats and a permutation, i.e. an array containing the numbers 0 to N-1 in random order. Then it repeatedly either (1) reads the data array linearly and writes it back to a new array in the random access pattern given by the permutation or (2) reads the data array in the permuted order and linearly writes it to a new array.

To my surprise, (2) seemed consistently faster than (1). There were, however, problems with my script:

  • The script is written in python/numpy. This being quite a high-level language, it is not clear how precisely the reads/writes are implemented.
  • I probably did not balance the two cases properly.

Also, some of the answers/comments below suggest that my original expectation isn't correct and that depending on details of the cpu cache either case might be faster.

My question is:

  • Which (if any) of the two should be faster?
  • What are the relevant cache concepts here; how do they influence the result?

A beginner-friendly explanation would be appreciated. Any supporting code should be in C / cython / numpy / numba or python.

Optionally:

  • Explain why the absolute durations are nonlinear in problem size (cf. timings below).
  • Explain the behavior of my clearly inadequate python experiments.

For reference, my platform is Linux-4.12.14-lp150.11-default-x86_64-with-glibc2.3.4. Python version is 3.6.5.

Here is the code I wrote:

import numpy as np
from timeit import timeit

def setup():
    global a, b, c
    a = np.random.permutation(N)
    b = np.random.random(N)
    c = np.empty_like(b)

def fwd():
    c = b[a]

def inv():
    c[a] = b

N = 10_000
setup()

timeit(fwd, number=100_000)
# 1.4942631321027875
timeit(inv, number=100_000)
# 2.531870319042355

N = 100_000
setup()

timeit(fwd, number=10_000)
# 2.4054739447310567
timeit(inv, number=10_000)
# 3.2365565397776663

N = 1_000_000
setup()

timeit(fwd, number=1_000)
# 11.131387163884938
timeit(inv, number=1_000)
# 14.19817715883255

As pointed out by @Trilarion and @Yann Vernier, my snippets aren't properly balanced, so I replaced them with

def fwd():
    c[d] = b[a]
    b[d] = c[a]

def inv():
    c[a] = b[d]
    b[a] = c[d]

where d = np.arange(N) (I shuffle everything both ways to hopefully reduce across-trial caching effects). I also replaced timeit with repeat and reduced the number of repeats by a factor of 10.

Then I get

[0.6757169323973358, 0.6705542299896479, 0.6702114241197705]    #fwd
[0.8183442652225494, 0.8382121799513698, 0.8173762648366392]    #inv
[1.0969422250054777, 1.0725746559910476, 1.0892365919426084]    #fwd
[1.0284497970715165, 1.025063106790185, 1.0247828317806125]     #inv
[3.073981977067888, 3.077839042060077, 3.072118630632758]       #fwd
[3.2967213969677687, 3.2996009718626738, 3.2817375687882304]    #inv

So there still seems to be a difference, but it is much more subtle and can now go either way depending on the problem size.

Solution

This is a complex problem closely related to architectural features of modern processors, and your intuition that random reads are slower than random writes because the CPU has to wait for the read data does not hold up (most of the time). There are several reasons for this, which I will detail.

  1. Modern processors are very efficient at hiding read latency

  2. while memory writes are more expensive than memory reads

  3. especially in a multicore environment

Reason #1: Modern processors are efficient at hiding read latency.

Modern superscalar processors can execute several instructions simultaneously and change the instruction execution order (out-of-order execution). While the first reason for these features is to increase instruction throughput, one of the most interesting consequences is the ability of processors to hide the latency of memory reads (or of complex operators, branches, etc).

To explain this, let us consider a simple piece of code that copies one array into another.

for i in range(N):
    c[i] = b[i]

Once compiled, the code executed by the processor will look somewhat like this:

#1. (iteration 1) c[0] = b[0]
1a. read memory at b[0] and store result in register c0
1b. write register c0 at memory address c[0]
#2. (iteration 2) c[1] = b[1]
2a. read memory at b[1] and store result in register c1
2b. write register c1 at memory address c[1]
#3. (iteration 3) c[2] = b[2]
3a. read memory at b[2] and store result in register c2
3b. write register c2 at memory address c[2]
# etc

(this is terribly oversimplified and the actual code is more complex and has to deal with loop management, address computation, etc, but this simplistic model is presently sufficient).

As said in the question, for reads the processor has to wait for the actual data. Indeed, 1b needs the data fetched by 1a and cannot execute as long as 1a is not completed. Such a constraint is called a dependency, and we can say that 1b is dependent on 1a. Dependencies are a major notion in modern processors. Dependencies express the algorithm (e.g. I write b to c) and must absolutely be respected. But if there is no dependency between instructions, processors will try to execute other pending instructions in order to keep their execution pipeline always active. This can lead to out-of-order execution, as long as dependencies are respected (similar to the as-if rule).

For the considered code, there is no dependency between high-level instruction 2. and 1. (or between asm instructions 2a and 2b and the previous instructions). Actually the final result would even be identical if 2. were executed before 1., and the processor will try to execute 2a and 2b before the completion of 1a and 1b. There is still a dependency between 2a and 2b, but both can be issued. And similarly for 3a. and 3b., and so on. This is a powerful means to hide memory latency. If for some reason 2., 3. and 4. can terminate before 1. loads its data, you may not even notice any slowdown at all.

This instruction level parallelism is managed by a set of "queues" in the processor.

  • a queue of pending instructions in the reservation stations RS (typ. 128 μinstructions in recent Pentiums). As soon as the resources required by an instruction are available (for instance the value of register c0 for instruction 1b), the instruction can execute.

  • a queue of pending memory accesses in the memory order buffer MOB before the L1 cache. This is required to deal with memory aliases and to ensure sequentiality of memory writes or loads at the same address (typ. 64 loads, 32 stores).

  • a queue to enforce sequentiality when writing back results to registers (the reorder buffer ROB, 168 entries), for similar reasons.

  • and some other queues at instruction fetch, for μop generation, for write and miss buffers in the cache, etc.

At some point during the execution of the previous program, there will be many pending store instructions in the RS, several loads in the MOB, and instructions waiting to retire in the ROB.

As soon as data becomes available (for instance a read terminates), the dependent instructions can execute, which frees positions in the queues. But if no termination occurs and one of these queues is full, the functional unit associated with this queue stalls (this can also happen at instruction issue if the processor runs out of register names). Stalls are what creates performance loss, and to avoid them, queue filling must be limited.

This explains the difference between linear and random memory accesses.
In a linear access, 1/ the number of misses will be smaller because of the better spatial locality, and because caches can prefetch accesses with a regular pattern to reduce it further, and 2/ whenever a read terminates, it concerns a complete cache line and can free several pending load instructions, limiting the filling of the instruction queues. This way the processor is permanently busy and memory latency is hidden.
For a random access, the number of misses will be higher and only a single load can be served when data arrives. Hence the instruction queues will saturate rapidly, the processor stalls, and memory latency can no longer be hidden by executing other instructions.
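
To make the contrast concrete, here is a minimal C sketch (not part of the original answer) of the two read patterns; the index array idx is assumed to be a precomputed random permutation of 0..n-1.

#include <stdint.h>
#include <stddef.h>

/* Linear read: consecutive addresses, one miss per 64-byte line, prefetch-friendly. */
void copy_linear(int64_t *restrict dst, const int64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Random read: idx is a random permutation of 0..n-1, so almost every access
   touches a different cache line and the misses cannot be prefetched. */
void copy_read_random(int64_t *restrict dst, const int64_t *restrict src,
                      const size_t *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[idx[i]];
}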

The processor architecture must be balanced in terms of throughput in order to avoid queue saturation and stalls. Indeed, there are generally tens of instructions at some stage of execution in a processor, and global throughput (i.e. the ability of the memory (or functional units) to serve instruction requests) is the main factor that determines performance. The fact that some of these pending instructions are waiting for a memory value has a minor effect...

...except if you have long dependency chains.

There is a dependency when an instruction has to wait for the completion of a previous one. Using the result of a read is a dependency. And dependencies can be a problem when involved in a dependency chain.

For instance, consider the code for i in range(1,100000): s += a[i]. All the memory reads are independent, but there is a dependency chain for the accumulation in s. No addition can happen until the previous one has terminated. These dependencies will rapidly fill the reservation stations and create stalls in the pipeline.

But reads are rarely involved in dependency chains. It is still possible to imagine pathological code where all reads depend on the previous one (for instance for i in range(1,100000): s = a[s]), but this is uncommon in real code. The problem comes from the dependency chain, not from the fact that it is a read; the situation would be similar (and probably even worse) with compute-bound dependent code like for i in range(1,100000): x = 1.0/x+1.0.
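
As a hypothetical C illustration of this point (independent reads vs a read-to-read dependency chain), one could compare the following two loops; neither is taken from the measured benchmark.

#include <stdint.h>
#include <stddef.h>

/* Independent reads: the loads of a[i] do not depend on each other; only the
   accumulation in s forms a (short) dependency chain. */
int64_t sum_array(const int64_t *a, size_t n)
{
    int64_t s = 0;
    for (size_t i = 1; i < n; ++i)
        s += a[i];
    return s;
}

/* Pathological pointer chasing: each load needs the result of the previous one,
   so every cache miss is fully exposed and cannot be overlapped. */
size_t chase(const size_t *a, size_t n, size_t start)
{
    size_t s = start;
    for (size_t i = 1; i < n; ++i)
        s = a[s];   /* read-to-read dependency chain */
    return s;
}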

Hence, except in some situations, computation time is more related to throughput than to read dependency, thanks to the fact that superscalar out-of-order execution hides latency. And as far as throughput is concerned, writes are worse than reads.

Reason #2: Memory writes (especially random ones) are more expensive than memory reads

This is related to the way caches behave. A cache is a fast memory that stores a part of main memory (called a line) close to the processor. Cache lines are presently 64 bytes and allow exploiting the spatial locality of memory references: once a line is stored, all data in that line is immediately available. The important aspect here is that all transfers between the cache and memory are lines.

When the processor performs a read, the cache checks whether the line to which the data belongs is in the cache. If not, the line is fetched from memory, stored in the cache, and the desired data is sent back to the processor.

When the processor writes data to memory, the cache also checks for the presence of the line. If the line is not present, the cache cannot send the data to memory (because all transfers are line based) and performs the following steps:

  1. the cache fetches the line from memory and stores it in a cache line.
  2. the data is written into the cache and the complete line is marked as modified (dirty).
  3. when a line is evicted from the cache, the modified flag is checked, and if the line has been modified, it is written back to memory (write-back cache).

Hence, every memory write must be preceded by a memory read to get the line into the cache. This adds an extra operation, but it is not very expensive for linear writes. There will be a cache miss and a memory read for the first written word, but successive writes will only involve the cache and will be hits.

But the situation is very different for random writes. If the number of misses is significant, every cache miss implies a read followed by only a small number of writes before the line is ejected from the cache, which significantly increases write cost. If a line is ejected after a single write, we can even consider that a write costs twice as much time as a read.
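
A back-of-the-envelope estimate of the memory traffic per element (my own sketch, assuming 64-byte lines, 8-byte elements, a write-allocate/write-back cache, and the worst case of one useful access per line before eviction):

#include <stdio.h>

int main(void)
{
    const double line = 64.0;   /* cache line size in bytes    */
    const double elem = 8.0;    /* int64 element size in bytes */

    /* Linear read or write: one line transfer serves line/elem elements. */
    double linear = line / (line / elem);       /*   8 bytes per element  */

    /* Random read: each element may fetch a whole line.                  */
    double rnd_read = line;                     /*  64 bytes per element  */

    /* Random write: line fill (read) plus dirty write-back per element.  */
    double rnd_write = 2.0 * line;              /* 128 bytes per element  */

    printf("linear: %.0f B, random read: %.0f B, random write: %.0f B per element\n",
           linear, rnd_read, rnd_write);
    return 0;
}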

It is important to note that increasing the number of memory accesses (either reads or writes) tends to saturate the memory access path and to globally slow down all transfers between the processor and memory.

In either case, writes are always more expensive than reads, and multicores amplify this aspect.

Reason #3: Random writes create cache misses in multicores

Not sure this really applies to the situation of the question. While numpy BLAS routines are multithreaded, I do not think basic array copy is. But it is closely related and is another reason why writes are more expensive.

The problem with multicores is to ensure proper cache coherence, in such a way that data shared by several processors is properly updated in the cache of every core. This is done by means of a protocol such as MESI, which updates a cache line before writing it and invalidates the other cached copies (read for ownership).

While none of the data is actually shared between cores in the question (or in a parallel version of it), note that the protocol applies to cache lines. Whenever a cache line is about to be modified, it is copied from the cache holding the most recent copy, locally updated, and all other copies are invalidated, even if the cores are accessing different parts of the cache line. Such a situation is called false sharing, and it is an important issue in multicore programming.

Concerning the problem of random writes, cache lines are 64 bytes and can hold 8 int64 values, and if the computer has 8 cores, every core will process on average 2 values per line. Hence there is significant false sharing that will slow down writes.
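
A hypothetical OpenMP sketch of how a parallel random write runs into false sharing (the pragma and function signature are my own; this is not the benchmark code):

#include <stdint.h>
#include <stddef.h>
#include <omp.h>

/* With idx a random permutation of 0..n-1, different threads regularly store
   into the same 64-byte line; the line then bounces between the cores' caches
   (MESI invalidations) even though no element is actually shared. */
void parallel_write_random(int64_t *restrict c, const int64_t *restrict b,
                           const size_t *restrict idx, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        c[idx[i]] = b[i];   /* neighbouring idx[i] values may share a line */
}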


We did some performance evaluations. They were performed in C in order to include an evaluation of the impact of parallelization. We compared 5 functions that process int64 arrays of size N.

  1. Just a copy of b to c (c[i] = b[i]) (implemented by the compiler with memcpy())

  2. Copy with a linear index c[i] = b[d[i]] where d[i]==i (read_linear)

  3. Copy with a random index c[i] = b[a[i]] where a is a random permutation of 0..N-1 (read_random is equivalent to fwd in the original question)

  4. Write linear c[d[i]] = b[i] where d[i]==i (write_linear)

  5. Write random c[a[i]] = b[i] where a is a random permutation of 0..N-1 (write_random is equivalent to inv in the question)

Code has been compiled with gcc -O3 -funroll-loops -march=native -malign-double on a Skylake processor. Performance is measured with _rdtsc() and is given in cycles per iteration. Each function is executed several times (1000-20000 depending on array size), 10 experiments are performed, and the smallest time is kept.
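
The benchmark source is not included in the answer; a minimal sketch of the kind of kernels and timing loop described above might look like the following (function names mirror the list above; the __rdtsc() intrinsic form, the signatures, and the best-of-reps loop are my assumptions):

#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>   /* __rdtsc() */

typedef void kernel_fn(int64_t *c, const int64_t *b, const size_t *idx, size_t n);

/* 1. plain copy (idx unused) */
static void copy(int64_t *c, const int64_t *b, const size_t *idx, size_t n)
{ (void)idx; for (size_t i = 0; i < n; ++i) c[i] = b[i]; }

/* 2. / 4. linear index: call with idx[i] == i */
static void read_linear(int64_t *c, const int64_t *b, const size_t *d, size_t n)
{ for (size_t i = 0; i < n; ++i) c[i] = b[d[i]]; }
static void write_linear(int64_t *c, const int64_t *b, const size_t *d, size_t n)
{ for (size_t i = 0; i < n; ++i) c[d[i]] = b[i]; }

/* 3. / 5. random index: call with idx a random permutation of 0..n-1 */
static void read_random(int64_t *c, const int64_t *b, const size_t *a, size_t n)
{ for (size_t i = 0; i < n; ++i) c[i] = b[a[i]]; }
static void write_random(int64_t *c, const int64_t *b, const size_t *a, size_t n)
{ for (size_t i = 0; i < n; ++i) c[a[i]] = b[i]; }

/* Cycles per iteration for one kernel, best of 'reps' runs. */
static double cycles_per_iter(kernel_fn *f, int64_t *c, const int64_t *b,
                              const size_t *idx, size_t n, int reps)
{
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < reps; ++r) {
        uint64_t t0 = __rdtsc();
        f(c, b, idx, n);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return (double)best / (double)n;
}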

Array sizes range from 4000 to 1200000. All code has been measured with a sequential and a parallel version using OpenMP.

Here is a graph of the results. Functions are shown in different colors, with the sequential versions in thick lines and the parallel ones in thin lines.

Direct copy is (obviously) the fastest and is implemented by gcc with the highly optimized memcpy(). It is a means to estimate the data throughput with memory. It ranges from 0.8 cycles per iteration (CPI) for small arrays to 2.0 CPI for large ones.

Read linear takes approximately twice as long as memcpy, but it does 2 reads and 1 write, vs 1 read and 1 write for the direct copy. Moreover, the index adds some dependency. The minimum value is 1.56 CPI and the maximum 3.8 CPI. Write linear is slightly longer (5-10%).

Reads and writes with a random index are the purpose of the original question and deserve a longer comment. Here are the results.

size    4000    6000    9000    13496   20240   30360   45536   68304   102456  153680  230520  345776  518664  777992  1166984
rd-rand 1.86821 2.52813 2.90533 3.50055 4.69627 5.10521 5.07396 5.57629 6.13607 7.02747 7.80836 10.9471 15.2258 18.5524 21.3811
wr-rand 7.07295 7.21101 7.92307 7.40394 8.92114 9.55323 9.14714 8.94196 8.94335 9.37448 9.60265 11.7665 15.8043 19.1617 22.6785

  • small values (<10k): the L1 cache is 32k and can hold a 4k array of uint64. Note that, due to the randomness of the index, after ~1/8 of the iterations the L1 cache will be completely filled with values of the random index array (since cache lines are 64 bytes and can hold 8 array elements). Accesses to the other, linear arrays will then rapidly generate many L1 misses, and we have to use the L2 cache. An L1 cache access takes 5 cycles, but it is pipelined and can serve a couple of values per cycle. An L2 access is longer and requires 12 cycles. The number of misses is similar for random reads and writes, but we see that we fully pay the double access required for writes when the array size is small.

  • medium values (10k-100k): the L2 cache is 256k and can hold a 32k int64 array. After that, we need to go to the L3 cache (12 MB). As the size increases, the number of misses in L1 and L2 increases, and the computation time increases accordingly. Both algorithms have a similar number of misses, mostly due to the random reads or writes (the other accesses are linear and can be very efficiently prefetched by the caches). We recover the factor of two between random reads and writes already noted in B.M.'s answer. It can be partly explained by the double cost of writes.

  • large values (>100k): the difference between the methods is progressively reduced. For these sizes, a large part of the information is stored in the L3 cache. The L3 size is sufficient to hold a full array of 1.5M elements, and lines are less likely to be ejected. Hence, for writes, after the initial read, a larger number of writes can be done without line ejection, and the relative cost of writes vs reads is reduced. For these large sizes, there are also many other factors that need to be considered. For instance, caches can only serve a limited number of misses (typ. 16), and when the number of misses is large, this may be the limiting factor. (A quick check of the cache capacities used in these bullets follows this list.)
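
As a quick sanity check of the capacities used above (my own arithmetic, assuming 8-byte elements, 64-byte lines, and the 32 KB / 256 KB / 12 MB sizes quoted in the bullets):

#include <stdio.h>

int main(void)
{
    const long elem = 8;            /* int64 element size in bytes */
    const long line = 64;           /* cache line size in bytes    */
    const long l1 = 32L << 10;      /* 32 KB L1 data cache         */
    const long l2 = 256L << 10;     /* 256 KB L2 cache             */
    const long l3 = 12L << 20;      /* 12 MB L3 cache              */

    printf("elements per line: %ld\n", line / elem);  /* 8           */
    printf("L1 holds %ld elements\n", l1 / elem);     /* 4096 (4k)   */
    printf("L2 holds %ld elements\n", l2 / elem);     /* 32768 (32k) */
    printf("L3 holds %ld elements\n", l3 / elem);     /* ~1.5M       */
    return 0;
}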

One word on the parallel OpenMP version of random reads and writes. Except for small sizes, where having the random index array spread over several caches may not be an advantage, they are systematically about twice as fast. For large sizes, we clearly see that the gap between random reads and writes increases due to false sharing.

It is almost impossible to make quantitative predictions given the complexity of present computer architectures, even for simple code, and even qualitative explanations of the behaviour are difficult and must take into account many factors. As mentioned in other answers, software aspects related to python can also have an impact. But while it may happen in some situations, most of the time one cannot consider that reads are more expensive because of data dependency.
