Broadcasted NumPy arithmetic - why is one method so much more performant?

Question

This question is a follow-up to my answer in Efficient way to compute the Vandermonde matrix.

Here's the setup:

import numpy as np

x = np.arange(5000)  # an integer array
N = 4

Now, I'll compute the Vandermonde matrix in two different ways:

m1 = (x ** np.arange(N)[:, None]).T

And,

m2 = x[:, None] ** np.arange(N)

Sanity check:

np.array_equal(m1, m2)
True

These methods are identical, but their performance is not:

%timeit m1 = (x ** np.arange(N)[:, None]).T
42.7 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit m2 = x[:, None] ** np.arange(N)
150 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So, the first method, despite requiring a transposition at the end, is still over 3X faster than the second method.

The only difference is that in the first case, the smaller array is broadcast, whereas in the second case, it is the larger one.
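For reference, the result shapes of the two broadcasts can be checked directly with np.broadcast (a quick illustrative check, not part of the original timings):

np.broadcast(x, np.arange(N)[:, None]).shape   # (4, 5000) -- transposed afterwards to get m1
np.broadcast(x[:, None], np.arange(N)).shape   # (5000, 4) -- the shape of m2 directly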

So, with a fairly decent understanding of how numpy works, I can guess that the answer would involve the cache. The first method is a lot more cache friendly than the second. However, I'd like an official word from someone with more experience than me.

What could be the reason for this stark contrast in timings?

Solution

While I'm afraid my conclusion won't be more fundamental than yours ("probably caching"), I believe I can help focus our attention with a set of more localized tests.

Consider your example problem:

M, N = 5000, 4
x1 = np.arange(M)
y1 = np.arange(N)[:, None]
x2 = np.arange(M)[:, None]
y2 = np.arange(N)
x1_bc, y1_bc = np.broadcast_arrays(x1, y1)
x2_bc, y2_bc = np.broadcast_arrays(x2, y2)
x1_cont, y1_cont, x2_cont, y2_cont = map(np.ascontiguousarray,
                                         [x1_bc, y1_bc, x2_bc, y2_bc])

As you can see, I defined a bunch of arrays to compare. x1, y1 and x2, y2, respectively, correspond to your original test cases. ??_bc correspond to explicitly broadcast versions of these arrays. These share the data with the original ones, but they have explicit 0-strides in order to get the appropriate shape. Finally, ??_cont are contiguous versions of these broadcast arrays, as if constructed with np.tile.

So x1_bc, y1_bc, x1_cont and y1_cont all have shape (4, 5000), but while the former two have zero strides, the latter two are contiguous arrays. For all intents and purposes, taking the power of any of these corresponding pairs of arrays should give us the same contiguous result (as hpaulj noted in a comment, a transposition itself is essentially free, so I'm going to ignore that outermost transpose in the following).
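As a quick sanity check on those claims (an illustrative sketch; the exact stride values assume the default 8-byte integer dtype), you can inspect the strides and confirm which arrays are views and which are copies:

for name, a in [("x1_bc", x1_bc), ("y1_bc", y1_bc),
                ("x1_cont", x1_cont), ("y1_cont", y1_cont)]:
    print(name, a.shape, a.strides)
# x1_bc (4, 5000) (0, 8)        <- zero stride along the broadcast leading axis
# y1_bc (4, 5000) (8, 0)        <- zero stride along the trailing axis
# x1_cont (4, 5000) (40000, 8)  <- ordinary C-contiguous copy
# y1_cont (4, 5000) (40000, 8)

np.shares_memory(x1, x1_bc)    # True: the broadcast version is a view of x1
np.shares_memory(x1, x1_cont)  # False: np.ascontiguousarray made a real copy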

Here are the timings corresponding to your original check:

In [143]: %timeit x1 ** y1
     ...: %timeit x2 ** y2
     ...: 
52.2 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
96 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Here are the timings for the explicitly broadcast arrays:

In [144]: %timeit x1_bc ** y1_bc
     ...: %timeit x2_bc ** y2_bc
     ...: 
54.1 µs ± 906 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
99.1 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Same thing. This tells me that the discrepancy isn't somehow due to the transition from the indexed expressions to the broadcast arrays. This was mostly expected, but it never hurts to check.

Finally, the contiguous arrays:

In [146]: %timeit x1_cont ** y1_cont
     ...: %timeit x2_cont ** y2_cont
     ...: 
38.9 µs ± 529 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.6 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

A huge part of the discrepancy goes away!

So why did I check this? There is a general rule of thumb that you can make good use of CPU caching if you apply vectorized operations along large trailing dimensions in python. To be more specific, for row-major ("C-order") arrays the trailing dimensions are contiguous, while for column-major ("fortran-order") arrays the leading dimensions are contiguous. For large enough dimensions, arr.sum(axis=-1) should be faster than arr.sum(axis=0) for row-major numpy arrays, give or take some fine print.
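If you want to see that rule of thumb in action, a quick experiment along the following lines works (a sketch only; it assumes a large C-order array, and the actual numbers depend on your machine):

import numpy as np
from timeit import timeit

arr = np.random.rand(5000, 5000)  # row-major ("C-order") by default
t_trailing = timeit(lambda: arr.sum(axis=-1), number=20)  # reduce along the contiguous axis
t_leading = timeit(lambda: arr.sum(axis=0), number=20)    # reduce along the strided axis
print(t_trailing, t_leading)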

What happens here is that there is a huge difference between the two dimensions (size 4 and 5000, respectively), but the huge performance asymmetry between the two transposed cases only happens for the broadcasting case. My admittedly handwaving impression is that broadcasting uses 0-strides to construct views of appropriate size. These 0-strides imply that in the faster case memory access looks like this for the long x array:

[mem0,mem1,mem2,...,mem4999, mem0,mem1,mem2,...,mem4999, ...] # and so on

where mem* just denotes a value of x sitting somewhere in RAM. Compare this to the slower case where we're working with shape (5000, 4):

[mem0,mem0,mem0,mem0, mem1,mem1,mem1,mem1, mem2,mem2,mem2,mem2, ...]

My naive notion is that working with the former allows the CPU to cache larger chunks of the individual values of x at a time, so performance is great. In the latter case the 0-strides make the CPU read the same memory address of x four times in a row, and do this 5000 times over. I find it reasonable to believe that this setup works against caching, leading to overall bad performance. This would also be in agreement with the fact that the contiguous cases don't show this performance hit: there the CPU has to work with all 5000*4 unique values, and caching might not be impeded by these weird reads.
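To make those two access patterns concrete, here is a small illustrative sketch (not from the original timings) that spells out the order in which indices of x would be visited if the broadcast result is filled in C order:

idx = np.arange(M)  # index stand-ins for the memory locations mem0, mem1, ...

fast_order = np.broadcast_to(idx, (N, M)).ravel()           # the shape (4, 5000) case
slow_order = np.broadcast_to(idx[:, None], (M, N)).ravel()  # the shape (5000, 4) case

print(fast_order[:8])  # [0 1 2 3 4 5 6 7] -- long contiguous runs through x
print(slow_order[:8])  # [0 0 0 0 1 1 1 1] -- the same element read four times in a row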
