大内存映射数组的高效点积 [英] Efficient dot products of large memory-mapped arrays
问题描述
我正在处理一些相当大的、密集的 numpy 浮点数组,这些数组当前驻留在 PyTables CArray
中的磁盘上.我需要能够使用这些数组执行高效的点积,例如 C = A.dot(B)
,其中 A
是一个巨大的 (~1E4 x 3E5 float32) 内存映射数组,B
和 C
是驻留在核心内存中的较小的 numpy 数组.
I'm working with some rather large, dense numpy float arrays that currently reside on disk in PyTables CArray
s. I need to be able to perform efficient dot products using these arrays, for example C = A.dot(B)
, where A
is a huge (~1E4 x 3E5 float32) memory-mapped array, and B
and C
are smaller numpy arrays that are resident in core memory.
我目前正在做的是使用 np.memmap
将数据复制到内存映射的 numpy 数组中,然后直接在内存上调用 np.dot
-映射数组.这有效,但我怀疑标准 np.dot
(或者更确切地说它调用的底层 BLAS 函数)在计算所需的 I/O 操作数量方面可能不是很有效结果.
What I'm doing at the moment is copying the data into memory-mapped numpy arrays using np.memmap
, then calling np.dot
directly on the memory-mapped arrays. This works, but I suspect that the standard np.dot
(or rather the underlying BLAS functions it calls) is probably not very efficient in terms of the number of I/O operations required in order to compute the result.
我在这篇评论文章中遇到了一个有趣的例子.使用 3x 嵌套循环计算的简单点积,如下所示:
I came across an interesting example in this review article. A naive dot product computed using 3x nested loops, like this:
def naive_dot(A, B, C):
for ii in xrange(n):
for jj in xrange(n):
C[ii,jj] = 0
for kk in xrange(n):
C[ii,jj] += A[ii,kk]*B[kk,jj]
return C
需要 O(n^3) 个 I/O 操作来计算.
requires O(n^3) I/O operations to compute.
但是,通过在适当大小的块中处理数组:
However, by processing the arrays in appropriately-sized blocks:
def block_dot(A, B, C, M):
b = sqrt(M / 3)
for ii in xrange(0, n, b):
for jj in xrange(0, n, b):
C[ii:ii+b,jj:jj+b] = 0
for kk in xrange(0, n, b):
C[ii:ii+b,jj:jj+b] += naive_dot(A[ii:ii+b,kk:kk+b],
B[kk:kk+b,jj:jj+b],
C[ii:ii+b,jj:jj+b])
return C
其中 M
是适合核心内存的最大元素数,I/O 操作的数量减少到 O(n^3/sqrt(M)).
where M
is the maximum number of elements that will fit into core memory, the number of I/O operations is reduced to O(n^3 / sqrt(M)).
np.dot
和/或 np.memmap
有多聪明?调用 np.dot
是否会执行 I/O 高效的块状点积?np.memmap
是否做了任何可以提高此类操作效率的奇特缓存?
How smart is np.dot
and/or np.memmap
? Does calling np.dot
perform an I/O-efficient blockwise dot product? Does np.memmap
do any fancy caching that would improve the efficiency of this type of operation?
如果没有,是否有一些预先存在的库函数可以执行 I/O 高效的点积,还是应该自己尝试实现?
If not, is there some pre-existing library function that performs I/O efficient dot products, or should I try and implement it myself?
我已经对 np.dot
的手动实现做了一些基准测试,该实现对输入数组的块进行操作,这些块被显式读入核心内存.这些数据至少部分解决了我最初的问题,所以我将其发布为答案.
I've done some benchmarking with a hand-rolled implementation of np.dot
that operates on blocks of the input array, which are explicitly read into core memory. This data at least a partially addresses my original question, so I'm posting it as an answer.
推荐答案
我实现了一个函数,用于将 np.dot
应用于从内存映射数组显式读入核心内存的块:
I've implemented a function for applying np.dot
to blocks that are explicitly read into core memory from the memory-mapped array:
import numpy as np
def _block_slices(dim_size, block_size):
"""Generator that yields slice objects for indexing into
sequential blocks of an array along a particular axis
"""
count = 0
while True:
yield slice(count, count + block_size, 1)
count += block_size
if count > dim_size:
raise StopIteration
def blockwise_dot(A, B, max_elements=int(2**27), out=None):
"""
Computes the dot product of two matrices in a block-wise fashion.
Only blocks of `A` with a maximum size of `max_elements` will be
processed simultaneously.
"""
m, n = A.shape
n1, o = B.shape
if n1 != n:
raise ValueError('matrices are not aligned')
if A.flags.f_contiguous:
# prioritize processing as many columns of A as possible
max_cols = max(1, max_elements / m)
max_rows = max_elements / max_cols
else:
# prioritize processing as many rows of A as possible
max_rows = max(1, max_elements / n)
max_cols = max_elements / max_rows
if out is None:
out = np.empty((m, o), dtype=np.result_type(A, B))
elif out.shape != (m, o):
raise ValueError('output array has incorrect dimensions')
for mm in _block_slices(m, max_rows):
out[mm, :] = 0
for nn in _block_slices(n, max_cols):
A_block = A[mm, nn].copy() # copy to force a read
out[mm, :] += np.dot(A_block, B[nn, :])
del A_block
return out
然后我做了一些基准测试,将我的 blockwise_dot
函数与直接应用于内存映射数组的普通 np.dot
函数进行比较(参见下面的基准测试脚本).我正在使用 numpy 1.9.0.dev-205598b 链接到 OpenBLAS v0.2.9.rc1(从源代码编译).这台机器是一台运行 Ubuntu 13.10、8GB RAM 和 SSD 的四核笔记本电脑,我已经禁用了交换文件.
I then did some benchmarking to compare my blockwise_dot
function to the normal np.dot
function applied directly to a memory-mapped array (see below for the benchmarking script). I'm using numpy 1.9.0.dev-205598b linked against OpenBLAS v0.2.9.rc1 (compiled from source). The machine is a quad-core laptop running Ubuntu 13.10, with 8GB RAM and an SSD, and I've disabled the swap file.
正如@Bi Rico 预测的那样,相对于 A
的维度,计算点积所花费的时间非常O(n).与仅在整个内存映射数组上调用普通 np.dot
函数相比,对 A
的缓存块进行操作带来了巨大的性能提升:
As @Bi Rico predicted, the time taken to compute the dot product is beautifully O(n) with respect to the dimensions of A
. Operating on cached blocks of A
gives a huge performance improvement over just calling the normal np.dot
function on the whole memory-mapped array:
令人惊讶的是,它对正在处理的块的大小不敏感 - 处理 1GB、2GB 或 4GB 块中的阵列所花费的时间几乎没有区别.我得出的结论是,无论缓存 np.memmap
数组本身实现了什么,它似乎都不太适合计算点积.
It's surprisingly insensitive to the size of the blocks being processed - there's very little difference between the time taken to process the array in blocks of 1GB, 2GB or 4GB. I conclude that whatever caching np.memmap
arrays natively implement, it seems to be very suboptimal for computing dot products.
手动实现这种缓存策略仍然有点痛苦,因为我的代码可能必须在具有不同物理内存量和可能不同操作系统的机器上运行.出于这个原因,我仍然对是否有办法控制内存映射数组的缓存行为以提高 np.dot
的性能感兴趣.
It's still a bit of a pain to have to manually implement this caching strategy, since my code will probably have to run on machines with different amounts of physical memory, and potentially different operating systems. For that reason I'm still interested in whether there are ways to control the caching behaviour of memory-mapped arrays in order to improve the performance of np.dot
.
我在运行基准测试时注意到一些奇怪的内存处理行为 - 当我在整个 A
上调用 np.dot
时,我从来没有看到常驻集大小我的 Python 进程超过了大约 3.8GB,即使我有大约 7.5GB 的可用 RAM.这让我怀疑 np.memmap
数组允许占用的物理内存量有一些限制 - 我之前假设它会使用操作系统允许它获取的任何 RAM.就我而言,能够增加此限制可能非常有益.
I noticed some odd memory handling behaviour as I was running the benchmarks - when I called np.dot
on the whole of A
I never saw the resident set size of my Python process exceed about 3.8GB, even though I have about 7.5GB of RAM free. This leads me to suspect that there is some limit imposed on the amount of physical memory an np.memmap
array is allowed to occupy - I had previously assumed that it would use whatever RAM the OS allows it to grab. In my case it might be very beneficial to be able to increase this limit.
有没有人对 np.memmap
数组的缓存行为有任何进一步的了解来帮助解释这一点?
Does anyone have any further insight into the caching behaviour of np.memmap
arrays that would help to explain this?
def generate_random_mmarray(shape, fp, max_elements):
A = np.memmap(fp, dtype=np.float32, mode='w+', shape=shape)
max_rows = max(1, max_elements / shape[1])
max_cols = max_elements / max_rows
for rr in _block_slices(shape[0], max_rows):
for cc in _block_slices(shape[1], max_cols):
A[rr, cc] = np.random.randn(*A[rr, cc].shape)
return A
def run_bench(n_gigabytes=np.array([16]), max_block_gigabytes=6, reps=3,
fpath='temp_array'):
"""
time C = A * B, where A is a big (n, n) memory-mapped array, and B and C are
(n, o) arrays resident in core memory
"""
standard_times = []
blockwise_times = []
differences = []
nbytes = n_gigabytes * 2 ** 30
o = 64
# float32 elements
max_elements = int((max_block_gigabytes * 2 ** 30) / 4)
for nb in nbytes:
# float32 elements
n = int(np.sqrt(nb / 4))
with open(fpath, 'w+') as f:
A = generate_random_mmarray((n, n), f, (max_elements / 2))
B = np.random.randn(n, o).astype(np.float32)
print "\n" + "-"*60
print "A: %s\t(%i bytes)" %(A.shape, A.nbytes)
print "B: %s\t\t(%i bytes)" %(B.shape, B.nbytes)
best = np.inf
for _ in xrange(reps):
tic = time.time()
res1 = np.dot(A, B)
t = time.time() - tic
best = min(best, t)
print "Normal dot:\t%imin %.2fsec" %divmod(best, 60)
standard_times.append(best)
best = np.inf
for _ in xrange(reps):
tic = time.time()
res2 = blockwise_dot(A, B, max_elements=max_elements)
t = time.time() - tic
best = min(best, t)
print "Block-wise dot:\t%imin %.2fsec" %divmod(best, 60)
blockwise_times.append(best)
diff = np.linalg.norm(res1 - res2)
print "L2 norm of difference:\t%g" %diff
differences.append(diff)
del A, B
del res1, res2
os.remove(fpath)
return (np.array(standard_times), np.array(blockwise_times),
np.array(differences))
if __name__ == '__main__':
n = np.logspace(2,5,4,base=2)
standard_times, blockwise_times, differences = run_bench(
n_gigabytes=n,
max_block_gigabytes=4)
np.savez('bench_results', standard_times=standard_times,
blockwise_times=blockwise_times, differences=differences)
这篇关于大内存映射数组的高效点积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!