大内存映射阵列的有效点积 [英] Efficient dot products of large memory-mapped arrays
问题描述
我与一些相当大的,密集的numpy的浮动,目前驻留在PyTables CARRAY
■磁盘阵列工作。我需要能够使用这些阵列来进行高效率的点产品,例如 C = A.dot(B)
,其中 A
是一个巨大的(〜1E4点¯x3E5 FLOAT32)内存映射数组,而 B
和 C
较小numpy的数组,其驻留在核心显存。
I'm working with some rather large, dense numpy float arrays that currently reside on disk in PyTables CArray
s. I need to be able to perform efficient dot products using these arrays, for example C = A.dot(B)
, where A
is a huge (~1E4 x 3E5 float32) memory-mapped array, and B
and C
are smaller numpy arrays that are resident in core memory.
什么,我现在在做使用 np.memmap
正在复制数据到内存映射numpy的数组,然后调用 np.dot
直接在内存映射阵列。这工作,但我怀疑,该标准 np.dot
(或相反,它调用底层的BLAS功能)可能不是非常高效的I / O操作的数量方面为了计算结果所需
What I'm doing at the moment is copying the data into memory-mapped numpy arrays using np.memmap
, then calling np.dot
directly on the memory-mapped arrays. This works, but I suspect that the standard np.dot
(or rather the underlying BLAS functions it calls) is probably not very efficient in terms of the number of I/O operations required in order to compute the result.
我碰到一个有趣的例子在这篇综述来了。一个天真积3倍使用嵌套循环计算的,就像这样:
I came across an interesting example in this review article. A naive dot product computed using 3x nested loops, like this:
def naive_dot(A, B, C):
for ii in xrange(n):
for jj in xrange(n):
C[ii,jj] = 0
for kk in xrange(n):
C[ii,jj] += A[ii,kk]*B[kk,jj]
return C
需要的为O(n ^ 3)的I / O操作来计算。
requires O(n^3) I/O operations to compute.
然而,通过在适当大小的块处理所述阵列
However, by processing the arrays in appropriately-sized blocks:
def block_dot(A, B, C, M):
b = sqrt(M / 3)
for ii in xrange(0, n, b):
for jj in xrange(0, n, b):
C[ii:ii+b,jj:jj+b] = 0
for kk in xrange(0, n, b):
C[ii:ii+b,jj:jj+b] += naive_dot(A[ii:ii+b,kk:kk+b],
B[kk:kk+b,jj:jj+b],
C[ii:ii+b,jj:jj+b])
return C
,其中 M
是将适合芯存储器单元的最大数目,I / O操作的数目被减少到的为O(n ^ 3 /开方(M))的
where M
is the maximum number of elements that will fit into core memory, the number of I/O operations is reduced to O(n^3 / sqrt(M)).
如何聪明的是 np.dot
和/或 np.memmap
?并调用 np.dot
执行I / O效率的程序段点的产品?请问 np.memmap
做任何花哨的缓存将提高此类操作的效率?
How smart is np.dot
and/or np.memmap
? Does calling np.dot
perform an I/O-efficient blockwise dot product? Does np.memmap
do any fancy caching that would improve the efficiency of this type of operation?
如果不是,是否有执行I / O效率的点积一些pre-现有的库函数,或者我应该尝试并实现它自己?
If not, is there some pre-existing library function that performs I/O efficient dot products, or should I try and implement it myself?
我已经做了一些基准测试与 np.dot
的手卷实现,输入阵列,这是明确的读入核心显存的块进行操作。这个数据至少部分解决了我原来的问题,所以我张贴作为一个答案。
I've done some benchmarking with a hand-rolled implementation of np.dot
that operates on blocks of the input array, which are explicitly read into core memory. This data at least a partially addresses my original question, so I'm posting it as an answer.
推荐答案
我已经实现了一个功能应用 np.dot
来块被明确地读入内存中的核心从存储器映射阵列
I've implemented a function for applying np.dot
to blocks that are explicitly read into core memory from the memory-mapped array:
import numpy as np
def _block_slices(dim_size, block_size):
"""Generator that yields slice objects for indexing into
sequential blocks of an array along a particular axis
"""
count = 0
while True:
yield slice(count, count + block_size, 1)
count += block_size
if count > dim_size:
raise StopIteration
def blockwise_dot(A, B, max_elements=int(2**27), out=None):
"""
Computes the dot product of two matrices in a block-wise fashion.
Only blocks of `A` with a maximum size of `max_elements` will be
processed simultaneously.
"""
m, n = A.shape
n1, o = B.shape
if n1 != n:
raise ValueError('matrices are not aligned')
if A.flags.f_contiguous:
# prioritize processing as many columns of A as possible
max_cols = max(1, max_elements / m)
max_rows = max_elements / max_cols
else:
# prioritize processing as many rows of A as possible
max_rows = max(1, max_elements / n)
max_cols = max_elements / max_rows
if out is None:
out = np.empty((m, o), dtype=np.result_type(A, B))
elif out.shape != (m, o):
raise ValueError('output array has incorrect dimensions')
for mm in _block_slices(m, max_rows):
out[mm, :] = 0
for nn in _block_slices(n, max_cols):
A_block = A[mm, nn].copy() # copy to force a read
out[mm, :] += np.dot(A_block, B[nn, :])
del A_block
return out
然后我做了一些基准测试我的 blockwise_dot
功能比较正常 np.dot
功能直接应用到内存映射阵列(参见下面的基准测试脚本)。我使用numpy的1.9.0.dev-205598b对OpenBLAS v0.2.9.rc1链接(从源代码编译)。该机是一款四核笔记本电脑运行Ubuntu 13.10,有8GB内存和SSD产品,我已禁用交换文件。
I then did some benchmarking to compare my blockwise_dot
function to the normal np.dot
function applied directly to a memory-mapped array (see below for the benchmarking script). I'm using numpy 1.9.0.dev-205598b linked against OpenBLAS v0.2.9.rc1 (compiled from source). The machine is a quad-core laptop running Ubuntu 13.10, with 8GB RAM and an SSD, and I've disabled the swap file.
由于@Bi波多黎各predicted,计算点积所用的时间是美丽的 O(N)的关于 A $ C尺寸$ C>。操作上
A
的缓存块给出了只调用正常 np.dot
功能,整个存储一个巨大的性能改进-mapped数组:
As @Bi Rico predicted, the time taken to compute the dot product is beautifully O(n) with respect to the dimensions of A
. Operating on cached blocks of A
gives a huge performance improvement over just calling the normal np.dot
function on the whole memory-mapped array:
这是令人惊讶的不敏感的块的大小被加工 - 那里来处理在1GB,2GB或4GB的块阵列所用的时间之间的差别很小的。我的结论是什么缓存 np.memmap
阵列本身实现,这似乎是计算点公司的产品非常理想。
It's surprisingly insensitive to the size of the blocks being processed - there's very little difference between the time taken to process the array in blocks of 1GB, 2GB or 4GB. I conclude that whatever caching np.memmap
arrays natively implement, it seems to be very suboptimal for computing dot products.
它仍然是一个有点疼痛不得不手动执行这种缓存策略,因为我的code可能要与不同数量的物理内存,并有可能不同操作系统的计算机上运行。出于这个原因,我在是否有办法控制,以提高 np.dot
的高性能内存映射阵列的缓存行为仍然有兴趣。
It's still a bit of a pain to have to manually implement this caching strategy, since my code will probably have to run on machines with different amounts of physical memory, and potentially different operating systems. For that reason I'm still interested in whether there are ways to control the caching behaviour of memory-mapped arrays in order to improve the performance of np.dot
.
我注意到一些奇怪的内存处理,因为我是运行标准的行为 - 当我叫 np.dot
全 A $的上C $ C>我从来没有见过我的Python进程的驻留集大小超过约3.8GB,即使我有大约可用RAM 7.5GB。这使我怀疑有委予
np.memmap
阵列允许占用的物理内存数量限制一些 - 我有previously认为它会使用任何RAM的操作系统允许它抓住。在我的情况下,它可能是非常有益的,能够增加此限制。
I noticed some odd memory handling behaviour as I was running the benchmarks - when I called np.dot
on the whole of A
I never saw the resident set size of my Python process exceed about 3.8GB, even though I have about 7.5GB of RAM free. This leads me to suspect that there is some limit imposed on the amount of physical memory an np.memmap
array is allowed to occupy - I had previously assumed that it would use whatever RAM the OS allows it to grab. In my case it might be very beneficial to be able to increase this limit.
没有人有任何进一步的深入了解 np.memmap
阵列,这将有助于解释这个的缓存行为?
Does anyone have any further insight into the caching behaviour of np.memmap
arrays that would help to explain this?
def generate_random_mmarray(shape, fp, max_elements):
A = np.memmap(fp, dtype=np.float32, mode='w+', shape=shape)
max_rows = max(1, max_elements / shape[1])
max_cols = max_elements / max_rows
for rr in _block_slices(shape[0], max_rows):
for cc in _block_slices(shape[1], max_cols):
A[rr, cc] = np.random.randn(*A[rr, cc].shape)
return A
def run_bench(n_gigabytes=np.array([16]), max_block_gigabytes=6, reps=3,
fpath='temp_array'):
"""
time C = A * B, where A is a big (n, n) memory-mapped array, and B and C are
(n, o) arrays resident in core memory
"""
standard_times = []
blockwise_times = []
differences = []
nbytes = n_gigabytes * 2 ** 30
o = 64
# float32 elements
max_elements = int((max_block_gigabytes * 2 ** 30) / 4)
for nb in nbytes:
# float32 elements
n = int(np.sqrt(nb / 4))
with open(fpath, 'w+') as f:
A = generate_random_mmarray((n, n), f, (max_elements / 2))
B = np.random.randn(n, o).astype(np.float32)
print "\n" + "-"*60
print "A: %s\t(%i bytes)" %(A.shape, A.nbytes)
print "B: %s\t\t(%i bytes)" %(B.shape, B.nbytes)
best = np.inf
for _ in xrange(reps):
tic = time.time()
res1 = np.dot(A, B)
t = time.time() - tic
best = min(best, t)
print "Normal dot:\t%imin %.2fsec" %divmod(best, 60)
standard_times.append(best)
best = np.inf
for _ in xrange(reps):
tic = time.time()
res2 = blockwise_dot(A, B, max_elements=max_elements)
t = time.time() - tic
best = min(best, t)
print "Block-wise dot:\t%imin %.2fsec" %divmod(best, 60)
blockwise_times.append(best)
diff = np.linalg.norm(res1 - res2)
print "L2 norm of difference:\t%g" %diff
differences.append(diff)
del A, B
del res1, res2
os.remove(fpath)
return (np.array(standard_times), np.array(blockwise_times),
np.array(differences))
if __name__ == '__main__':
n = np.logspace(2,5,4,base=2)
standard_times, blockwise_times, differences = run_bench(
n_gigabytes=n,
max_block_gigabytes=4)
np.savez('bench_results', standard_times=standard_times,
blockwise_times=blockwise_times, differences=differences)
这篇关于大内存映射阵列的有效点积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!