Efficient dot products of large memory-mapped arrays

Question

I'm working with some rather large, dense numpy float arrays that currently reside on disk in PyTables CArrays. I need to be able to perform efficient dot products using these arrays, for example C = A.dot(B), where A is a huge (~1E4 x 3E5 float32) memory-mapped array, and B and C are smaller numpy arrays that are resident in core memory.

What I'm doing at the moment is copying the data into memory-mapped numpy arrays using np.memmap, then calling np.dot directly on the memory-mapped arrays. This works, but I suspect that the standard np.dot (or rather the underlying BLAS functions it calls) is probably not very efficient in terms of the number of I/O operations required in order to compute the result.
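
For concreteness, a minimal sketch of this current approach (the file name here is hypothetical; the shapes match the sizes above):

import numpy as np

# memory-map the on-disk float32 array (written out beforehand)
A = np.memmap('A.dat', dtype=np.float32, mode='r', shape=(10000, 300000))
B = np.random.randn(300000, 64).astype(np.float32)  # small, in core memory
C = A.dot(B)  # BLAS reads pages of A in through the memory mapping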

I came across an interesting example in this review article. A naive dot product computed using three nested loops, like this:

def naive_dot(A, B, C):
    n = A.shape[0]  # assumes square (n, n) matrices
    for ii in xrange(n):
        for jj in xrange(n):
            C[ii, jj] = 0
            for kk in xrange(n):
                C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C

requires O(n^3) I/O operations to compute, since the inner loop visits elements of A and B in an order with essentially no reuse, so in the worst case every access incurs a separate read from disk.

However, by processing the arrays in appropriately-sized blocks:

import numpy as np

def block_dot(A, B, C, M):
    n = A.shape[0]             # assumes square (n, n) matrices, with b dividing n
    b = int((M / 3) ** 0.5)    # block size: three b x b blocks fit in core memory
    for ii in xrange(0, n, b):
        for jj in xrange(0, n, b):
            C[ii:ii+b, jj:jj+b] = 0
            for kk in xrange(0, n, b):
                # accumulate into a scratch block, so that naive_dot doesn't
                # zero out (and alias) the block of C being accumulated into
                C[ii:ii+b, jj:jj+b] += naive_dot(A[ii:ii+b, kk:kk+b],
                                                 B[kk:kk+b, jj:jj+b],
                                                 np.empty((b, b), dtype=C.dtype))
    return C

where M is the maximum number of elements that will fit into core memory, the number of I/O operations is reduced to O(n^3 / sqrt(M)).
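
To put rough numbers on this (an illustrative back-of-the-envelope calculation with a hypothetical n, not taken from the article):

from math import sqrt

M = 2 ** 27                  # elements that fit in core (512 MB of float32)
n = 30000                    # hypothetical square matrix dimension
b = int(sqrt(M / 3))         # block size: three b x b blocks in core, b = 6688
print n ** 3                 # naive I/O count:   2.7E13
print int(n ** 3 / sqrt(M))  # blocked I/O count: ~2.3E9, i.e. ~11585x fewer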

How smart is np.dot and/or np.memmap? Does calling np.dot perform an I/O-efficient blockwise dot product? Does np.memmap do any fancy caching that would improve the efficiency of this type of operation?

If not, is there some pre-existing library function that performs I/O efficient dot products, or should I try and implement it myself?

I've done some benchmarking with a hand-rolled implementation of np.dot that operates on blocks of the input array, which are explicitly read into core memory. This data at least partially addresses my original question, so I'm posting it as an answer.

Answer

I've implemented a function for applying np.dot to blocks that are explicitly read into core memory from the memory-mapped array:

import numpy as np

def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    while count < dim_size:
        yield slice(count, count + block_size, 1)
        count += block_size
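
# For example: list(_block_slices(10, 4)) yields [slice(0, 4, 1),
# slice(4, 8, 1), slice(8, 12, 1)]; the final slice overruns dim_size,
# but numpy's slicing rules clip it to the array's actual extent.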

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion. 
    Only blocks of `A` with a maximum size of `max_elements` will be 
    processed simultaneously.
    """

    m,  n = A.shape
    n1, o = B.shape

    if n1 != n:
        raise ValueError('matrices are not aligned')

    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        max_cols = max(1, max_elements // m)
        max_rows = max_elements // max_cols
    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements // n)
        max_cols = max_elements // max_rows

    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')

    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block

    return out
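
For illustration, a hypothetical call using the array sizes from the question (the file name is an assumption):

A = np.memmap('A.dat', dtype=np.float32, mode='r', shape=(10000, 300000))
B = np.random.randn(300000, 64).astype(np.float32)

# process at most 2**27 float32 elements (512 MB) of A at a time
C = blockwise_dot(A, B, max_elements=int(2**27))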

I then did some benchmarking to compare my blockwise_dot function to the normal np.dot function applied directly to a memory-mapped array (see below for the benchmarking script). I'm using numpy 1.9.0.dev-205598b linked against OpenBLAS v0.2.9.rc1 (compiled from source). The machine is a quad-core laptop running Ubuntu 13.10, with 8GB RAM and an SSD, and I've disabled the swap file.

As @Bi Rico predicted, the time taken to compute the dot product is beautifully O(n) with respect to the dimensions of A. Operating on cached blocks of A gives a huge performance improvement over just calling the normal np.dot function on the whole memory-mapped array.

It's surprisingly insensitive to the size of the blocks being processed - there's very little difference between the time taken to process the array in blocks of 1GB, 2GB or 4GB. I conclude that whatever caching np.memmap arrays implement natively seems to be poorly suited to computing dot products.

It's still a bit of a pain to have to manually implement this caching strategy, since my code will probably have to run on machines with different amounts of physical memory, and potentially different operating systems. For that reason I'm still interested in whether there are ways to control the caching behaviour of memory-mapped arrays in order to improve the performance of np.dot.

I noticed some odd memory handling behaviour as I was running the benchmarks - when I called np.dot on the whole of A I never saw the resident set size of my Python process exceed about 3.8GB, even though I have about 7.5GB of RAM free. This leads me to suspect that there is some limit imposed on the amount of physical memory an np.memmap array is allowed to occupy - I had previously assumed that it would use whatever RAM the OS allows it to grab. In my case it might be very beneficial to be able to increase this limit.

Does anyone have any further insight into the caching behaviour of np.memmap arrays that would help to explain this?

The benchmarking script:

import os
import time
import numpy as np

def generate_random_mmarray(shape, fp, max_elements):
    A = np.memmap(fp, dtype=np.float32, mode='w+', shape=shape)
    max_rows = max(1, max_elements // shape[1])
    max_cols = max_elements // max_rows
    for rr in _block_slices(shape[0], max_rows):
        for cc in _block_slices(shape[1], max_cols):
            A[rr, cc] = np.random.randn(*A[rr, cc].shape)
    return A

def run_bench(n_gigabytes=np.array([16]), max_block_gigabytes=6, reps=3,
              fpath='temp_array'):
    """
    time C = A * B, where A is a big (n, n) memory-mapped array, and B and C are
    (n, o) arrays resident in core memory
    """

    standard_times = []
    blockwise_times = []
    differences = []
    nbytes = n_gigabytes * 2 ** 30
    o = 64

    # float32 elements
    max_elements = int((max_block_gigabytes * 2 ** 30) / 4)

    for nb in nbytes:

        # float32 elements
        n = int(np.sqrt(nb / 4))

        with open(fpath, 'w+') as f:
            A = generate_random_mmarray((n, n), f, max_elements // 2)
            B = np.random.randn(n, o).astype(np.float32)

            print "\n" + "-"*60
            print "A: %s\t(%i bytes)" %(A.shape, A.nbytes)
            print "B: %s\t\t(%i bytes)" %(B.shape, B.nbytes)

            best = np.inf
            for _ in xrange(reps):
                tic = time.time()
                res1 = np.dot(A, B)
                t = time.time() - tic
                best = min(best, t)
            print "Normal dot:\t%imin %.2fsec" %divmod(best, 60)
            standard_times.append(best)

            best = np.inf
            for _ in xrange(reps):
                tic = time.time()
                res2 = blockwise_dot(A, B, max_elements=max_elements)
                t = time.time() - tic
                best = min(best, t)
            print "Block-wise dot:\t%imin %.2fsec" %divmod(best, 60)
            blockwise_times.append(best)

            diff = np.linalg.norm(res1 - res2)
            print "L2 norm of difference:\t%g" %diff
            differences.append(diff)

        del A, B
        del res1, res2
        os.remove(fpath)

    return (np.array(standard_times), np.array(blockwise_times), 
            np.array(differences))

if __name__ == '__main__':
    n = np.logspace(2,5,4,base=2)
    standard_times, blockwise_times, differences = run_bench(
                                                    n_gigabytes=n,
                                                    max_block_gigabytes=4)

    np.savez('bench_results', standard_times=standard_times, 
             blockwise_times=blockwise_times, differences=differences)
