Efficient dot products of large memory-mapped arrays

Problem description

I'm working with some rather large, dense numpy float arrays that currently reside on disk in PyTables CArrays. I need to be able to perform efficient dot products using these arrays, for example C = A.dot(B), where A is a huge (~1E4 x 3E5 float32) memory-mapped array, and B and C are smaller numpy arrays that are resident in core memory.

What I'm doing at the moment is copying the data into memory-mapped numpy arrays using np.memmap, then calling np.dot directly on the memory-mapped arrays. This works, but I suspect that the standard np.dot (or rather the underlying BLAS functions it calls) is probably not very efficient in terms of the number of I/O operations required in order to compute the result.
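For concreteness, here is a minimal sketch of that approach (the file name, and the assumption that the CArray's contents are available as a raw float32 binary file, are mine for illustration):

import numpy as np

# Hypothetical raw binary dump of the PyTables CArray
A = np.memmap('A.dat', dtype=np.float32, mode='r', shape=(10000, 300000))
B = np.random.randn(300000, 64).astype(np.float32)  # small, in-core

# np.dot hands the memory-mapped buffer straight to BLAS, which pages
# A in from disk on demand
C = A.dot(B)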

I came across an interesting example in this review article. A naive dot product computed using three nested loops, like this:

def naive_dot(A, B, C):
    # A, B and C are all assumed to be square (n, n) arrays
    n = A.shape[0]
    for ii in xrange(n):
        for jj in xrange(n):
            C[ii,jj] = 0
            for kk in xrange(n):
                C[ii,jj] += A[ii,kk]*B[kk,jj]
    return C

requires O(n^3) I/O operations to compute.

However, by processing the arrays in appropriately-sized blocks:

from math import sqrt
import numpy as np

def block_dot(A, B, C, M):
    # choose b so that three b-by-b blocks fit within M elements of core
    # memory; for simplicity this sketch assumes b evenly divides n
    n = A.shape[0]
    b = int(sqrt(M / 3))
    for ii in xrange(0, n, b):
        for jj in xrange(0, n, b):
            C[ii:ii+b,jj:jj+b] = 0
            for kk in xrange(0, n, b):
                # accumulate via a scratch block: naive_dot zeroes its third
                # argument, so passing a view of C would clobber the partial sum
                C[ii:ii+b,jj:jj+b] += naive_dot(A[ii:ii+b,kk:kk+b],
                                                B[kk:kk+b,jj:jj+b],
                                                np.empty((b, b), dtype=C.dtype))
    return C

where M is the maximum number of elements that will fit into core memory, the number of I/O operations is reduced to O(n^3 / sqrt(M)): there are (n/b)^3 block multiplications, each of which reads O(b^2) elements, and with b = sqrt(M/3) this totals O(n^3 / b) = O(n^3 / sqrt(M)).

How smart is np.dot and/or np.memmap? Does calling np.dot perform an I/O-efficient blockwise dot product? Does np.memmap do any fancy caching that would improve the efficiency of this type of operation?

If not, is there some pre-existing library function that performs I/O efficient dot products, or should I try and implement it myself?

I've done some benchmarking with a hand-rolled implementation of np.dot that operates on blocks of the input array, which are explicitly read into core memory. This data at least partially addresses my original question, so I'm posting it as an answer.

Answer

I've implemented a function for applying np.dot to blocks that are explicitly read into core memory from the memory-mapped array:

import numpy as np

def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    while True:
        yield slice(count, count + block_size, 1)
        count += block_size
        if count >= dim_size:
            return

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion. 
    Only blocks of `A` with a maximum size of `max_elements` will be 
    processed simultaneously.
    """

    m,  n = A.shape
    n1, o = B.shape

    if n1 != n:
        raise ValueError('matrices are not aligned')

    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        max_cols = max(1, max_elements // m)
        max_rows = max_elements // max_cols
    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements // n)
        max_cols = max_elements // max_rows

    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')

    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block

    return out
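A quick usage sketch (the file name and shapes are hypothetical; with float32, max_elements = 2**27 corresponds to reading A in 512MB blocks):

A = np.memmap('A.dat', dtype=np.float32, mode='r', shape=(10000, 300000))
B = np.random.randn(300000, 64).astype(np.float32)

# the result is only (10000, 64); just blocks of A are pulled into core
C = blockwise_dot(A, B, max_elements=int(2**27))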

I then did some benchmarking to compare my blockwise_dot function to the normal np.dot function applied directly to a memory-mapped array (see below for the benchmarking script). I'm using numpy 1.9.0.dev-205598b linked against OpenBLAS v0.2.9.rc1 (compiled from source). The machine is a quad-core laptop running Ubuntu 13.10, with 8GB RAM and an SSD, and I've disabled the swap file.

As @Bi Rico predicted, the time taken to compute the dot product is beautifully O(n) with respect to the dimensions of A. Operating on cached blocks of A gives a huge performance improvement over just calling the normal np.dot function on the whole memory-mapped array.

It's surprisingly insensitive to the size of the blocks being processed - there's very little difference between the time taken to process the array in blocks of 1GB, 2GB or 4GB. I conclude that whatever caching np.memmap arrays natively implement, it seems to be very suboptimal for computing dot products.

It's still a bit of a pain to have to manually implement this caching strategy, since my code will probably have to run on machines with different amounts of physical memory, and potentially different operating systems. For that reason I'm still interested in whether there are ways to control the caching behaviour of memory-mapped arrays in order to improve the performance of np.dot.
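One stopgap, sketched here as my own addition rather than anything built into numpy, is to derive max_elements from the machine's physical RAM at runtime. The SC_PAGE_SIZE and SC_PHYS_PAGES sysconf names are POSIX-specific, hence the conservative fallback:

import os

def default_max_elements(itemsize=4, fraction=0.5):
    """Budget a fraction of physical RAM, expressed in array elements."""
    try:
        total_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
        return int(fraction * total_bytes / itemsize)
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing, or doesn't know these names, on this platform
        return int(2**27)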

I noticed some odd memory handling behaviour as I was running the benchmarks - when I called np.dot on the whole of A I never saw the resident set size of my Python process exceed about 3.8GB, even though I have about 7.5GB of RAM free. This leads me to suspect that there is some limit imposed on the amount of physical memory an np.memmap array is allowed to occupy - I had previously assumed that it would use whatever RAM the OS allows it to grab. In my case it might be very beneficial to be able to increase this limit.
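In case it's useful for reproducing this, one way to snapshot the peak resident set size in-process is via the resource module (a sketch; on Linux ru_maxrss is reported in kilobytes):

import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print "peak RSS: %.2f GB" % (peak_kb / float(2 ** 20))  # Linux: kB -> GB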

Does anyone have any further insight into the caching behaviour of np.memmap arrays that would help to explain this?

The benchmarking script (it reuses _block_slices and blockwise_dot from above):

import os
import time

import numpy as np

def generate_random_mmarray(shape, fp, max_elements):
    """Fill a memory-mapped array with random values, block by block."""
    A = np.memmap(fp, dtype=np.float32, mode='w+', shape=shape)
    max_rows = max(1, max_elements // shape[1])
    max_cols = max_elements // max_rows
    for rr in _block_slices(shape[0], max_rows):
        for cc in _block_slices(shape[1], max_cols):
            A[rr, cc] = np.random.randn(*A[rr, cc].shape)
    return A

def run_bench(n_gigabytes=np.array([16]), max_block_gigabytes=6, reps=3,
              fpath='temp_array'):
    """
    time C = A * B, where A is a big (n, n) memory-mapped array, and B and C are
    (n, o) arrays resident in core memory
    """

    standard_times = []
    blockwise_times = []
    differences = []
    nbytes = n_gigabytes * 2 ** 30
    o = 64

    # float32 elements
    max_elements = int((max_block_gigabytes * 2 ** 30) / 4)

    for nb in nbytes:

        # float32 elements
        n = int(np.sqrt(nb / 4))

        with open(fpath, 'wb+') as f:
            A = generate_random_mmarray((n, n), f, (max_elements / 2))
            B = np.random.randn(n, o).astype(np.float32)

            print "\n" + "-"*60
            print "A: %s\t(%i bytes)" %(A.shape, A.nbytes)
            print "B: %s\t\t(%i bytes)" %(B.shape, B.nbytes)

            best = np.inf
            for _ in xrange(reps):
                tic = time.time()
                res1 = np.dot(A, B)
                t = time.time() - tic
                best = min(best, t)
            print "Normal dot:\t%imin %.2fsec" %divmod(best, 60)
            standard_times.append(best)

            best = np.inf
            for _ in xrange(reps):
                tic = time.time()
                res2 = blockwise_dot(A, B, max_elements=max_elements)
                t = time.time() - tic
                best = min(best, t)
            print "Block-wise dot:\t%imin %.2fsec" %divmod(best, 60)
            blockwise_times.append(best)

            diff = np.linalg.norm(res1 - res2)
            print "L2 norm of difference:\t%g" %diff
            differences.append(diff)

        del A, B
        del res1, res2
        os.remove(fpath)

    return (np.array(standard_times), np.array(blockwise_times), 
            np.array(differences))

if __name__ == '__main__':
    n = np.logspace(2,5,4,base=2)
    standard_times, blockwise_times, differences = run_bench(
                                                    n_gigabytes=n,
                                                    max_block_gigabytes=4)

    np.savez('bench_results', standard_times=standard_times, 
             blockwise_times=blockwise_times, differences=differences)
