How can I efficiently read and write files that are too large to fit in memory?


Question

I am trying to calculate the cosine similarity of 100,000 vectors, and each of these vectors has 200,000 dimensions.

From reading other questions I know that memmap, PyTables and h5py are my best bets for handling this kind of data, and I am currently working with two memmaps: one for reading the vectors, the other for storing the matrix of cosine similarities.

Here is my code:

import numpy as np
import scipy.spatial.distance as dist

xdim = 200000
ydim = 100000

# Read-only memmap of the input vectors (one vector per column) and a
# writable memmap for the output matrix of cosine distances.
wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(xdim, ydim))
dmat = np.memmap('outputfile', dtype='d', mode='readwrite', shape=(ydim, ydim))

# Fill the upper triangle, one pair of vectors at a time.
for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
        dmat.flush()

Currently, htop reports that I am using 224G of VIRT memory and 91.2G of RES memory, which is climbing steadily. It seems to me as if, by the end of the process, the entire output matrix will be stored in memory, which is something I'm trying to avoid.

QUESTION: Is this a correct usage of memmaps? Am I writing to the output file in a memory-efficient manner, meaning that only the necessary parts of the input and output files (i.e. dmat[i,j] and wmat[:,i/j]) are stored in memory?

If not, what did I do wrong, and how can I fix this?

Thanks for any advice!

I just realized that htop is reporting total system memory usage at 12G, so it seems it is working after all... anyone out there who can enlighten me? RES is now at 111G...

The memmap is created from a 1D array consisting of lots and lots of long decimals quite close to 0, which is shaped to the desired dimensions. The memmap then looks like this.

memmap([[  9.83721223e-03,   4.42584107e-02,   9.85033578e-03, ...,
          -2.30691545e-07,  -1.65070799e-07,   5.99395837e-08],
        [  2.96711345e-04,  -3.84307391e-04,   4.92968462e-07, ...,
          -3.41317722e-08,   1.27959347e-09,   4.46846438e-08],
        [  1.64766260e-03,  -1.47337747e-05,   7.43660202e-07, ...,
           7.50395136e-08,  -2.51943163e-09,   1.25393555e-07],
        ...,
        [ -1.88709000e-04,  -4.29454722e-06,   2.39720287e-08, ...,
          -1.53058717e-08,   4.48678211e-03,   2.48127260e-07],
        [ -3.34207882e-04,  -4.60275148e-05,   3.36992876e-07, ...,
          -2.30274532e-07,   2.51437794e-09,   1.25837564e-01],
        [  9.24923862e-04,  -1.59552854e-03,   2.68354822e-07, ...,
          -1.08862665e-05,   1.71283316e-07,   5.66851420e-01]])

Answer

In terms of memory usage, there's nothing particularly wrong with what you're doing at the moment. Memmapped arrays are handled at the level of the OS - data to be written is usually held in a temporary buffer, and only committed to disk when the OS deems it necessary. Your OS should never allow you to run out of physical memory before flushing the write buffer.

I'd advise against calling flush on every iteration since this defeats the purpose of letting your OS decide when to write to disk in order to maximise efficiency. At the moment you're only writing individual float values at a time.
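
A minimal adjustment, assuming nothing else about the question's setup changes, is to drop the per-element flush and flush once after all the writes; the OS still writes dirty pages out in the background in the meantime:

for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
dmat.flush()  # one explicit flush after all writes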

In terms of IO and CPU efficiency, operating on a single line at a time is almost certainly suboptimal. Reads and writes are generally quicker for large, contiguous blocks of data, and likewise your calculation will probably be much faster if you can process many lines at once using vectorization. The general rule of thumb is to process as big a chunk of your array as will fit in memory (including any intermediate arrays that are created during your computation).

Here's an example showing how much you can speed up operations on memmapped arrays by processing them in appropriately-sized chunks.
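
The linked example itself isn't reproduced on this page, so here is a minimal sketch of the chunked approach under stated assumptions: it reuses wmat and dmat from the question, `chunk` is a hypothetical block size to tune so that two (xdim, chunk) blocks plus intermediates fit in RAM, and the input is assumed to contain no all-zero vectors. Since dist.cosine returns a distance (1 minus the similarity), the blockwise matrix product is converted the same way:

chunk = 500  # hypothetical block size; tune to available RAM

for i in range(0, ydim, chunk):
    cols_i = np.array(wmat[:, i:i + chunk])   # copy one block of vectors into RAM
    norms_i = np.linalg.norm(cols_i, axis=0)
    for j in range(i, ydim, chunk):
        cols_j = np.array(wmat[:, j:j + chunk])
        norms_j = np.linalg.norm(cols_j, axis=0)
        # cosine similarities for every pair across the two blocks at once
        sims = (cols_i.T @ cols_j) / np.outer(norms_i, norms_j)
        # store as distances to match dist.cosine (1 - similarity)
        dmat[i:i + chunk, j:j + chunk] = 1.0 - sims
dmat.flush()  # single flush at the end

Each write now covers a contiguous chunk-by-chunk block instead of a single float, and the diagonal blocks redundantly fill their own lower triangle, which costs far less than the per-element loop did.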

Another thing that can make a huge difference is the memory layout of your input and output arrays. By default, np.memmap gives you a C-contiguous (row-major) array. Accessing wmat by column will therefore be very inefficient, since you're addressing non-adjacent locations on disk. You would be much better off if wmat was F-contiguous (column-major) on disk, or if you were accessing it by row.
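
As a sketch of that layout fix (this assumes the bytes in 'inputfile' really were written column-major; numpy's order argument only changes how existing bytes are indexed, it does not rewrite the file):

# Column-major view: wmat[:, i] is now one contiguous run of bytes on disk.
wmat = np.memmap('inputfile', dtype='d', mode='r',
                 shape=(xdim, ydim), order='F')

# Equivalent alternative: keep C order but store one vector per *row*,
# then read rows instead of columns.
# wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(ydim, xdim))
# vec_i = wmat[i, :]   # contiguous read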

The same general advice applies to using HDF5 instead of memmaps, although bear in mind that with HDF5 you will have to handle all the memory management yourself.
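
For comparison, a rough h5py sketch (the file name, dataset name, and chunk shape here are made up for illustration); HDF5's chunks argument fixes the on-disk tile size, and you read and write whole tiles explicitly rather than leaning on the OS page cache:

import h5py

with h5py.File('similarities.h5', 'w') as f:
    out = f.create_dataset('cosine_distance', shape=(ydim, ydim),
                           dtype='d', chunks=(500, 500))
    # Fill block by block, exactly as in the chunked memmap sketch above:
    # out[i:i + chunk, j:j + chunk] = 1.0 - sims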
