How can I efficiently read and write files that are too large to fit in memory?


Question

I am trying to calculate the cosine similarity of 100,000 vectors, and each of these vectors has 200,000 dimensions.

From reading other questions I know that memmap, PyTables and h5py are my best bets for handling this kind of data, and I am currently working with two memmaps: one for reading the vectors, the other for storing the matrix of cosine similarities.

Here is my code:

import numpy as np
import scipy.spatial.distance as dist

xdim = 200000
ydim = 100000

# Read-only memmap of the input vectors (one vector per column) and a
# writable memmap for the output matrix of cosine distances.
wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(xdim, ydim))
dmat = np.memmap('outputfile', dtype='d', mode='readwrite', shape=(ydim, ydim))

# Fill the upper triangle, one pair of vectors at a time.
for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
        dmat.flush()

Currently, htop reports that I am using 224G of VIRT memory and 91.2G of RES memory, which is climbing steadily. It seems to me as if, by the end of the process, the entire output matrix will be stored in memory, which is something I'm trying to avoid.

QUESTION: Is this a correct usage of memmaps? Am I writing to the output file in a memory-efficient manner, meaning that only the necessary parts of the input and output files (i.e. dmat[i,j] and wmat[:,i/j]) are stored in memory?

If not, what did I do wrong, and how can I fix this?

Thanks for any advice!

I just realized that htop is reporting total system memory usage at 12G, so it seems it is working after all... anyone out there who can enlighten me? RES is now at 111G...

The memmap is created from a 1D array consisting of lots and lots of long decimals quite close to 0, which is shaped to the desired dimensions. The memmap then looks like this.

memmap([[  9.83721223e-03,   4.42584107e-02,   9.85033578e-03, ...,
          -2.30691545e-07,  -1.65070799e-07,   5.99395837e-08],
        [  2.96711345e-04,  -3.84307391e-04,   4.92968462e-07, ...,
          -3.41317722e-08,   1.27959347e-09,   4.46846438e-08],
        [  1.64766260e-03,  -1.47337747e-05,   7.43660202e-07, ...,
           7.50395136e-08,  -2.51943163e-09,   1.25393555e-07],
        ...,
        [ -1.88709000e-04,  -4.29454722e-06,   2.39720287e-08, ...,
          -1.53058717e-08,   4.48678211e-03,   2.48127260e-07],
        [ -3.34207882e-04,  -4.60275148e-05,   3.36992876e-07, ...,
          -2.30274532e-07,   2.51437794e-09,   1.25837564e-01],
        [  9.24923862e-04,  -1.59552854e-03,   2.68354822e-07, ...,
          -1.08862665e-05,   1.71283316e-07,   5.66851420e-01]])

Answer

In terms of memory usage, there's nothing particularly wrong with what you're doing at the moment. Memmapped arrays are handled at the level of the OS - data to be written is usually held in a temporary buffer, and only committed to disk when the OS deems it necessary. Your OS should never allow you to run out of physical memory before flushing the write buffer.

I'd advise against calling flush on every iteration since this defeats the purpose of letting your OS decide when to write to disk in order to maximise efficiency. At the moment you're only writing individual float values at a time.
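
A minimal adjustment, assuming nothing else about the question's setup changes, is to drop the per-element flush and flush once after all the writes; the OS still writes dirty pages out in the background in the meantime:

for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
dmat.flush()  # one explicit flush after all writes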

In terms of IO and CPU efficiency, operating on a single line at a time is almost certainly suboptimal. Reads and writes are generally quicker for large, contiguous blocks of data, and likewise your calculation will probably be much faster if you can process many lines at once using vectorization. The general rule of thumb is to process as big a chunk of your array as will fit in memory (including any intermediate arrays that are created during your computation).

Here's an example showing how much you can speed up operations on memmapped arrays by processing them in appropriately-sized chunks.
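
The linked example itself isn't reproduced on this page, so here is a minimal sketch of the chunked approach under stated assumptions: it reuses wmat and dmat from the question, `chunk` is a hypothetical block size to tune so that two (xdim, chunk) blocks plus intermediates fit in RAM, and the input is assumed to contain no all-zero vectors. Since dist.cosine returns a distance (1 minus the similarity), the blockwise matrix product is converted the same way:

chunk = 500  # hypothetical block size; tune to available RAM

for i in range(0, ydim, chunk):
    cols_i = np.array(wmat[:, i:i + chunk])   # copy one block of vectors into RAM
    norms_i = np.linalg.norm(cols_i, axis=0)
    for j in range(i, ydim, chunk):
        cols_j = np.array(wmat[:, j:j + chunk])
        norms_j = np.linalg.norm(cols_j, axis=0)
        # cosine similarities for every pair across the two blocks at once
        sims = (cols_i.T @ cols_j) / np.outer(norms_i, norms_j)
        # store as distances to match dist.cosine (1 - similarity)
        dmat[i:i + chunk, j:j + chunk] = 1.0 - sims
dmat.flush()  # single flush at the end

Each write now covers a contiguous chunk-by-chunk block instead of a single float, and the diagonal blocks redundantly fill their own lower triangle, which costs far less than the per-element loop did.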

Another thing that can make a huge difference is the memory layout of your input and output arrays. By default, np.memmap gives you a C-contiguous (row-major) array. Accessing wmat by column will therefore be very inefficient, since you're addressing non-adjacent locations on disk. You would be much better off if wmat was F-contiguous (column-major) on disk, or if you were accessing it by row.
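
As a sketch of that layout fix (this assumes the bytes in 'inputfile' really were written column-major; numpy's order argument only changes how existing bytes are indexed, it does not rewrite the file):

# Column-major view: wmat[:, i] is now one contiguous run of bytes on disk.
wmat = np.memmap('inputfile', dtype='d', mode='r',
                 shape=(xdim, ydim), order='F')

# Equivalent alternative: keep C order but store one vector per *row*,
# then read rows instead of columns.
# wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(ydim, xdim))
# vec_i = wmat[i, :]   # contiguous read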

The same general advice applies to using HDF5 instead of memmaps, although bear in mind that with HDF5 you will have to handle all the memory management yourself.
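
For comparison, a rough h5py sketch (the file name, dataset name, and chunk shape here are made up for illustration); HDF5's chunks argument fixes the on-disk tile size, and you read and write whole tiles explicitly rather than leaning on the OS page cache:

import h5py

with h5py.File('similarities.h5', 'w') as f:
    out = f.create_dataset('cosine_distance', shape=(ydim, ydim),
                           dtype='d', chunks=(500, 500))
    # Fill block by block, exactly as in the chunked memmap sketch above:
    # out[i:i + chunk, j:j + chunk] = 1.0 - sims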
