Python and memory-efficient way of importing 2D data


Problem description

I'm trying to run a few scripts analyzing data with Python, and I was quickly surprised by how much RAM they take:

My script reads two columns of integers from a file. It imports them in the following way:

import numpy as N
from sys import argv

infile = argv[1]
data = N.loadtxt(infile, dtype=N.int32)  # infile is the input file

For a file with almost 8 million lines, it takes around 1.5 GB of RAM (at this stage, all it does is import the data).

I tried running a memory profiler on it, which gave me:

 5   17.664 MiB    0.000 MiB   @profile
 6                             def func():
 7   17.668 MiB    0.004 MiB    infile = argv[1]
 8  258.980 MiB  241.312 MiB    data = N.loadtxt(infile,dtype=N.int32)

So 250 MB for the data, far from the 1.5 GB in memory (what is occupying so much space?).

And when I tried halving that by using int16 instead of int32:

 5   17.664 MiB    0.000 MiB   @profile
 6                             def func():
 7   17.668 MiB    0.004 MiB    infile = argv[1]
 8  229.387 MiB  211.719 MiB    data = N.loadtxt(infile,dtype=N.int16)  

But I'm only saving about a tenth. How come?

I don't know much about memory occupation, but is this normal?

Also, I coded the same thing in C++, storing the data in vector<int> objects, and it only takes 120 MB of RAM.

To me, Python seems to sweep a lot under the rug when it comes to handling memory. What is it doing that inflates the weight of the data? Is it more related to NumPy?
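For scale, the raw array should be far smaller than either figure: an ndarray reports the size of its data buffer via nbytes. A quick check (my own sketch, assuming roughly 8 million rows, not part of the original post):

import numpy as N

data = N.zeros((8000000, 2), dtype=N.int32)
print(data.nbytes / 1024.0 ** 2)  # ~61 MiB: the array itself is cheap;
                                  # the peak comes from loadtxt's temporaries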

Inspired by the answer below, I'm now importing my data in the following way:

import commands  # Python 2 module for running shell commands

infile = argv[1]
output = commands.getoutput("wc -l " + infile)  # use the wc Linux command to count lines, i.e. how much memory to allocate
n_lines = int(output.split(" ")[0])  # the first field is the number of lines
data = N.empty((n_lines, 2), dtype=N.int16)  # pre-allocate the array
datafile = open(infile)
for count, line in enumerate(datafile):  # read line by line
    data[count] = line.split(" ")  # fill the array
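Note that the commands module is Python 2 only (it was removed in Python 3), and shelling out to wc ties the script to Unix. A portable sketch of the same pre-count, using nothing beyond plain file iteration:

def count_lines(path):
    # Stream the file once; memory stays flat regardless of file size
    with open(path) as f:
        return sum(1 for _ in f)

n_lines = count_lines(infile)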

It also works very similarly with multiple files:

infiles = argv[1:]
n_lines = sum(int(commands.getoutput("wc -l " + infile).split(" ")[0]) for infile in infiles)
data = N.empty((n_lines, 2), dtype=N.int16)
i = 0
for infile in infiles:
    datafile = open(infile)
    for line in datafile:
        data[i] = line.split(" ")
        i += 1

The culprit seemed to be numpy.loadtxt; after removing it, my script no longer needs an extravagant amount of memory, and it even runs 2-3 times faster =)

Recommended answer

The loadtxt() method is not memory efficient because it temporarily stores the file contents in a Python list. Here is a short explanation of why Python lists take so much space.

One solution is to create your own implementation for reading the text file, as below:

ncols = 2  # two columns of integers in this example
buffsize = 10000  # increase this for large files
data = N.empty((buffsize, ncols))  # init array with buffsize rows
dataFile = open(infile)

for count, line in enumerate(dataFile):
    if count >= len(data):
        # Grow by another buffer's worth (refcheck=False skips the check
        # that no other references to the array exist)
        data.resize((count + buffsize, ncols), refcheck=False)
    line_values = line.split()  # convert the line into values
    data[count] = line_values

# Trim the array to the number of lines actually read
data.resize((count + 1, ncols), refcheck=False)
dataFile.close()

As we sometimes can't get the line count in advance, I defined a kind of buffering to avoid resizing the array all the time.
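Another list-free route when the length is unknown in advance (my own sketch, not part of the original answer) is numpy.fromiter, which fills the array directly from a generator of values:

import numpy as N

def load_2col(path):
    # Hypothetical helper: stream whitespace-separated integers into a flat
    # int32 array, then view the result as two columns
    with open(path) as f:
        flat = N.fromiter((int(tok) for line in f for tok in line.split()),
                          dtype=N.int32)
    return flat.reshape(-1, 2)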

Note: at first, I came up with a solution using numpy.append. But as pointed out in the comments, append is also inefficient, since it makes a copy of the array contents.
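To see why, note that numpy.append never grows an array in place; it always allocates a fresh one (a quick check, not from the original answer):

import numpy as N

a = N.arange(3)
b = N.append(a, [3])  # allocates and returns a brand-new array
print(b)  # [0 1 2 3]
print(a)  # [0 1 2] -- the original is untouched
# Appending in a loop therefore copies the whole array on every iteration,
# O(n^2) work in total instead of O(n).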
