Python and a memory-efficient way of importing 2D data
Problem description
I'm trying to run a few scripts that analyze data with Python, and I was quickly surprised by how much RAM they take:
My script reads two columns of integers from a file. It imports them in the following way:
import numpy as N
from sys import argv
infile = argv[1]
data = N.loadtxt(infile, dtype=N.int32)  # infile is the input file
For a file with almost 8 million lines, it takes around 1.5 GB of RAM (at this stage, all it does is import the data).
I tried running a memory profiler on it, which gives:
5   17.664 MiB    0.000 MiB   @profile
6                             def func():
7   17.668 MiB    0.004 MiB       infile = argv[1]
8  258.980 MiB  241.312 MiB       data = N.loadtxt(infile,dtype=N.int32)
So, 250 MB for the data, far from the 1.5 GB in memory (what is occupying so much space?),
and when I tried halving it by using int16 instead of int32:
5   17.664 MiB    0.000 MiB   @profile
6                             def func():
7   17.668 MiB    0.004 MiB       infile = argv[1]
8  229.387 MiB  211.719 MiB       data = N.loadtxt(infile,dtype=N.int16)
But I'm only saving about a tenth; how come?
I don't know much about memory occupation, but is this normal?
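For scale (my arithmetic, assuming 8 million rows as in the question): the raw array itself is small. Two columns of 8 million int32 values are about 61 MiB, and int16 halves that to about 30 MiB, which matches the roughly 30 MB saved above; the rest of the footprint comes from loadtxt's temporaries.

import numpy as N

n_rows = 8 * 10**6  # roughly the file in the question
print(N.zeros((n_rows, 2), dtype=N.int32).nbytes / 2.0**20)  # ~61.0 MiB
print(N.zeros((n_rows, 2), dtype=N.int16).nbytes / 2.0**20)  # ~30.5 MiB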
Also, I coded the same thing in C++, storing the data in vector<int> objects, and it takes only 120 MB of RAM.
To me, Python seems to sweep a lot under the rug when it comes to handling memory. What is it doing that inflates the weight of the data? Is it more related to NumPy?
Inspired by the answer below, I'm now importing my data in the following way:
import commands  # Python 2 module for running shell commands

infile = argv[1]
output = commands.getoutput("wc -l " + infile)  # use the wc Linux command to read the number of lines, i.e. how much memory to allocate
n_lines = int(output.split(" ")[0])  # the first field is the line count
data = N.empty((n_lines, 2), dtype=N.int16)  # allocate the full array up front
datafile = open(infile)
for count, line in enumerate(datafile):  # read line by line
    data[count] = line.split(" ")  # fill the array row by row
It also works very similarly with multiple files:
infiles = argv[1:]
n_lines = sum(int(commands.getoutput("wc -l " + infile).split(" ")[0]) for infile in infiles)
i = 0
data = N.empty((n_lines, 2), dtype=N.int16)
for infile in infiles:
    datafile = open(infile)
    for line in datafile:
        data[i] = line.split(" ")
        i += 1
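A note on portability (my aside, not from the original post): the commands module is Python 2 only, and shelling out to wc assumes a Unix system. A minimal pure-Python line count would do the same job:

def count_lines(path):
    # stream through the file once, counting lines without an external tool
    with open(path) as f:
        return sum(1 for _ in f)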
The culprit seemed to be numpy.loadtxt; after removing it, my script now doesn't need an extravagant amount of memory, and it even runs 2-3 times faster =)
Answer
The loadtxt() method is not memory efficient because it uses a Python list to temporarily store the file contents. Here is a short explanation of why Python lists take so much space.
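For a sense of that list overhead, a quick sketch (figures are approximate, for a 64-bit CPython): a list stores one pointer per element, and every Python int is a separate boxed object, so a million integers cost far more as a list than as a NumPy array.

import sys
import numpy as N

xs = list(range(10**6))
print(sys.getsizeof(xs))                  # ~8 MB: one 8-byte pointer per element
print(sum(sys.getsizeof(x) for x in xs))  # ~28 MB on top: each int is a boxed object
print(N.array(xs, dtype=N.int32).nbytes)  # ~4 MB: the same values as a compact array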
One solution is to create your own implementation for reading text files, as below:
buffsize = 10000  # Increase this for large files
ncols = 2  # two columns of integers, as in the question
data = N.empty((buffsize, ncols), dtype=N.int32)  # Init array with buffsize rows
dataFile = open(infile)
for count, line in enumerate(dataFile):
    if count >= len(data):
        data.resize((count + buffsize, ncols), refcheck=False)  # grow by another buffer
    line_values = line.split()  # convert the line into values
    data[count] = line_values
# Fix array size: trim to the rows actually read
data.resize((count + 1, ncols), refcheck=False)
dataFile.close()
As we sometimes can't get the line count in advance, I defined a kind of buffering to avoid resizing the array all the time.
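As a variant of the same idea (my sketch, not part of the original answer): doubling the buffer instead of growing it by a fixed amount keeps the total amount of copying linear in the file size.

import numpy as N

def read_ints(path, ncols=2, dtype=N.int32):  # hypothetical helper
    data = N.empty((4096, ncols), dtype=dtype)
    count = -1
    with open(path) as f:
        for count, line in enumerate(f):
            if count >= len(data):
                data.resize((2 * len(data), ncols), refcheck=False)  # double the capacity
            data[count] = line.split()
    data.resize((count + 1, ncols), refcheck=False)  # trim to the rows actually read
    return data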
Note: at first, I came up with a solution using numpy.append. But as pointed out in the comments, append is also inefficient, since it makes a copy of the array contents.
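A quick illustration of that copying (N.append always returns a brand-new array, so appending row by row in a loop recopies everything on every iteration):

import numpy as N

a = N.arange(3)
b = N.append(a, [3])   # builds a new array and copies a into it
print(b)               # [0 1 2 3]
print(b.base is None)  # True: b shares no memory with a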