How to read a super huge file into numpy array N lines at a time
Question
I have a huge file (around 30 GB); each line contains the coordinates of a point on a 2D surface. I need to load the file into a NumPy array (points = np.empty((0, 2))) and apply scipy.spatial.ConvexHull to it. Since the file is so large, I can't load it into memory all at once, so I want to load it in batches of N lines, apply scipy.spatial.ConvexHull to each small part, and then load the next N rows. What's an efficient way to do this?
I found that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object that yields the lines of the file one at a time and is meant to be consumed in a loop, so I'm not sure how to convert lines_gen into a NumPy array efficiently:
from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)
My input file:
0.989703 1
0 0
0.0102975 0
0.0102975 0
1 1
0.989703 1
1 1
0 0
0.0102975 0
0.989703 1
0.979405 1
0 0
0.020595 0
0.020595 0
1 1
0.979405 1
1 1
0 0
0.020595 0
0.979405 1
0.969108 1
...
...
...
0 0
0.0308924 0
0.0308924 0
1 1
0.969108 1
1 1
0 0
0.0308924 0
0.969108 1
0.95881 1
0 0
Answer
With your data, I can read it in 5-line chunks like this:
In [182]: from itertools import islice
     ...: with open(input, 'r') as infile:
     ...:     while True:
     ...:         gen = islice(infile, N)
     ...:         arr = np.genfromtxt(gen, dtype=None)
     ...:         print arr
     ...:         if arr.shape[0] < N:
     ...:             break
     ...:
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]
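The loop above bounds the memory used for reading, but the question also wants the hull computed piecewise. One way to do that (not part of the answer above, just a sketch) relies on the fact that the convex hull of a point set equals the hull of the union of per-chunk hull vertices, so you only need to carry the surviving vertices forward. A minimal sketch assuming non-degenerate input; the name chunked_hull_points and the StringIO stand-in for the real file are hypothetical:

```python
import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

def chunked_hull_points(infile, n_lines):
    """Read n_lines at a time; keep only the hull vertices seen so far."""
    kept = np.empty((0, 2))
    while True:
        lines = list(islice(infile, n_lines))
        if not lines:
            break
        chunk = np.genfromtxt(lines).reshape(-1, 2)  # a 1-line chunk comes back 1-D
        pts = np.vstack([kept, chunk])
        if len(pts) >= 3:
            # note: ConvexHull raises QhullError on degenerate (e.g. all-collinear) input
            pts = pts[ConvexHull(pts).vertices]
        kept = pts
    return kept
```

Because kept never grows past the current hull's vertex count plus one chunk, memory stays bounded no matter how large the file is.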
The same thing read as one chunk is:
In [183]: with open(input, 'r') as infile:
     ...:     arr = np.genfromtxt(infile, dtype=None)
     ...:
In [184]: arr
Out[184]:
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
(0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
(0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
(0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
(0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
(0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
(0.95881, 1), (0.0, 0)],
dtype=[('f0', '<f8'), ('f1', '<i4')])
(This is Python 2.7; in Python 3 there's a bytes/string issue I need to work around.)
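With a reasonably recent NumPy, the same chunked loop works on text-mode files directly in Python 3; materializing the slice into a list first also avoids handing genfromtxt an empty iterable at end of file. A hedged Python 3 sketch — read_chunks and the StringIO stand-in for the real file are made-up names, not from the answer:

```python
import numpy as np
from itertools import islice
from io import StringIO

def read_chunks(infile, n):
    """Yield the file n lines at a time as (rows, 2) float arrays."""
    while True:
        lines = list(islice(infile, n))
        if not lines:
            return
        yield np.genfromtxt(lines).reshape(-1, 2)  # a lone last line comes back 1-D

# usage with an in-memory stand-in for the 30 GB file
f = StringIO("0.989703 1\n0 0\n0.0102975 0\n0.0102975 0\n1 1\n0.989703 1\n1 1\n")
for arr in read_chunks(f, 5):
    print(arr.shape)  # prints (5, 2) then (2, 2)
```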