How to read a super huge file into a numpy array N lines at a time


Question

I have a huge file (around 30GB); each line holds the coordinates of a point on a 2D surface. I need to load the file into a Numpy array, points = np.empty((0, 2)), and apply scipy.spatial.ConvexHull to it. Since the file is so large, I can't load it into memory all at once, so I want to load it in batches of N lines, apply scipy.spatial.ConvexHull to each small part, and then load the next N rows. What's an efficient way to do it?
I found out that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object, which gives you each line of the file and should be used in a loop, so I am not sure how to convert lines_gen into a Numpy array efficiently.

from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)

My input file:

0.989703    1
0   0
0.0102975   0
0.0102975   0
1   1
0.989703    1
1   1
0   0
0.0102975   0
0.989703    1
0.979405    1
0   0
0.020595    0
0.020595    0
1   1
0.979405    1
1   1
0   0
0.020595    0
0.979405    1
0.969108    1
...
...
...
0   0
0.0308924   0
0.0308924   0
1   1
0.969108    1
1   1
0   0
0.0308924   0
0.969108    1
0.95881 1
0   0

Answer

With your data, I can read it in 5-line chunks like this:

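In [182]: from itertools import islice
import numpy as np
# `input` is the file path and N the chunk size (N = 5 here), as in the question
with open(input,'r') as infile:
    while True:
        gen = islice(infile,N)                # the next N lines of the file
        arr = np.genfromtxt(gen, dtype=None)  # dtype=None infers a type per column
        print arr
        if arr.shape[0]<N:                    # short chunk: end of file reached
            break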
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]

The same data read as a single chunk:

In [183]: with open(input,'r') as infile:
    arr = np.genfromtxt(infile, dtype=None)
   .....:     
In [184]: arr
Out[184]: 
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
       (0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
       (0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
       (0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
       (0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
       (0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
       (0.95881, 1), (0.0, 0)], 
      dtype=[('f0', '<f8'), ('f1', '<i4')])
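
With dtype=None the result is a structured array (fields f0 and f1), not the plain (npoints, 2) float array that scipy.spatial.ConvexHull expects. The following is a minimal Python 3 sketch, not the answer's own code, that yields each chunk as a plain float array instead. The name iter_chunks is hypothetical, and np.loadtxt is swapped in for np.genfromtxt on the assumption that every line holds exactly two whitespace-separated numbers.

```python
from itertools import islice

import numpy as np


def iter_chunks(path, n):
    """Yield the file at `path` as (k, 2) float arrays with k <= n rows each.

    `iter_chunks` is a hypothetical helper, not part of NumPy.
    """
    with open(path, "r") as infile:
        while True:
            lines = list(islice(infile, n))  # at most n lines held in memory
            if not lines:
                # End of file. This also covers a file whose line count is an
                # exact multiple of n, where the last read returns nothing.
                break
            # ndmin=2 keeps a one-line chunk as shape (1, 2) instead of (2,)
            yield np.loadtxt(lines, dtype=float, ndmin=2)
```

Checking for an empty batch before parsing avoids handing NumPy an empty input, and ndmin=2 means the caller always sees a two-dimensional array, even for a final chunk of one line.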

(This is in Python 2.7; in Python 3 there's a bytes/string issue I need to work around.)
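
For the convex hull step the question is aiming at, one possible approach (an addition here, not part of the answer) relies on the fact that the hull of all points seen so far equals the hull of the previous hull's vertices together with the new chunk. Only the hull vertices then need to stay in memory between chunks. The sketch below assumes SciPy is installed and that each chunk is a (k, 2) float array; hull_of_chunks is a hypothetical name. Qhull rejects input that is entirely collinear, so that case is handled with a simple rank check, which is an assumption that may not catch nearly collinear data.

```python
import numpy as np
from scipy.spatial import ConvexHull


def hull_of_chunks(chunks):
    """Return the hull vertices of all points in `chunks`.

    `chunks` is any iterable of (k, 2) float arrays. `hull_of_chunks` is a
    hypothetical helper, not a SciPy function.
    """
    kept = np.empty((0, 2))
    for chunk in chunks:
        # Merge the surviving vertices with the new points. np.unique drops
        # duplicates and sorts the rows lexicographically.
        pts = np.unique(np.vstack([kept, chunk]), axis=0)
        if len(pts) >= 3 and np.linalg.matrix_rank(pts - pts[0]) == 2:
            # General case: keep only the vertices of the current hull.
            kept = pts[ConvexHull(pts).vertices]
        elif len(pts) > 2:
            # All points collinear: the sorted extremes are the segment ends.
            kept = pts[[0, -1]]
        else:
            # Fewer than three distinct points: nothing to reduce yet.
            kept = pts
    return kept
```

The returned array holds the hull vertices of the whole data set, so a final ConvexHull call on it gives the same hull as one call on all points would, provided the data is not degenerate.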
