How to split and parse a big text file in python in a memory-efficient way?
Problem description
I have quite a big text file to parse. The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
In other words:

- A separator: step #
- Headers of known length (in lines, not bytes)
- Data 3-dimensional shape: nz, ny, nx
- Data: Fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert them to floats, put them in a numpy array and ndarray.reshape them to the given shape.
I've already done a bit of programming... The main idea is to

- first get the offsets of each separator ("step X"),
- skip the nX (n1, n2, ...) header lines + 1 to get to the data,
- read bytes from there up to the next separator (a sketch of this plan follows the list).
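A minimal sketch of that plan for a single block, assuming an offset that points at the start of its "step" line; the function name read_dataset and the n_headers and next_offset arguments are illustrative placeholders, not code from the original post:

import numpy as np

def read_dataset(f_in, offset, n_headers, next_offset=None):
    # Jump to the "step X" line of one block, skip it plus its header
    # lines, read the shape line, then collect floats until the next
    # separator (or until EOF when next_offset is None).
    f_in.seek(offset)
    for _ in range(n_headers + 1):  # "step X" line + n_headers lines
        f_in.readline()
    shape = tuple(int(n) for n in f_in.readline().split())  # nz, ny, nx
    values = []
    while next_offset is None or f_in.tell() < next_offset:
        line = f_in.readline()
        if not line:
            break
        values.extend(float(x) for x in line.split())
    return np.array(values).reshape(shape)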
I wanted to avoid regex at first since these would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using the file.tell() method to get the separator positions:
[file.tell() - len(sep) for line in file if sep in line]
The problem is twofold:

- For smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in loops, neither with an explicit file.readline() nor with the implicit for line in file (I tried both). I don't know, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
- len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
Does anyone know how I should parse that?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
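Once the offsets are correct, browsing costs a single seek per dataset; a hedged illustration using the read_dataset sketch above (the header count is a placeholder):

with open("myfile") as f_in:
    # 10th dataset: jump straight to its recorded offset, no rescanning
    data = read_dataset(f_in, offsets[9], n_headers=4, next_offset=offsets[10])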
sep = "step "
with open("myfile") as f_in:
offsets = [fin.tell() for line in fin if sep in line]
As I said, this works in the simple example, but not on the big file.
New test:

sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print line
            offsets.append(f_in.tell())
The lines printed do correspond to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory, and as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
Answer
I got the answer: for-loops on a file and tell() do not get along very well, just like mixing for i in file and file.readline() raises an error.
So, use file.tell() only together with file.readline() or file.read().
Never use:

for line in file:
    [do stuff]
    offset = file.tell()
This is really a shame, but that's the way it is.
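Following that rule, the separator scan becomes an explicit readline() loop; a minimal sketch, where tell() is called before each readline() so the recorded offset is the start of the matching line itself (which also removes the need for the - len(sep) correction):

sep = "step "
offsets = []
with open("myfile") as f_in:
    while True:
        pos = f_in.tell()        # position of the line about to be read
        line = f_in.readline()
        if not line:             # empty string means end of file
            break
        if sep in line:
            offsets.append(pos)  # start of the "step" line itself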