How to split and parse a big text file in python in a memory-efficient way?
Problem description
I have quite a big text file to parse. The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
In other words:

- A separator: step #
- Headers of known length (in lines, not bytes)
- Data 3-dimensional shape: nz, ny, nx
- Data: Fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert them to floats, put them in a numpy array and ndarray.reshape them to the given shape.
I've already done a bit of programming... The main idea is to

- first get the offsets of each separator ("step X"),
- skip the nX (n1, n2, ...) header lines + 1 to get to the data,
- read bytes from there up to the next separator (a sketch of this plan follows the list).
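A minimal sketch of that plan for a single block, assuming an offset that points at the start of its "step" line; the function name read_dataset and the n_headers and next_offset arguments are illustrative placeholders, not code from the original post:

import numpy as np

def read_dataset(f_in, offset, n_headers, next_offset=None):
    # Jump to the "step X" line of one block, skip it plus its header
    # lines, read the shape line, then collect floats until the next
    # separator (or until EOF when next_offset is None).
    f_in.seek(offset)
    for _ in range(n_headers + 1):  # "step X" line + n_headers lines
        f_in.readline()
    shape = tuple(int(n) for n in f_in.readline().split())  # nz, ny, nx
    values = []
    while next_offset is None or f_in.tell() < next_offset:
        line = f_in.readline()
        if not line:
            break
        values.extend(float(x) for x in line.split())
    return np.array(values).reshape(shape)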
I wanted to avoid regex at first since these would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using the file.tell() method to get the separator positions:
[file.tell() - len(sep) for line in file if sep in line]
The problem is twofold:

- For smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in loops, neither with an explicit file.readline() nor with the implicit for line in file (I tried both). I don't know, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
- len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
Does anyone know how I should parse that?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
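Once the offsets are correct, browsing costs a single seek per dataset; a hedged illustration using the read_dataset sketch above (the header count is a placeholder):

with open("myfile") as f_in:
    # 10th dataset: jump straight to its recorded offset, no rescanning
    data = read_dataset(f_in, offsets[9], n_headers=4, next_offset=offsets[10])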
sep = "step "
with open("myfile") as f_in:
offsets = [fin.tell() for line in fin if sep in line]
As I said, this works in the simple example, but not on the big file.
New test:

sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print line
            offsets.append(f_in.tell())
The lines printed do correspond to the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory, and as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
Answer
I got the answer: for-loops on a file and tell() do not get along very well, just like mixing for i in file and file.readline() raises an error.
So, use file.tell() only together with file.readline() or file.read().
Never use:

for line in file:
    [do stuff]
    offset = file.tell()
This is really a shame, but that's the way it is.
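Following that rule, the separator scan becomes an explicit readline() loop; a minimal sketch, where tell() is called before each readline() so the recorded offset is the start of the matching line itself (which also removes the need for the - len(sep) correction):

sep = "step "
offsets = []
with open("myfile") as f_in:
    while True:
        pos = f_in.tell()        # position of the line about to be read
        line = f_in.readline()
        if not line:             # empty string means end of file
            break
        if sep in line:
            offsets.append(pos)  # start of the "step" line itself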