Parsing large (20GB) text file with python - reading in 2 lines as 1

Question

I'm parsing a 20 GB file and writing the lines that meet a certain condition to another file. Occasionally, however, Python reads in two lines at once and concatenates them.

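inputFileHandle = open(inputFileName, 'r')
outputFileHandle = open(outputFileName, 'w')  # outputFileName: assumed to be defined elsewhere
lstIgnoredRows = []  # row numbers of the lines that were skipped

row = 0
for line in inputFileHandle:
    row = row + 1
    if line_meets_condition(line):  # placeholder for the actual filter
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)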

I've checked the line endings in the source file, and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The first anomaly occurs at around the 4 GB mark in the file.

Recommended answer

A quick Google search for "python reading files larger than 4gb" yields many results. See here for one such example, and another one which takes over from the first.

It's a bug in Python.

Now, the explanation of the bug. It is not easy to reproduce, because it depends both on the internal FILE buffer size and on the number of chars passed to fread(). In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in the Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668 The problem is in the call to SetFilePointer(), which is used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?] At this point, the function thinks that the next read() will return the LF, but it won't, because the file pointer was not moved back.
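
Why the trouble starts around the 4 GB mark can be illustrated with a little arithmetic. The sketch below does not reproduce the CRT code; it only shows that a file offset of 4 GiB or more no longer fits in an unsigned 32-bit value. The helper names fits_in_dword and truncate_to_dword are made up for this illustration.

```python
# Illustration only: an unsigned 32-bit DWORD cannot hold offsets of 4 GiB or more.
DWORD_MAX = 0xFFFFFFFF  # largest value a 32-bit unsigned integer can hold


def fits_in_dword(offset):
    """Return True if the file offset can be represented in 32 bits (hypothetical helper)."""
    return 0 <= offset <= DWORD_MAX


def truncate_to_dword(offset):
    """Return what is left of the offset if only the low 32 bits are kept (hypothetical helper)."""
    return offset & DWORD_MAX


four_gib = 4 * 1024 ** 3  # 4294967296 bytes

print(fits_in_dword(four_gib - 1))       # True: last offset below 4 GiB
print(fits_in_dword(four_gib))           # False: first offset that overflows
print(truncate_to_dword(four_gib + 10))  # 10: the high bits are lost
```

This matches the question's observation that the first anomaly shows up at roughly 4 GB into the file.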

Workaround:

Note that Python 3.x is not affected (raw files are always opened in binary mode, and CRLF translation is done by Python itself); with 2.7, you may use io.open().
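
A minimal sketch of that workaround, applied to the loop from the question, might look like the following. The function name filter_lines, the keep callback and the UTF-8 encoding are assumptions made for illustration; substitute the file's real encoding and the real condition.

```python
import io


def filter_lines(input_path, output_path, keep):
    """Copy the lines for which keep(line) is true.

    Returns the 1-based numbers of the lines that were skipped.
    filter_lines and keep are hypothetical names, not part of the original code.
    """
    ignored_rows = []
    # io.open() handles newlines in Python itself rather than in the C runtime.
    # encoding='utf-8' is an assumption; newline='\n' matches the LF-only
    # line endings described in the question and disables any translation.
    with io.open(input_path, 'r', encoding='utf-8', newline='\n') as src, \
            io.open(output_path, 'w', encoding='utf-8', newline='\n') as dst:
        for row, line in enumerate(src, start=1):
            if keep(line):
                dst.write(line)
            else:
                ignored_rows.append(row)
    return ignored_rows
```

It could be called as filter_lines(inputFileName, outputFileName, line_meets_condition). On Python 3 the built-in open() is the same function as io.open(), so the sketch runs unchanged there.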
