Unable to read huge (20GB) file from CPython


Question

I have a CPython issue that I cannot understand. It boils down to this: the same code that reads a small text file cannot read even a single line from a 20GB text file.

Some useful information:

  • The smaller file (~1MB) is a subset of the 20GB file (its first 1MB)
  • Both files are text files with lines about 2000 characters wide, separated by CR (\r)

The obvious solution:

f = open(r'filename', 'r')
for line in f:
    print(line)
f.close()

works, but only for the short file. For the big one it hangs forever (or at least for longer than it should take to print the first line).

So I wanted to at least try to read one line, like this:

f = open(r'filename', 'r')
print(f.readline())
f.close()

The situation is similar here: it works instantly for the small file, but for the big one, after a substantial amount of time, it spits out this message:

Traceback (most recent call last):
  File "***", line 16, in <module>
    print(f.readline())
SystemError: ..\Objects\stringobject.c:3902: bad argument to internal function

How the heck should I read a big text file?

Update:

Turns out a human being thinks more clearly after enough sleep ;-). The problem is solved: I had overlooked one sentence in the documentation:

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.

I had simply assumed that universal newlines were 'turned on' by default.

My statement above that

print(f.readline())

was reading just one line was partially false (my bad). Remember I said my small file was created by taking a chunk of the big one? During that operation the line endings changed from CR to CRLF, so for the small file what I saw really was the first line. All of that made me think the problem was not in the line endings.
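One way to catch this kind of mismatch early is to count the line-ending sequences in a sample of each file, read in binary mode so that nothing is translated. The following is a minimal Python 3 sketch, not part of the original post; the function names `count_line_endings` and `sniff_line_endings` and the 1MB default sample size are assumptions chosen for illustration.

```python
def count_line_endings(sample):
    """Count CRLF, lone CR and lone LF sequences in a bytes sample."""
    crlf = sample.count(b"\r\n")
    # Every CRLF contains one CR and one LF, so subtract them out.
    cr = sample.count(b"\r") - crlf
    lf = sample.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def sniff_line_endings(path, sample_size=1 << 20):
    """Read at most sample_size bytes from the start of the file and
    count its line endings (hypothetical helper, not a library API)."""
    with open(path, "rb") as f:
        return count_line_endings(f.read(sample_size))


print(count_line_endings(b"abcdef\rabcdef\rabcdef\r"))
print(count_line_endings(b"abcdef\r\nabcdef\r\n"))
```

Because only the first `sample_size` bytes are read, this stays cheap even on a 20GB file. If the sample happens to end between a `\r` and its `\n`, the lone-CR count can be off by one.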

Thank you all for your time and help.

Recommended answer

Although your "test" only prints one line, that does not mean it is only reading one line from the file. For me, with a \r-delimited test file, I also get only one line of output. However, if I read each line using a for loop, it still prints only one line. And if I call readline() a second time on a multi-line file, it doesn't return any more lines.

Try opening the same file with the 'rU' mode:

f =  open('filename', 'rU')
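The code in this thread is Python 2. As a hedged aside for readers on Python 3: there the 'U' flag is deprecated (and removed in recent releases), and the `newline` argument of the built-in `open()` controls the same behaviour. The sketch below assumes a small hypothetical sample file of five CR-terminated lines and compares `newline="\n"`, which splits only on `\n` much like plain 'r' did here, with `newline=None`, the default, which enables universal newlines.

```python
import os
import tempfile

# Hypothetical sample file: five CR-terminated lines, written in binary
# mode so that no newline translation happens on the way out.
path = os.path.join(tempfile.mkdtemp(), "cr_sample.txt")
with open(path, "wb") as f:
    f.write(b"abcdef\r" * 5)

# newline="\n": only "\n" ends a line, so the whole file is one "line".
with open(path, "r", newline="\n") as f:
    raw_lines = f.readlines()

# newline=None (the Python 3 default): "\r", "\n" and "\r\n" all end a
# line and are translated to "\n" in the returned strings.
with open(path, "r", newline=None) as f:
    universal_lines = f.readlines()

print(len(raw_lines))
print(len(universal_lines))
```

With universal newlines enabled, iterating over the file object with a for loop yields one line at a time, so the same approach should also work for a very large file without reading it all into memory.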

My tests on a file with several lines of \r-delimited text give:

f = open('test.txt','r')  # Opening the "wrong" way
for line in f:
    print line

Output:

abcdef

Then using rU:

f = open('test.txt','rU')
for line in f:
    print line

Output:

abcdef

abcdef

abcdef

abcdef

abcdef


In support of Joran's explanation, this test pretty much shows that the entire file is being loaded and that the carriage return characters cause over-printing when you see only one line of output:

f = open('test.txt','r')     #  Opening the "wrong" way again
for line in f:
    print "XXX{}YYY".format(line)

The output is overprinted...

YYYdefdef
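To see why the terminal ends up showing exactly that string, the over-printing can be emulated in a few lines. This is only an illustrative sketch: it assumes the test file holds five copies of "abcdef", each terminated by \r, and the helper name `render_with_carriage_returns` is hypothetical. It models a simple terminal in which \r moves the cursor back to column 0 and later characters overwrite earlier ones.

```python
def render_with_carriage_returns(text):
    """Emulate how a simple terminal displays text containing '\r'."""
    cells = []  # characters currently visible on the line
    col = 0     # cursor column
    for ch in text:
        if ch == "\r":
            col = 0  # carriage return: back to the start of the line
        else:
            if col < len(cells):
                cells[col] = ch   # overwrite what was printed before
            else:
                cells.append(ch)  # extend the line
            col += 1
    return "".join(cells)


# Assumed file contents: five CR-terminated copies of "abcdef".
content = "abcdef\r" * 5
shown = render_with_carriage_returns("XXX{}YYY".format(content))
print(shown)  # YYYdefdef
```

The first "XXXabcdef" is nine characters wide. Each later "abcdef" overwrites the first six columns, and the final "YYY" overwrites the first three, which leaves "YYYdefdef".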
