Python thinks a 3000-line text file is one line long?
I have a very long text file that I'm trying to process using Python.
However, the following code:
for line in open('textbase.txt', 'r'):
    print 'hello world'
produces only the following output:
hello world
It's as though Python thinks the file is only one line long, though it is many thousands of lines long, when viewed in a text editor. Examining it on the command line using the file command gives:
$ file textbase.txt
textbase.txt: Big-endian UTF-16 Unicode English text, with CR line terminators
Is something wrong? Do I need to change the line terminators?
According to the documentation for open(), you should add a U to the mode:
open('textbase.txt', 'Ur')
This enables "universal newlines", which normalizes them to \n in the strings it gives you.
However, the correct thing to do is to decode the UTF-16BE into Unicode objects first, before translating the newlines. Otherwise, a chance 0x0d byte could get erroneously turned into a 0x0a, resulting in:
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 12: truncated data
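To see why the order matters, here is a minimal sketch (Python 3) that simulates byte-level newline translation being applied before decoding. It relies on the fact that U+0D0A is a character whose UTF-16BE encoding happens to be the bytes 0x0d 0x0a, i.e. a CR/LF pair at the byte level:

```python
# U+0D0A encodes in UTF-16BE as the two bytes 0x0d 0x0a.
data = '\u0d0a'.encode('utf-16-be')      # b'\r\n'

# Simulate newline translation on the raw bytes, *before* decoding:
# the CR/LF pair collapses to a single byte, leaving an odd-length stream.
mangled = data.replace(b'\r\n', b'\n')   # b'\n'

try:
    mangled.decode('utf-16-be')
except UnicodeDecodeError as err:
    print(err)  # ... truncated data
```

The odd-length byte stream is exactly what produces the "truncated data" error quoted above.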
Python's codecs module supplies an open function that can decode Unicode and handle newlines at the same time:
import codecs
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    ...
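In Python 3, where the U mode flag is gone, the built-in open() already does this in the right order: it decodes the bytes first and then applies universal newlines (the default newline=None) to the decoded text. A minimal sketch using a throwaway file (the filename and sample lines are made up):

```python
import os
import tempfile

# Build a small UTF-16BE file with CR line terminators (made-up sample data).
text = 'line one\rline two\rline three'
path = os.path.join(tempfile.mkdtemp(), 'textbase.txt')
with open(path, 'wb') as f:
    f.write(text.encode('utf-16-be'))

# The default newline=None enables universal newlines on the *decoded*
# text, so the CR terminators are recognized after UTF-16BE decoding.
with open(path, 'r', encoding='utf-16-be') as f:
    lines = f.readlines()

print(len(lines))  # 3
```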
If the file has a byte order mark (BOM) and you specify 'utf-16', then it detects the endianness and hides the BOM for you. If it does not have one (the BOM is optional), the decoder will just fall back to your system's endianness, which is probably not what you want.
Specifying the endianness yourself (with 'utf-16be') will not hide the BOM, so you might wish to use this hack:
import codecs
firstline = True
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    if firstline:
        firstline = False
        line = line.lstrip(u'\ufeff')
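The difference between the two codec names can be seen directly on the same bytes (a small Python 3 sketch with made-up sample text):

```python
raw = '\ufeffhello'                    # text beginning with an explicit BOM
data = raw.encode('utf-16-be')

# 'utf-16' reads the BOM, picks the endianness, and strips it...
print(repr(data.decode('utf-16')))     # 'hello'

# ...while 'utf-16-be' treats it as an ordinary character (U+FEFF),
# which is why the lstrip hack above is needed.
print(repr(data.decode('utf-16-be')))  # '\ufeffhello'
```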
See also: Python Unicode HOWTO