Python thinks a 3000-line text file is one line long?


Problem Description


I have a very long text file that I'm trying to process using Python.

However, the following code:

for line in open('textbase.txt', 'r'):
    print 'hello world'

produces only the following output:

hello world

It's as though Python thinks the file is only one line long, though it is many thousands of lines long, when viewed in a text editor. Examining it on the command line using the file command gives:

$ file textbase.txt
textbase.txt: Big-endian UTF-16 Unicode English text, with CR line terminators

Is something wrong? Do I need to change the line terminators?

Solution

According to the documentation for open(), you should add a U to the mode:

open('textbase.txt', 'Ur')

This enables "universal newlines", which normalizes them to \n in the strings it gives you.
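
For example, reading a small file with mixed terminators (a throwaway sketch; the file name and contents are invented here just to show the effect):

# 'U' mode recognizes \r, \n and \r\n as line endings and hands every line
# back terminated with '\n'.
with open('newline_demo.txt', 'wb') as f:
    f.write('one\rtwo\r\nthree\n')

for line in open('newline_demo.txt', 'Ur'):
    print repr(line)    # 'one\n', 'two\n', 'three\n'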

However, the correct thing to do is to decode the UTF-16BE into Unicode objects first, before translating the newlines. Otherwise, a chance 0x0d byte could get erroneously turned into a 0x0a, resulting in

UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 12: truncated data.
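
To see why, here is a deliberately contrived sketch (the file name and sample text are invented): in UTF-16BE the character U+010D encodes to the bytes 0x01 0x0D, so byte-level newline translation both splits the line in the middle of the character and rewrites it to U+010A. When the translation instead collapses a stray 0x0d 0x0a byte pair, a byte disappears and decoding then fails with the truncated-data error above.

# A hedged sketch (Python 2); 'demo.txt' and its contents are invented.
# u'\u010d' encodes to the bytes 0x01 0x0D in UTF-16BE.
data = u'ab\u010dcd\r'.encode('utf-16-be')
with open('demo.txt', 'wb') as f:
    f.write(data)

# 'Ur' operates on the raw bytes, so the 0x0D inside the character is taken
# for a CR terminator: one logical line comes back as two, and the character
# decodes as U+010A instead of U+010D.
pieces = list(open('demo.txt', 'Ur'))
print len(pieces)                           # 2
print repr(pieces[0].decode('utf-16-be'))   # u'ab\u010a' -- silently corrupted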

Python's codecs module supplies an open function that can decode Unicode and handle newlines at the same time:

import codecs
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    ...

If the file has a byte order mark (BOM) and you specify 'utf-16', then it detects the endianness and hides the BOM for you. If it does not (since the BOM is optional), then that decoder will just go ahead and use your system's endianness, which probably won't be good.
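
In other words, when the BOM is known to be there, a sketch along these lines (assuming the file really does start with a BOM) lets the codec work out the byte order on its own:

import codecs

# A sketch assuming textbase.txt begins with a BOM: the 'utf-16' codec reads
# the BOM, picks big- or little-endian accordingly, and leaves the BOM out of
# the decoded text.
for line in codecs.open('textbase.txt', 'Ur', 'utf-16'):
    print line.rstrip()    # 'line' is a unicode object with no BOM in it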

Specifying the endianness yourself (with 'utf-16be') will not hide the BOM, so you might wish to use this hack:

import codecs
firstline = True
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    if firstline:
        firstline = False
        # with an explicit byte order the BOM is not stripped for you; it
        # decodes to U+FEFF at the start of the first line, so remove it here
        line = line.lstrip(u'\ufeff')
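
As an aside, if io.open is available (Python 2.6 and later, or Python 3), it offers a similar route as a sketch: it decodes the bytes first and only then applies universal-newline translation to the resulting text, so the byte-level problem above cannot occur, and with encoding='utf-16' it will also consume a BOM and pick the byte order (falling back to the platform's native order when no BOM is present).

import io

# io.open yields unicode lines; newline translation happens after decoding.
for line in io.open('textbase.txt', encoding='utf-16'):
    print line.rstrip()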

See also: Python Unicode HOWTO

