Is it possible for Python to read non-ASCII text from a file?


Question

I have a .txt file that is UTF-8 encoded and I'm having trouble reading it into Python. I have a large number of files, so converting them all would be cumbersome.

So if I read the file in via

for line in file_obj:
    ...

I get the following error:

  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 291: ordinal not in range(128)

I guess x.decode("utf-8") wouldn't work since the error occurs before the line is even read in.

Recommended answer

There are two options.

  1. Specify the encoding when opening the file, instead of using the default.
  2. Open the file in binary mode, and explicitly decode from bytes to str.

The first is obviously the simpler one. You don't show how you're opening the file, but assuming your code looks like this:

with open(path) as file_obj:
    for line in file_obj:
        ...

Do this instead:

with open(path, encoding='utf-8') as file_obj:
    for line in file_obj:
        ...

That's it.
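
The second option, opening in binary mode and decoding explicitly, might look roughly like the sketch below. The helper name read_lines_binary and the demo file are hypothetical, made up for illustration, and the sketch assumes every file really is valid UTF-8.

```python
import os
import tempfile


def read_lines_binary(path, encoding="utf-8"):
    """Open a file in binary mode and decode each line from bytes to str.

    Hypothetical helper for illustration; it is not part of any library.
    """
    lines = []
    with open(path, "rb") as file_obj:
        for raw_line in file_obj:
            # raw_line is a bytes object; decode() turns it into a str.
            lines.append(raw_line.decode(encoding))
    return lines


# Demo on a throwaway file containing non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write("caf\u00e9 \u00a9\nplain\n".encode("utf-8"))
    decoded = read_lines_binary(demo_path)

print(decoded)
```

In binary mode no decoding happens while the lines are read, so a decoding error can only come from the explicit decode() call, where it can be caught or given an errors= policy.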

As the docs explain, if you don't specify an encoding in text mode:

The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.

In some cases (e.g., any OS X system, or Linux with an appropriate configuration), locale.getpreferredencoding() will always be 'UTF-8'. But it can never be "automatically whatever's right for any file I might open". So if you know a file is UTF-8, you should specify that explicitly.
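
To see which default a given machine would use, and to confirm that an explicit encoding sidesteps it, something like the following sketch may help. The sample file and its contents are invented for the demonstration; only locale.getpreferredencoding() and the encoding argument of open() come from the discussion above.

```python
import locale
import os
import tempfile

# The encoding that open() falls back to in text mode when none is given.
# Passing False avoids changing the process locale as a side effect.
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding on this platform:", default_encoding)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")

    # Write a small UTF-8 file containing non-ASCII characters.
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("na\u00efve \u00a9 2013\n")

    # Reading with an explicit encoding does not depend on the default.
    with open(sample_path, encoding="utf-8") as file_obj:
        lines = [line for line in file_obj]

print(lines)
```

If the printed default is an ASCII variant such as 'US-ASCII' or 'ANSI_X3.4-1968', a plain open(path) would hit the UnicodeDecodeError from the question on the first non-ASCII byte.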
