Unicode error handling with Python 3's readlines()
Question
I keep getting this error while reading a text file. Is it possible to handle/ignore it and proceed?
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>
Recommended answer
In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object (presuming it to be an instance of io.TextIOWrapper; if it isn't, consider wrapping it in one). Also consider passing a more likely encoding than charmap, which in this error message typically means the platform's default encoding (for example cp1252 on Windows) that open() uses when no encoding is given. When you aren't sure, utf-8 is always a good place to start.
For example:
f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
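To make the difference between the two handlers concrete, here is a minimal sketch that writes a small temporary file containing the 0x81 byte from the question and reads it back with readlines(). The file contents and the temporary path are made up for illustration; only open(), errors= and readlines() come from the answer.

```python
import os
import tempfile

# Build a throwaway file (hypothetical contents) with a stray 0x81 byte,
# which is invalid in UTF-8 and undefined in cp1252.
path = os.path.join(tempfile.mkdtemp(), "misc-notes.txt")
with open(path, "wb") as fh:
    fh.write(b"first line\nbad \x81 byte\nlast line\n")

# errors='ignore' silently drops the undecodable byte.
with open(path, encoding="utf-8", errors="ignore") as f:
    ignored = f.readlines()

# errors='replace' substitutes U+FFFD (the replacement character) instead,
# so the damage stays visible in the output.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.readlines()

print(ignored[1])
print(replaced[1])
```

If you need to know afterwards that data was lost, 'replace' is usually the safer choice, because 'ignore' leaves no trace of the dropped bytes.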
In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don't have a better guess for their real encoding:
your_string.decode('utf-8', 'replace')
...to replace unhandled characters, or
your_string.decode('utf-8', 'ignore')
...to simply ignore them.
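The same decode() call with an error handler also works on bytes objects in Python 3, for instance when a file was opened in binary mode. A small sketch with made-up sample bytes:

```python
# Sample bytes (hypothetical): valid UTF-8 for "café", then a stray 0x81.
raw = b"caf\xc3\xa9 \x81 end"

# 'replace' turns the undecodable byte into U+FFFD.
replaced = raw.decode("utf-8", "replace")

# 'ignore' drops the undecodable byte entirely.
ignored = raw.decode("utf-8", "ignore")

# Without a handler, the default 'strict' mode raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(replaced)
print(ignored)
```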
That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.
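One rough way to narrow down the real encoding is to try a few plausible candidates in strict mode and keep the first one that decodes cleanly. The sketch below is an illustration rather than part of the original answer: the helper name decode_with_candidates and the candidate list are assumptions, and a clean decode does not prove the guess is right, since several single-byte encodings will accept the same bytes.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding_name) for the first
    candidate encoding that decodes `data` without errors."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never raises;
    # it is used here only as a last-resort fallback.
    return data.decode("latin-1"), "latin-1"


text, used = decode_with_candidates(b"caf\xe9")
print(used, text)
```

Putting utf-8 first is deliberate: it rejects most non-UTF-8 input, so a successful strict decode is a reasonably strong signal, whereas single-byte codecs accept almost anything.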