Is there a Python library function which attempts to guess the character encoding of some bytes?
Question
I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get a unicode string approximating the original one without raising a UnicodeDecodeError.
So, I'm looking for a function that takes a str and, optionally, some hints, and does its darndest to give me back a unicode. I could write one of course, but if such a function exists, its author has probably thought a bit more deeply about the best way to go about this.
I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".
Recommended answer
As far as I can tell, the standard library doesn't have such a function, though it's not too difficult to write one as suggested above. I think what I was really looking for was a way to decode a string with a guarantee that it wouldn't raise an exception. The errors parameter to str.decode does that.
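To illustrate the errors parameter on its own, here is a minimal sketch. It is written in Python 3 syntax, where decode lives on bytes rather than str, and the sample bytes are made up for illustration rather than taken from a real message.

```python
# Hypothetical header bytes: mostly ASCII with one stray 0xE9 byte.
raw = b"Caf\xe9 menu"

# With the default errors="strict", the stray byte raises an exception.
strict_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops bytes that cannot be decoded.
ignored = raw.decode("ascii", errors="ignore")

# errors="replace" substitutes U+FFFD (the replacement character) instead.
replaced = raw.decode("ascii", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

Whether "ignore" or "replace" is preferable probably depends on whether you want the output to show that something was lost.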
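def decode(s, encodings=('ascii', 'utf8', 'latin1')):
    # Try each candidate encoding in order; return the first clean decode.
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Fallback if every candidate failed: drop any non-ASCII bytes.
    return s.decode('ascii', 'ignore')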
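The function above targets Python 2, where str holds bytes. A rough Python 3 adaptation might look like the sketch below. The name guess_decode and the choice to also return the encoding that succeeded are assumptions made for illustration, not part of the original answer. Note that latin-1 maps every possible byte to a character, so with the default candidate list the final fallback is never reached; it only matters if the caller passes a list without such a catch-all.

```python
def guess_decode(data, encodings=("ascii", "utf-8", "latin-1")):
    """Return (text, encoding_used) for the first encoding that decodes cleanly.

    Hypothetical helper: the name and the tuple return value are illustrative.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only if every candidate failed; mark undecodable bytes with U+FFFD.
    return data.decode("ascii", errors="replace"), None


print(guess_decode(b"plain"))                                  # pure ASCII
print(guess_decode("é".encode("utf-8")))                       # valid UTF-8
print(guess_decode(b"Caf\xe9"))                                # falls through to latin-1
print(guess_decode(b"Caf\xe9", encodings=("ascii", "utf-8")))  # hits the fallback
```

Ordering matters here: stricter encodings go first, because a permissive one such as latin-1 will accept any input and so ends the search.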