Is there a Python library function which attempts to guess the character-encoding of some bytes?

Question

I'm writing some mail-processing software in Python that is encountering strange bytes in header fields. I suspect this is just malformed mail; the message itself claims to be us-ascii, so I don't think there is a true encoding, but I'd like to get out a unicode string approximating the original one without throwing a UnicodeDecodeError.

So, I'm looking for a function that takes a str and optionally some hints and does its darndest to give me back a unicode. I could write one of course, but if such a function exists its author has probably thought a bit deeper about the best way to go about this.

I also know that Python's design prefers explicit to implicit and that the standard library is designed to avoid implicit magic in decoding text. I just want to explicitly say "go ahead and guess".

Recommended answer

As far as I can tell, the standard library doesn't have such a function, though it's not too difficult to write one as suggested above. I think what I was really looking for was a way to decode a string with a guarantee that it wouldn't throw an exception. The errors parameter to str.decode does that.

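def decode(s, encodings=('ascii', 'utf8', 'latin1')):
    """Return s decoded with the first encoding that accepts all of its bytes."""
    for encoding in encodings:
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Only reached when every encoding failed; 'latin1' accepts any byte,
    # so with the default tuple this is never hit. Drop non-ASCII bytes.
    return s.decode('ascii', 'ignore')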
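
For comparison, here is a minimal sketch of the same idea that leans on the errors parameter more directly. It assumes Python 3, where the input is a bytes object, and the name decode_lossy and the sample header are made up for illustration. Instead of silently dropping bad bytes with 'ignore', it uses 'replace', which leaves a U+FFFD marker wherever a byte could not be decoded.

```python
def decode_lossy(raw, encodings=("utf-8",), fallback="ascii"):
    """Try each encoding strictly, then fall back to a lossy decode.

    decode_lossy is a hypothetical helper, not a standard-library function.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # errors="replace" never raises: each undecodable byte becomes U+FFFD.
    return raw.decode(fallback, errors="replace")


# A made-up header that claims us-ascii but carries a stray Latin-1 byte.
header = b"Subject: caf\xe9 menu"
text = decode_lossy(header)
```

Whether 'replace' or 'ignore' is the better fallback probably depends on whether you want the damage to stay visible in the result.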
