Best way to decode an unknown character encoding in Python 2.5


Question


Have I got that all the right way round? Anyway, I am parsing a lot of html, but I don't always know what encoding it's meant to be (a surprising number lie about it). The code below easily shows what I've been doing so far, but I'm sure there's a better way. Your suggestions would be much appreciated.

    import logging
    import codecs
    from utils.error import Error
    
    class UnicodingError(Error):
        pass
    
    # these encodings should be in most likely order to save time
    encodings = [ "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500", "cp737", "cp775", "cp850", "cp852", "cp855", 
        "cp856", "cp857", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949", 
        "cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", 
        "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", 
        "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", 
        "iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "johab", "koi8_r", "koi8_u", 
        "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", 
        "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig" ]
    
    def make_unicode(string):
        '''Try each candidate encoding in turn until one decodes cleanly.'''
        for enc in encodings:
            try:
                logging.debug("unicoder is trying " + enc + " encoding")
                # call the built-in unicode(); naming this function
                # unicode() would shadow the built-in and recurse forever
                decoded = unicode(string, enc)
                logging.info("unicoder is using " + enc + " encoding")
                return decoded
            except (UnicodeError, LookupError):
                # decoding failures raise UnicodeError subclasses,
                # not our custom UnicodingError
                if enc == encodings[-1]:
                    raise UnicodingError("still don't recognise the encoding after trying to guess")
    

    Solution

    There are general-purpose libraries for detecting unknown encodings:

    chardet is supposed to be a port of the way Firefox does it.
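As a quick illustration (assuming the third-party `chardet` package is installed), detection is a single call; `detect` returns a dict with the guessed encoding and a confidence score:

```python
import chardet

# chardet inspects the raw bytes and returns its best guess
# along with a confidence between 0.0 and 1.0.
sample = u"Это проверка кодировки, привет мир, текст для примера".encode("utf-8")
result = chardet.detect(sample)
print(result["encoding"], result["confidence"])
```

When the guess is low-confidence, falling back on a manual encoding list (as in the question) is still a reasonable strategy.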

    You can use the following regex to detect utf8 from byte strings:

    import re
    
    utf8_detector = re.compile(r"""^(?:
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*$""", re.X)
    
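As a quick sanity check, the pattern above can be exercised directly; the snippet below uses Python 3 syntax (a bytes pattern) so it runs on a current interpreter, whereas under 2.5 you would match against a plain byte `str`:

```python
import re

# Same structure as the detector above, compiled as a bytes regex.
utf8_detector = re.compile(rb"""^(?:
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$""", re.X)

print(bool(utf8_detector.match(u"caf\u00e9".encode("utf-8"))))   # → True
print(bool(utf8_detector.match(u"caf\u00e9".encode("latin-1")))) # → False (lone 0xE9)
```

A lone 0xE9 byte fails because the regex treats it as the lead byte of a 3-byte sequence and finds no continuation bytes after it.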

    In practice if you're dealing with English I've found the following works 99.9% of the time:

    1. if it passes the above regex, it's ascii or utf8
    2. if it contains any bytes from 0x80-0x9f but not 0xa4, it's Windows-1252
    3. if it contains 0xa4, assume it's latin-15
    4. otherwise assume it's latin-1
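
The four rules above can be folded into one small function. This is a sketch in Python 3 (the answer targets 2.5, where you would iterate over a byte `str` with `ord()`), reading "latin-15" as ISO 8859-15; `guess_encoding` is an illustrative name, not part of any library:

```python
import re

# Condensed version of the UTF-8 detector quoted earlier.
utf8_detector = re.compile(rb"""^(?:
     [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF]
   | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
   | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2}
   | [\xF1-\xF3][\x80-\xBF]{3}  | \xF4[\x80-\x8F][\x80-\xBF]{2}
  )*$""", re.X)

def guess_encoding(data):
    """Apply the four heuristics above to a byte string."""
    if utf8_detector.match(data):
        return "utf-8"                # rule 1: also covers pure ASCII
    if any(0x80 <= b <= 0x9F for b in data) and 0xA4 not in data:
        return "windows-1252"         # rule 2: C1 range used as printable chars
    if 0xA4 in data:
        return "iso-8859-15"          # rule 3: 0xA4 is the euro sign there
    return "latin-1"                  # rule 4: fallback
```

For example, `guess_encoding(b"smart \x93quotes\x94")` returns `"windows-1252"`, since 0x93/0x94 are curly quotes there but C1 control codes in the ISO 8859 family.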
