Best way to decode unknown unicoding encoding in Python 2.5

Views: 236  Published: 2016/11/19 13:18:19  Tags: python html unicode encoding character-encoding

Question

Have I got that all the right way round? Anyway, I am parsing a lot of HTML, but I don't always know what encoding it's meant to be (a surprising number of pages lie about it). The code below shows what I've been doing so far, but I'm sure there's a better way. Your suggestions would be much appreciated.

import logging
import codecs
from utils.error import Error

class UnicodingError(Error):
    pass

# these encodings should be in most likely order to save time
encodings = [ "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500", "cp737", "cp775", "cp850", "cp852", "cp855", 
    "cp856", "cp857", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949", 
    "cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", 
    "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", 
    "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", 
    "iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "johab", "koi8_r", "koi8_u", 
    "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", 
    "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig" ]

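def to_unicode(string):
    '''Decode a byte string by trying each candidate encoding in order.'''
    # Named to_unicode so it does not shadow the built-in unicode() it calls.
    for enc in encodings:
        try:
            logging.debug("unicoder is trying " + enc + " encoding")
            decoded = unicode(string, enc)
        except (UnicodeError, LookupError):
            # Wrong encoding, or a codec this Python build does not provide.
            continue
        logging.info("unicoder is using " + enc + " encoding")
        return decoded
    raise UnicodingError("still don't recognise encoding after trying to guess.")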

Solution

There are two general purpose libraries for detecting unknown encodings:

chardet, part of Universal Feed Parser
UnicodeDammit, part of Beautiful Soup

chardet is supposed to be a port of the way that Firefox does it.
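As a rough illustration of the library route, here is a minimal sketch using chardet. It is written for Python 3 rather than the Python 2.5 of the question, which is an assumption on my part. `chardet.detect()` is the library's real entry point and returns a dict with `encoding` and `confidence` keys; the helper name `guess_and_decode` and the `latin-1` fallback are hypothetical choices made only for this example.

```python
# Sketch only: assumes Python 3 and, optionally, the third-party chardet package.
try:
    import chardet  # pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, using only the fallback


def guess_and_decode(data, fallback="latin-1"):
    """Return (text, encoding_used) for a byte string of unknown encoding.

    guess_and_decode and the latin-1 fallback are illustrative choices,
    not part of chardet itself.
    """
    if chardet is not None:
        guess = chardet.detect(data)  # dict with 'encoding' and 'confidence'
        encoding = guess.get("encoding")
        if encoding:
            try:
                return data.decode(encoding), encoding
            except (UnicodeDecodeError, LookupError):
                pass  # the guess was wrong or names an unknown codec
    # latin-1 maps every byte to a code point, so this decode cannot fail.
    return data.decode(fallback), fallback


text, used = guess_and_decode(b"plain ASCII markup")
print(used, repr(text))
```

A detector only returns a statistical guess, so checking `confidence` before trusting the result may be worthwhile for short inputs.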

You can use the following regex to detect utf8 from byte strings:

import re

utf8_detector = re.compile(r"""^(?:
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$""", re.X)
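For readers on Python 3 (an assumption; the answer itself targets Python 2.5), the same pattern has to be compiled as a bytes pattern before it can be matched against byte strings. The sketch below shows one way to wrap it; the names `UTF8_DETECTOR` and `looks_like_utf8` are hypothetical.

```python
import re

# Same pattern as above, compiled as a bytes regex (rb"...") for Python 3.
UTF8_DETECTOR = re.compile(rb"""^(?:
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*$""", re.X)


def looks_like_utf8(data):
    """Return True if the whole byte string matches the UTF-8 pattern."""
    return UTF8_DETECTOR.match(data) is not None


print(looks_like_utf8("caf\u00e9".encode("utf-8")))  # well-formed 2-byte sequence
print(looks_like_utf8(b"caf\xe9"))                   # lone Latin-1 byte
```

Note that the pattern also rejects most ASCII control characters, so it is slightly stricter than a plain `data.decode("utf-8")` attempt.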

In practice, if you're dealing with English, I've found the following works 99.9% of the time:

if it passes the above regex, it's ASCII or UTF-8
if it contains any bytes from 0x80-0x9F but not 0xA4, it's Windows-1252
if it contains 0xA4, assume it's ISO-8859-15 (Latin-9)
otherwise assume it's Latin-1
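The four rules above could be sketched roughly as follows, again assuming Python 3. To keep the example self-contained, the UTF-8 check is swapped for a strict `decode("utf-8")` attempt instead of the regex, which is a little more permissive about control characters. The function name `guess_western_encoding` is hypothetical; the returned strings are standard Python codec names.

```python
def guess_western_encoding(data):
    """Apply the heuristic above to a byte string and return a codec name."""
    # Rule 1: valid UTF-8 (which includes pure ASCII).
    # Assumption: a strict decode stands in for the regex test here.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    has_c1_bytes = any(0x80 <= byte <= 0x9F for byte in data)
    has_a4 = 0xA4 in data

    # Rule 2: bytes in 0x80-0x9F but no 0xA4 suggest Windows-1252.
    if has_c1_bytes and not has_a4:
        return "cp1252"
    # Rule 3: 0xA4 is the euro sign in ISO-8859-15.
    if has_a4:
        return "iso8859_15"
    # Rule 4: otherwise fall back to Latin-1.
    return "latin_1"


sample = b"\x93quoted\x94"  # curly quotes as encoded by Windows-1252
encoding = guess_western_encoding(sample)
print(encoding, sample.decode(encoding))
```

This only distinguishes a handful of Western encodings, so it fits the "dealing with English" caveat and should not be expected to work for other scripts.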
