Encoding detection library in Python


Problem description

This is related to my question here.

I process tons of texts (mainly HTML and XML) fetched via HTTP. I'm looking for a library in Python that can do smart encoding detection based on different strategies and convert texts to Unicode using the best possible character-encoding guess.

I found that chardet does auto-detection extremely well. However, auto-detecting everything is a problem because it is SLOW and very much against all standards. As per the chardet FAQ, I don't want to screw the standards.

From the same FAQ, here is the list of places where I want to look for the encoding:


  • charset parameter in HTTP Content-type header.
  • <meta http-equiv="content-type"> element in the <head> of a web page for HTML documents.
  • encoding attribute in the XML prolog for XML documents.
  • Auto-detect the character encoding as a last resort.

Basically, I want to be able to look in all those places and also deal with conflicting information automatically.
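That lookup order can be sketched with the standard library alone. This is an illustrative sketch, not a hardened parser: the function name, the 1024-byte peek, and the regexes are my own choices, and conflict resolution between sources is not shown (it simply takes the first match in priority order).

```python
import re

def guess_encoding(headers, body_bytes):
    """Return the first encoding found, following the priority list above."""
    # 1. charset parameter in the HTTP Content-Type header.
    ctype = headers.get("Content-Type", "")
    m = re.search(r'charset=["\']?([\w-]+)', ctype, re.I)
    if m:
        return m.group(1)

    # Peek at the start of the document; declarations must appear early.
    head = body_bytes[:1024].decode("ascii", errors="ignore")

    # 2. <meta ... charset=...> element in the <head> of an HTML page
    #    (covers both http-equiv and HTML5-style meta tags).
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1)

    # 3. encoding attribute in the XML prolog of an XML document.
    m = re.search(r'<\?xml[^>]+encoding=["\']([\w-]+)', head, re.I)
    if m:
        return m.group(1)

    # 4. Last resort: hand the bytes to a statistical detector such as
    #    chardet (omitted here to keep the sketch dependency-free).
    return None
```

A real implementation would normalize the returned names (e.g. via `codecs.lookup`) and decide what to do when the header and the document disagree.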

Is there such a library out there, or do I need to write it myself?

Recommended answer

BeautifulSoup (the HTML parser) incorporates a class called UnicodeDammit that does just that. Have a look and see if you like it.
