Encoding detection library in python
Question
This is related to my question here.
I process tons of text (mainly HTML and XML) fetched via HTTP. I'm looking for a Python library that can do smart encoding detection based on different strategies and convert texts to Unicode using the best possible character-encoding guess.
I found that chardet does auto-detection extremely well. However, auto-detecting everything is the problem because it is SLOW and very much against all standards. As per the chardet FAQ, I don't want to screw up the standards.
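For reference, chardet's detection API is a single call; a minimal sketch (chardet is a third-party package, assumed installed; the sample text is my own illustration):

```python
import chardet  # third-party auto-detection library discussed above

# A byte string whose encoding we pretend not to know; repeated so the
# detector has enough data to be confident.
mystery = "这是一段用于演示编码自动检测的中文文本。".encode("utf-8") * 3

# detect() returns a dict such as {'encoding': 'utf-8', 'confidence': 0.99, ...}
guess = chardet.detect(mystery)
text = mystery.decode(guess["encoding"])
print(guess["encoding"], guess["confidence"])
```

This is exactly the "last resort" path: it works, but it has to scan the whole byte string, which is why doing it for every document is slow.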
From the same FAQ, here is the list of places where I want to look for the encoding:
- charset parameter in the HTTP Content-type header.
- <meta http-equiv="content-type"> element in the <head> of a web page, for HTML documents.
- encoding attribute in the XML prolog, for XML documents.
- Auto-detect the character encoding as a last resort.
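The lookup order above can be sketched with the standard library alone. The function name and regexes below are my own illustration, not an existing library; auto-detection (e.g. chardet) is left as the caller's fallback:

```python
import re

def sniff_encoding(content_type, body):
    """Guess the character encoding of an HTTP response body.

    Checks, in order of reliability:
      1. the charset parameter of the Content-Type header,
      2. the encoding attribute of an XML prolog,
      3. a <meta http-equiv="content-type"> charset in the HTML,
    and returns None if all three are absent, at which point a caller
    would fall back to auto-detection (e.g. chardet).
    """
    # 1. Content-Type header, e.g. "text/html; charset=ISO-8859-1"
    if content_type:
        m = re.search(r'charset=["\']?([\w.-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()

    # Decode just enough of the body to inspect its markup; the
    # declarations we look for are ASCII-compatible.
    head = body[:1024].decode("ascii", errors="replace")

    # 2. XML prolog, e.g. <?xml version="1.0" encoding="utf-8"?>
    m = re.search(r'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head, re.I)
    if m:
        return m.group(1).lower()

    # 3. HTML meta tag, e.g.
    #    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    m = re.search(r'<meta[^>]+charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).lower()

    return None  # last resort: hand off to an auto-detector
```

A real implementation also has to decide which source wins when they disagree; HTTP's rule is that the Content-Type header takes precedence over in-document declarations, which is why it is checked first here.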
Basically I want to be able to look in all those places and also deal with conflicting information automatically.
Is there such a library out there, or do I need to write it myself?
Answer
BeautifulSoup (the HTML parser) includes a class called UnicodeDammit that does just that. Have a look and see if you like it.
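A minimal sketch of how UnicodeDammit is typically used (beautifulsoup4 assumed installed; the candidate-encoding list and sample bytes are illustrative):

```python
from bs4 import UnicodeDammit  # pip install beautifulsoup4

raw = "Sacré bleu!".encode("latin-1")

# Candidate encodings are tried in order before auto-detection kicks in;
# UTF-8 fails on the lone 0xE9 byte here, so latin-1 is used instead.
dammit = UnicodeDammit(raw, ["utf-8", "latin-1"])
print(dammit.unicode_markup)     # the decoded text
print(dammit.original_encoding)  # the encoding that worked
```

When given a full HTML document, UnicodeDammit also inspects any <meta> charset declaration in the markup, which covers several of the lookup sources listed in the question.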