Encoding detection library in python
Question
This is related to my question here.
I process tons of text (mainly HTML and XML) fetched via HTTP. I'm looking for a Python library that can do smart encoding detection based on different strategies and convert texts to Unicode using the best possible character-encoding guess.
I found that chardet does auto-detection extremely well. However, auto-detecting everything is the problem because it is SLOW and very much against all standards. As per the chardet FAQ, I don't want to screw up the standards.
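For reference, chardet's detection API is a single call; a minimal sketch (chardet is a third-party package, assumed installed; the sample text is my own illustration):

```python
import chardet  # third-party auto-detection library discussed above

# A byte string whose encoding we pretend not to know; repeated so the
# detector has enough data to be confident.
mystery = "这是一段用于演示编码自动检测的中文文本。".encode("utf-8") * 3

# detect() returns a dict such as {'encoding': 'utf-8', 'confidence': 0.99, ...}
guess = chardet.detect(mystery)
text = mystery.decode(guess["encoding"])
print(guess["encoding"], guess["confidence"])
```

This is exactly the "last resort" path: it works, but it has to scan the whole byte string, which is why doing it for every document is slow.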
From the same FAQ, here is the list of places where I want to look for the encoding:
- charset parameter in the HTTP Content-type header.
- <meta http-equiv="content-type"> element in the <head> of a web page, for HTML documents.
- encoding attribute in the XML prolog, for XML documents.
- Auto-detect the character encoding as a last resort.
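The lookup order above can be sketched with the standard library alone. The function name and regexes below are my own illustration, not an existing library; auto-detection (e.g. chardet) is left as the caller's fallback:

```python
import re

def sniff_encoding(content_type, body):
    """Guess the character encoding of an HTTP response body.

    Checks, in order of reliability:
      1. the charset parameter of the Content-Type header,
      2. the encoding attribute of an XML prolog,
      3. a <meta http-equiv="content-type"> charset in the HTML,
    and returns None if all three are absent, at which point a caller
    would fall back to auto-detection (e.g. chardet).
    """
    # 1. Content-Type header, e.g. "text/html; charset=ISO-8859-1"
    if content_type:
        m = re.search(r'charset=["\']?([\w.-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()

    # Decode just enough of the body to inspect its markup; the
    # declarations we look for are ASCII-compatible.
    head = body[:1024].decode("ascii", errors="replace")

    # 2. XML prolog, e.g. <?xml version="1.0" encoding="utf-8"?>
    m = re.search(r'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head, re.I)
    if m:
        return m.group(1).lower()

    # 3. HTML meta tag, e.g.
    #    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    m = re.search(r'<meta[^>]+charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).lower()

    return None  # last resort: hand off to an auto-detector
```

A real implementation also has to decide which source wins when they disagree; HTTP's rule is that the Content-Type header takes precedence over in-document declarations, which is why it is checked first here.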
Basically I want to be able to look in all those places and also deal with conflicting information automatically.
Is there such a library out there, or do I need to write it myself?
Answer
BeautifulSoup (the HTML parser) includes a class called UnicodeDammit that does just that. Have a look and see if you like it.
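A minimal sketch of how UnicodeDammit is typically used (beautifulsoup4 assumed installed; the candidate-encoding list and sample bytes are illustrative):

```python
from bs4 import UnicodeDammit  # pip install beautifulsoup4

raw = "Sacré bleu!".encode("latin-1")

# Candidate encodings are tried in order before auto-detection kicks in;
# UTF-8 fails on the lone 0xE9 byte here, so latin-1 is used instead.
dammit = UnicodeDammit(raw, ["utf-8", "latin-1"])
print(dammit.unicode_markup)     # the decoded text
print(dammit.original_encoding)  # the encoding that worked
```

When given a full HTML document, UnicodeDammit also inspects any <meta> charset declaration in the markup, which covers several of the lookup sources listed in the question.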