Determine the encoding of text in Python

Question

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? (The related question "How can I detect the encoding/codepage of a text file" deals with C#.)

Recommended answer

Correctly detecting the encoding in every case is impossible.

(From chardet FAQ:)


However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
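
As a rough illustration, the sketch below asks chardet for its best guess on some raw bytes. It assumes the third-party chardet package is installed (pip install chardet); guess_encoding is a hypothetical helper name, not part of chardet, and the result is a statistical guess with a confidence score rather than a certainty.

```python
# Sketch: guess the encoding of raw bytes with chardet (third-party package).
try:
    import chardet
except ImportError:  # keep the sketch importable even when chardet is missing
    chardet = None


def guess_encoding(raw: bytes):
    """Return (encoding, confidence) as guessed by chardet, or (None, 0.0).

    guess_encoding is a hypothetical helper for this example only.
    """
    if chardet is None or not raw:
        return None, 0.0
    # chardet.detect returns a dict with 'encoding' and 'confidence' keys.
    result = chardet.detect(raw)
    return result["encoding"], result["confidence"]


if __name__ == "__main__":
    # Bytes whose encoding we pretend not to know.
    sample = "Привет, мир! Это пример текста на русском языке.".encode("koi8-r")
    encoding, confidence = guess_encoding(sample)
    print(encoding, confidence)
```

Longer samples generally give chardet more to work with; for a file, reading it in binary mode (open(path, "rb")) and passing the bytes to the helper is the usual pattern.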

You can also use UnicodeDammit. It will try the following methods:


  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
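
The fallback order in the list above can be approximated with the standard library alone. The sketch below is not UnicodeDammit's actual implementation: it skips the in-document declaration parsing and the chardet step, checks only byte-order marks when sniffing, and decode_with_fallbacks is a hypothetical name chosen for this example.

```python
import codecs
from typing import Optional, Tuple

# Byte-order marks, UTF-32 first so its BOM is not mistaken for the UTF-16 one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used), trying candidates in a fixed order.

    Hypothetical helper: a simplified cascade, not Beautiful Soup's code.
    """
    candidates = []
    if declared:  # an encoding the caller already knows or suspects
        candidates.append(declared)
    for bom, name in _BOMS:  # sniff the first few bytes for a BOM
        if raw.startswith(bom):
            candidates.append(name)
            break
    candidates.append("utf-8")

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next candidate

    # Last resort: Windows-1252, replacing the few bytes it leaves undefined.
    return raw.decode("windows-1252", errors="replace"), "windows-1252"


if __name__ == "__main__":
    print(decode_with_fallbacks("héllo".encode("cp1252")))
```

With Beautiful Soup installed, UnicodeDammit(data).unicode_markup holds the decoded text and UnicodeDammit(data).original_encoding reports the encoding it settled on, which is usually preferable to a hand-rolled cascade like this one.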
