How to determine the encoding of text?


Question


I received some encoded text, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? The related question "How can I detect the encoding/codepage of a text file" deals with C#.

Recommended answer


Correctly detecting the encoding every time is impossible.


(From chardet FAQ:)



However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.


The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
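As a minimal sketch of how this might look in practice, the snippet below passes raw bytes to chardet.detect, which returns a dictionary containing the guessed encoding and a confidence score. The helper name guess_encoding is hypothetical, and the fallback for a missing chardet install is an assumption added so the snippet runs anywhere; treat the result as an educated guess, not a guarantee.

```python
def guess_encoding(raw: bytes) -> dict:
    """Return chardet's guess for `raw`, or an empty guess if chardet is absent.

    `guess_encoding` is a hypothetical helper name used for illustration.
    """
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        # Assumption: without chardet we report "unknown" rather than failing.
        return {"encoding": None, "confidence": 0.0}
    # chardet.detect() takes bytes and returns a dict with at least
    # the keys "encoding" and "confidence".
    return chardet.detect(raw)


if __name__ == "__main__":
    sample = "Ceci est un texte accentué: é, è, à, ç.".encode("utf-8")
    guess = guess_encoding(sample)
    print(guess["encoding"], guess["confidence"])
```

For a file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the helper. Short inputs give the detector little to work with, so a low confidence value is worth checking before decoding.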


You can also use UnicodeDammit, which ships with Beautiful Soup. It tries the following methods, in order of priority:



  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
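The tail of that fallback chain can be approximated with the standard library alone. The sketch below is not UnicodeDammit's actual implementation: it only sniffs a byte-order mark, then tries UTF-8, then falls back to Windows-1252, and it skips the in-document declaration and chardet steps entirely. The function name decode_with_fallbacks and the BOM table are assumptions made for illustration.

```python
import codecs

# Byte-order marks, checked longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(raw: bytes):
    """Return (text, encoding_used) for `raw` (hypothetical helper)."""
    # Step 1: sniff the first few bytes for a byte-order mark.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM while decoding.
            return raw.decode(name), name
    # Step 2: try strict UTF-8; invalid byte sequences raise an error.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: last resort. A few bytes are undefined in cp1252,
    # so replace them instead of raising.
    return raw.decode("cp1252", errors="replace"), "cp1252"


if __name__ == "__main__":
    print(decode_with_fallbacks("café".encode("utf-8")))
    print(decode_with_fallbacks(b"caf\xe9"))
```

With Beautiful Soup installed, UnicodeDammit(raw) performs the full sequence; the decoded text is then available as unicode_markup and the chosen encoding as original_encoding.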
