Detect encoding in wrongly encoded UTF-8 text file

Problem description

I am having a problem with encoding.

I have millions of text files that I need to parse for a language data science project. Each text file is encoded as UTF-8, but I just found that some of these source files are not encoded properly.

For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this:

Subject: »Ø¸´: ÎÒÉý¼¶µ½

When I use Python to detect the encoding of this Chinese text file:

Chardet tells me the file is encoded as UTF-8:

import chardet

# Read the raw bytes and let chardet guess the encoding
with open(path, 'rb') as f:
    data = f.read()
encoding = chardet.detect(data)['encoding']

UnicodeDammit also tells me the file is encoded as UTF-8:

from bs4 import UnicodeDammit

# UnicodeDammit also inspects the raw bytes and reports its best guess
with open(path, 'rb') as f:
    data = f.read()
encoding = UnicodeDammit(data).original_encoding

Meanwhile, I know it's not UTF-8; it should be the GB2312 Chinese encoding instead. If I open this file in Notepad++, it is also detected as UTF-8, and all Chinese characters show as gibberish. Only if I manually switch the encoding in Notepad++ to GB2312 do I get the proper text:

Subject: 回复: 我升级到
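
This suggests the raw bytes were never UTF-8 at all but GB-encoded text being misread. As a quick manual check, here is a minimal sketch (my own illustration, not from the original question; it assumes the gibberish came from reading GB2312 bytes as Latin-1, and the variable names are hypothetical) that round-trips the mojibake string back to the underlying bytes:

# Under Latin-1 the visible gibberish characters map one-to-one back to
# the original byte values, which can then be decoded as GB2312
mojibake = "Subject: »Ø¸´: ÎÒÉý¼¶µ½"
raw = mojibake.encode('latin-1')
print(raw.decode('gb2312'))  # -> Subject: 回复: 我升级到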

I have a number of files like this, in all kinds of languages.

Is there a way I can detect the encoding of these badly encoded UTF-8 files?

Example text file can be downloaded here: https://gofile.io/d/qMcgkt

Recommended answer

Eventually, I figured it out. Using CharsetNormalizerMatches seems to work, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting the gb18030 encoding for the file in question:

from charset_normalizer import CharsetNormalizerMatches as CnM

# Rank the candidate encodings found in the file and take the best match
encoding = CnM.from_path(path).best().first().encoding
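
A side note, as an assumption on my part rather than part of the original answer: the CharsetNormalizerMatches class was removed from charset-normalizer in its 2.x releases, so on a recent version of the library the equivalent call would look roughly like this:

from charset_normalizer import from_path

# Newer charset-normalizer API (2.x+): best() returns the top-ranked
# match, or None if no plausible encoding was found
best_guess = from_path(path).best()
encoding = best_guess.encoding if best_guess else None

Either way, the detected name can be passed straight to open(path, encoding=encoding) to re-read the file as proper text.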

Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad; I'd love to give them the credit.
