Detect encoding in wrongly encoded UTF-8 text file

Problem description

I am having a problem with encoding.

I have millions of text files that I need to parse for a language data science project. Each text file is encoded as UTF-8, but I just found that some of these source files are not encoded properly.

For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this:

Subject: »Ø¸´: ÎÒÉý¼¶µ½

When I use Python to detect the encoding of this Chinese text file:

Chardet tells me the file is encoded as UTF-8:

import chardet

# Read the raw bytes and let chardet guess the encoding
with open(path, 'rb') as f:
    data = f.read()
encoding = chardet.detect(data)['encoding']

UnicodeDammit also tells me the file is encoded as UTF-8:

from bs4 import UnicodeDammit

# UnicodeDammit also inspects the raw bytes and reports its best guess
with open(path, 'rb') as f:
    data = f.read()
encoding = UnicodeDammit(data).original_encoding

Meanwhile, I know it's not UTF-8; it should be the GB2312 Chinese encoding instead. If I open this file in Notepad++, it is also detected as UTF-8, and all Chinese characters show as gibberish. Only if I manually switch the encoding in Notepad++ to GB2312 do I get the proper text:

Subject: 回复: 我升级到
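
This suggests the raw bytes were never UTF-8 at all but GB-encoded text being misread. As a quick manual check, here is a minimal sketch (my own illustration, not from the original question; it assumes the gibberish came from reading GB2312 bytes as Latin-1, and the variable names are hypothetical) that round-trips the mojibake string back to the underlying bytes:

# Under Latin-1 the visible gibberish characters map one-to-one back to
# the original byte values, which can then be decoded as GB2312
mojibake = "Subject: »Ø¸´: ÎÒÉý¼¶µ½"
raw = mojibake.encode('latin-1')
print(raw.decode('gb2312'))  # -> Subject: 回复: 我升级到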

I have a number of files like this, in all kinds of languages.

Is there a way I can detect the encoding of these badly encoded UTF-8 files?

Example text file can be downloaded here: https://gofile.io/d/qMcgkt

Recommended answer

Eventually, I figured it out. Using CharsetNormalizerMatches seems to work, properly detecting the encoding. Anyway, this is how I implemented it, and it works like a charm, correctly detecting the gb18030 encoding for the file in question:

from charset_normalizer import CharsetNormalizerMatches as CnM

# Rank the candidate encodings found in the file and take the best match
encoding = CnM.from_path(path).best().first().encoding
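
A side note, as an assumption on my part rather than part of the original answer: the CharsetNormalizerMatches class was removed from charset-normalizer in its 2.x releases, so on a recent version of the library the equivalent call would look roughly like this:

from charset_normalizer import from_path

# Newer charset-normalizer API (2.x+): best() returns the top-ranked
# match, or None if no plausible encoding was found
best_guess = from_path(path).best()
encoding = best_guess.encoding if best_guess else None

Either way, the detected name can be passed straight to open(path, encoding=encoding) to re-read the file as proper text.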

Note: The answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad; I'd love to give them the credit.
