Character detection in a text file in Python using the Universal Encoding Detector (chardet)
Question
I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.
While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.
However, I cannot work out how to tell the script to set the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).
My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:
import chardet
rawdata=open(infile,"r").read()
chardet.detect(rawdata)
Character detection is necessary as the script goes on to run the following (as well as several similar uses):
inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()
Any help would be greatly appreciated.
Recommended answer
chardet.detect returns a dictionary which provides the encoding as the value associated with the key 'encoding'. So you can do this:
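import chardet

# Open in binary mode so chardet sees the raw, undecoded bytes.
with open(infile, "rb") as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
charenc = result['encoding']  # can be None if nothing was detected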