Character detection in a text file in Python using the Universal Encoding Detector (chardet)

Question

I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding of a text file ('infile') and use that encoding in further processing.

While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.

However, I cannot work out how to make the script store the most likely character encoding in the variable 'charenc' (which is used several times throughout the script).

My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:

import chardet    
rawdata=open(infile,"r").read()
chardet.detect(rawdata)

Character detection is necessary because the script goes on to run the following (as well as several similar uses):

inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()

Any help would be greatly appreciated.

Recommended answer

chardet.detect returns a dictionary which provides the encoding as the value associated with the key 'encoding'. So you can do this:

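import chardet

# Read the file as raw bytes; chardet works on undecoded data.
with open(infile, "rb") as f:
    rawdata = f.read()
result = chardet.detect(rawdata)  # dict that includes an 'encoding' key
charenc = result['encoding']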
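
One caveat is that the value under 'encoding' can be None when chardet cannot make a guess (for example, on an empty file), so it may be worth falling back to a default. The following is only a sketch for Python 3, where raw.decode(charenc) plays the role of the question's unicode(inF.read(), charenc). The helper names detect_encoding and read_text, the detector parameter, and the 'utf-8' fallback are assumptions made for illustration; they are not part of chardet.

```python
def detect_encoding(raw, fallback="utf-8", detector=None):
    """Return the most likely encoding name for raw bytes.

    Hypothetical helper: 'fallback' and 'detector' are illustrative
    parameters, not part of chardet's API.
    """
    if detector is None:
        # Third-party package (pip install chardet), imported lazily so
        # that a different detector function can be supplied instead.
        import chardet
        detector = chardet.detect
    result = detector(raw)  # dict with at least an 'encoding' key
    # 'encoding' may be None when no guess could be made.
    return result.get("encoding") or fallback


def read_text(path, fallback="utf-8", detector=None):
    """Read a file as bytes, detect its encoding, and decode it."""
    with open(path, "rb") as f:
        raw = f.read()
    charenc = detect_encoding(raw, fallback, detector)
    return raw.decode(charenc), charenc
```

With this in place, a call such as text, charenc = read_text(infile) would give both the decoded text and the encoding name for reuse elsewhere in the script. The detected encoding is a statistical guess, so decoding can still fail or produce wrong characters on short or unusual input.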
