Character detection in a text file in Python using the Universal Encoding Detector (chardet)

Problem description

I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.

While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.

However, I cannot work out how to tell the script to set the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).

My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:

import chardet    
rawdata=open(infile,"r").read()
chardet.detect(rawdata)

Character detection is necessary as the script goes on to run the following (as well as several similar uses):

inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()

Any help would be greatly appreciated.

Recommended answer

chardet.detect() returns a dictionary which provides the encoding as the value associated with the key 'encoding'. So you can do this:

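import chardet

# Read the file as raw bytes; detection works on undecoded data.
with open(infile, 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
charenc = result['encoding']  # can be None if chardet cannot make a guess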
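For the "further processing" step in the question, here is a minimal Python 3 sketch of how the detected encoding might be used to decode the file. In Python 3, bytes.decode() takes the place of the unicode(...) call from the question. The helper name read_text_with_detected_encoding, the detect parameter, and the "utf-8" fallback are assumptions made for illustration; they are not part of chardet.

```python
def read_text_with_detected_encoding(path, detect=None, fallback="utf-8"):
    """Read a file as bytes, guess its encoding, and return (text, encoding).

    `detect` is any callable that takes bytes and returns a dict with an
    'encoding' key, as chardet.detect() does; it defaults to chardet.detect.
    `fallback` is an assumed default used when the detector returns None.
    """
    if detect is None:
        import chardet  # imported lazily so a stand-in detector can be passed
        detect = chardet.detect

    with open(path, "rb") as f:  # binary mode: detection needs raw bytes
        rawdata = f.read()

    charenc = detect(rawdata)["encoding"] or fallback
    # errors="replace" keeps a wrong guess from raising UnicodeDecodeError
    return rawdata.decode(charenc, errors="replace"), charenc
```

Calling text, charenc = read_text_with_detected_encoding(infile) then gives you both the decoded text and the encoding name to reuse elsewhere in the script. Since the detection is only a guess, it may be worth checking the 'confidence' value in the result dictionary before relying on it.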

The chardet documentation does not say explicitly whether the module is meant to work with text strings, byte strings, or both. It stands to reason, though, that if you already have a text string you don't need to run character detection on it, so you should probably be passing byte strings; hence the binary mode flag (b) in the call to open(). chardet.detect() might also work with a text string, depending on which versions of Python and of the library you're using, so if you omit the b you might find that it works anyway even though you're technically doing something wrong.
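The difference between the two modes is easy to see in Python 3, where text mode returns str and binary mode returns bytes. The sketch below does not call chardet at all; read_both_ways and the temporary sample file are hypothetical, created only for this demonstration. As far as I know, recent chardet releases on Python 3 reject str input with a TypeError, so there the b flag is required rather than just recommended.

```python
import os
import tempfile


def read_both_ways(path):
    """Return the file's content read in text mode and in binary mode."""
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()    # str: already decoded by Python
    with open(path, "rb") as f:
        as_bytes = f.read()   # bytes: undecoded, what a detector expects
    return as_text, as_bytes


# Hypothetical sample file, created only for this demonstration.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("na\u00efve caf\u00e9".encode("utf-8"))

as_text, as_bytes = read_both_ways(sample_path)
os.remove(sample_path)

print(type(as_text).__name__, len(as_text))    # str 10
print(type(as_bytes).__name__, len(as_bytes))  # bytes 12
```

The text-mode read has already been decoded with whatever encoding open() was told (or defaulted) to use, so there is nothing left to detect; only the binary read preserves the original bytes.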
