Detect character encoding

Problem description

Hello,
is there any way to detect string encoding in Python?

I need to process several files. Each of them could be encoded in
a different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string function encode).

Thank you for any answer
Regards
Michal

Solution

Michal wrote:

> Hello,
> is there any way to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> a different charset (iso-8859-2, cp1250, etc.). I want to detect it and
> encode it to utf-8 (with the string function encode).
>
> Thank you for any answer
> Regards
> Michal


The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_
of languages.

--Scott David Daniels
sc***********@acm.org
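
Option (1) above can be sketched as follows. This is a minimal illustration, assuming Python 3, where the conversion is a decode followed by an encode rather than a single string `encode` call. The helper names `to_utf8` and `recode_file` are made up for this example and do not come from the thread.

```python
from pathlib import Path


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    # Strict decoding raises UnicodeDecodeError if the bytes are not
    # valid in source_encoding, so some wrong assumptions fail loudly.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


def recode_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Read src as source_encoding and write the same text to dst as UTF-8."""
    dst.write_bytes(to_utf8(src.read_bytes(), source_encoding))


if __name__ == "__main__":
    sample = "žluťoučký kůň".encode("iso-8859-2")
    # Print the UTF-8 bytes rather than the text to stay console-safe.
    print(to_utf8(sample, "iso-8859-2"))
```

Note that a wrong `source_encoding` does not always raise an error; it can just as well produce valid UTF-8 containing the wrong characters, which is why the encoding has to be known or guessed first.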


Michal wrote:

> Hello,
> is there any way to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> a different charset (iso-8859-2, cp1250, etc.). I want to detect it and
> encode it to utf-8 (with the string function encode).



You can only guess, e.g. by looking for words that contain characters such as umlauts.
Recode might be of help here; it has such heuristics built in, AFAIK.

But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each file
is "legal" in all encodings.
Diez
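
The kind of guessing described above might look like the sketch below. It is not how Recode works; it is only a toy heuristic, assuming Python 3 and assuming the files hold Czech text (the thread never says which language Michal's files use). `CZECH_LETTERS`, `score_decoding` and `guess_encoding` are hypothetical names. Each candidate decoding gets one point per expected accented letter and loses one point per stray control character.

```python
import unicodedata
from typing import Optional, Sequence

# Assumption: the text is Czech; swap in the letters of the expected language.
CZECH_LETTERS = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"


def score_decoding(text: str, expected_letters: str) -> int:
    """Count expected letters and subtract unexpected control characters."""
    score = 0
    for ch in text:
        if ch in expected_letters:
            score += 1
        elif unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            # Control characters inside ordinary text hint at a wrong decoding.
            score -= 1
    return score


def guess_encoding(
    data: bytes, candidates: Sequence[str], expected_letters: str
) -> Optional[str]:
    """Return the best-scoring candidate, or None if no candidate decodes."""
    best_name, best_score = None, None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the data is not valid in this candidate at all
        score = score_decoding(text, expected_letters)
        # Strict ">" means ties go to the earlier candidate.
        if best_score is None or score > best_score:
            best_name, best_score = name, score
    return best_name
```

For example, `guess_encoding(data, ["iso-8859-2", "cp1250"], CZECH_LETTERS)` picks whichever of the two yields more Czech letters. Pure ASCII input scores the same under both, so the first candidate wins by default, which illustrates that the result is a guess and not a detection.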


"Diez B. Roggisch" <de***@nospam.web.de> writes:

> Michal wrote:
>
>> is there any way to detect string encoding in Python?
>> I need to process several files. Each of them could be encoded in
>> a different charset (iso-8859-2, cp1250, etc.). I want to detect it
>> and encode it to utf-8 (with the string function encode).
>
> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
> file is "legal" in all encodings.



Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
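
The elimination step described above can be sketched like this, assuming Python 3 and its standard codecs. The function name `plausible_encodings` is made up for the example. It keeps only the candidates under which the bytes decode without error.

```python
from typing import List, Sequence


def plausible_encodings(data: bytes, candidates: Sequence[str]) -> List[str]:
    """Return the candidates under which data decodes without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue  # data has a byte sequence this encoding does not define
        survivors.append(name)
    return survivors


if __name__ == "__main__":
    sample = "žluťoučký kůň".encode("cp1250")
    print(plausible_encodings(sample, ["ascii", "utf-8", "iso-8859-2", "cp1250"]))
```

With Python's codecs, this filter tends to rule out ASCII and UTF-8 quickly, since both reject many byte sequences. It does little to separate iso-8859-2 from cp1250: as far as I know, Python's iso-8859-2 codec assigns a character to every byte value, and cp1250 leaves only a few bytes undefined. That matches the remark that elimination doesn't help much by itself.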

