An attempt at guessing the encoding of a (non-unicode) string


Question


This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)

1a. the sets can be constructed by trial and error:

def valid_bytes(encoding):
    # Python 2: chr(byte) yields a one-byte str that we try to decode.
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
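For reference, in Python 3 (where chr() returns a text character rather than a
byte) the same trial-and-error table could be built like this; a sketch, not
part of the original post:

```python
def valid_bytes(encoding):
    """Return the set of byte values that decode successfully
    under the given encoding (Python 3 sketch of step 1a)."""
    result = set()
    for byte in range(256):
        try:
            bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(byte)
    return result

# ASCII defines only bytes 0-127; latin-1 decodes all 256.
```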

2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)

2a. the following function is a quick generator of all two-char
sequences from its string argument. It can be used both for the production
of the pre-calculated data and for the analysis of a given string in the
'wild_guess' function.

import itertools

def str_window(text):
    return itertools.imap(
        text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
    )

So for every encoding and 'representative' text, a bag of two-char
sequences and their frequencies is calculated. {frequencies[encoding] =
dict(key: two-chars, value: count)}
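In Python 3 terms (itertools.imap and xrange are gone), the window generator
and the frequency bag of step 2 could be sketched as:

```python
from collections import Counter

def str_window(text):
    # Generator of all overlapping two-char sequences of text.
    return (text[i:i + 2] for i in range(len(text) - 1))

def bigram_frequencies(text):
    # The "bag" of step 2: two-char sequence -> count.
    return Counter(str_window(text))

# bigram_frequencies("banana") -> {'an': 2, 'na': 2, 'ba': 1}
```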

2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.

2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as python code in a helper
module to be imported from codecs.py for the wild_guess function
(reproduce the helper module every time some 'representative' text is
added or modified).

3. write the wild_guess function

3a. the function 'wild_guess' would first construct a set from its
argument:

sample_set = set(argument)

and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.

3b. pass the argument through the str_window function, and construct a
set of all two-char sequences

3c. from all sets from step 2c, find the one whose intersection with the
set from 3b is largest as a ratio of len(intersection)/len(encoding_set),
and suggest the relevant encoding.
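Putting 3a-3c together, a minimal wild_guess sketch; the valid_sets and
bigram_sets arguments stand in for the precomputed data of steps 1a and 2c
and are assumptions of this sketch:

```python
def wild_guess(argument, valid_sets, bigram_sets):
    """argument: the string to classify; valid_sets and bigram_sets:
    dicts of encoding -> set, as precomputed in steps 1a and 2c."""
    sample_set = set(argument)
    # 3a: drop encodings for which some character is not valid.
    candidates = [enc for enc, valid in valid_sets.items()
                  if sample_set <= valid]
    # 3b: the set of all two-char windows of the argument.
    windows = {argument[i:i + 2] for i in range(len(argument) - 1)}
    # 3c: rank candidates by len(intersection) / len(encoding_set).
    def score(enc):
        enc_set = bigram_sets[enc]
        return len(windows & enc_set) / len(enc_set) if enc_set else 0.0
    return max(candidates, key=score) if candidates else None
```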

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix

Answers

Christos TZOTZIOY Georgiou wrote:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: .... What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)




The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.


On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
<j.***********@verizon.dot.net> might have written:
Christos TZOTZIOY Georgiou wrote: <snip>

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:


...
<snip>

[Jon] The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.





Thanks for the hint, and I am browsing the documentation now. However,
I'd like to create something that would not be dependent on external
python libraries, so that anyone interested would just download a small
module that would do the job, hopefully well.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix


I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
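The heuristic described above could be sketched as follows; the default
candidate list and the penalty value are illustrative assumptions, not from
the post:

```python
def guess_encoding(data, candidates=("utf-8", "latin-1", "ascii")):
    """Pick the candidate encoding whose decoded text has the most
    alphabetic/whitespace characters; failed decodes get a big penalty."""
    def score(encoding):
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            return -10 ** 6  # large penalty for an invalid decode
        return sum(1 for c in text if c.isalpha() or c.isspace())
    return max(candidates, key=score)
```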

