An attempt at guessing the encoding of a (non-unicode) string
Question
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:
1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)
1a. the sets can be constructed by trial and error:
def valid_bytes(encoding):
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
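For anyone trying this on a modern interpreter: Python 3 removed str.decode, so the same routine might be sketched as below (a sketch only; the name valid_bytes_py3 is illustrative, and it collects byte values rather than chars):

```python
def valid_bytes_py3(encoding):
    """Return the set of single byte values that decode under `encoding`."""
    result = set()
    for byte in range(256):
        try:
            bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(byte)
    return result

# latin-1 accepts every byte; ascii only the low half.
assert len(valid_bytes_py3('latin-1')) == 256
assert len(valid_bytes_py3('ascii')) == 128
```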
2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)
2a. the following function is a quick generator of all two-char
sequences from its string argument. It can be used both for the production
of the pre-calculated data and for the analysis of a given string in the
'wild_guess' function.
import itertools

def str_window(text):
    return itertools.imap(
        text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
    )
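For reference, a Python 3 equivalent of the windowing function (the name str_window_py3 is illustrative) can be written as a plain generator expression, since imap and __getslice__ are gone:

```python
def str_window_py3(text):
    """Yield every two-character window of `text` (Python 3 sketch)."""
    return (text[i:i + 2] for i in range(len(text) - 1))

# list(str_window_py3("abcd")) == ['ab', 'bc', 'cd']
```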
So for every encoding and 'representative' text, a bag of two-char
sequences and their frequencies is calculated. {frequencies[encoding] =
dict(key: two-chars, value: count)}
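The frequency bag described above can be sketched with collections.Counter (Python 3; bigram_bag is an illustrative name, not from the original post):

```python
from collections import Counter

def bigram_bag(text):
    """Frequency bag of all two-char sequences of `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# bigram_bag("banana") == {'ba': 1, 'an': 2, 'na': 2}
```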
2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.
2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as python code in a helper
module to be imported from codecs.py for the wild_guess function
(reproduce the helper module every time some 'representative' text is
added or modified).
3. write the wild_guess function
3a. the function 'wild_guess' would first construct a set from its
argument:
sample_set = set(argument)
and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.
3b. pass the argument through the str_window function, and construct a
set of all two-char sequences
3c. from all sets from step 2c, find the one whose intersection with the
set from 3b is largest as a ratio of len(intersection)/len(encoding_set),
and suggest the relevant encoding.
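Steps 3a-3c might be sketched as follows (Python 3; valid_sets and signature_sets stand in for the precomputed data from steps 1a and 2c, and all names here are illustrative, not from the original proposal):

```python
def wild_guess(sample, valid_sets, signature_sets):
    """Guess an encoding for `sample`.

    `valid_sets` maps encoding -> set of valid chars (step 1a);
    `signature_sets` maps encoding -> set of 'representative'
    two-char sequences (step 2c).  Both are assumed precomputed.
    """
    sample_set = set(sample)
    # 3a: exclude codecs whose valid set does not cover the sample.
    candidates = [enc for enc, valid in valid_sets.items()
                  if sample_set <= valid]
    # 3b: the set of all two-char windows of the sample.
    windows = {sample[i:i + 2] for i in range(len(sample) - 1)}
    # 3c: pick the candidate with the largest overlap ratio.
    def score(enc):
        signature = signature_sets[enc]
        return len(windows & signature) / len(signature) if signature else 0.0
    return max(candidates, key=score, default=None)
```

With toy data, an encoding whose valid set fails to cover the sample is dropped in 3a, and ties among survivors are broken by the 3c ratio.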
What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)
PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Answers
Christos TZOTZIOY Georgiou wrote:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: .... What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)
The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:
<http://nltk.sf.net/>
In particular, check out the probability tutorial.
On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
<j.***********@verizon.dot.net> might have written:
Christos TZOTZIOY Georgiou wrote: <snip>
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:
...
< snip>
<snip>
[Jon] The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:
<http://nltk.sf.net/>
In particular, check out the probability tutorial.
Thanks for the hint, and I am browsing the documentation now. However,
I'd like to create something that would not be dependent on external
python libraries, so that anyone interested would just download a small
module that would do the job, hopefully well.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
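The counting heuristic described above might be sketched like this (Python 3; the function name and the candidate encoding list are illustrative, not from the original post):

```python
def simple_guess(data, encodings=('utf-8', 'latin-1', 'cp1252')):
    """Guess the encoding of byte string `data`: decode with each
    candidate, count alphabetic and whitespace characters, apply a
    large penalty on UnicodeDecodeError, take the best score."""
    def score(encoding):
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            return -10 ** 6  # large penalty for undecodable input
        return sum(1 for c in text if c.isalpha() or c.isspace())
    return max(encodings, key=score)
```

Ties go to the earlier entry in the candidate list, so putting stricter codecs like utf-8 first is a sensible default.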
--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science