An attempt at guessing the encoding of a (non-unicode) string
Question
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:
1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)
1a. the sets can be constructed by trial and error:
def valid_bytes(encoding):
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
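For anyone trying this on a modern interpreter: Python 3 removed str.decode, so the same routine might be sketched as below (a sketch only; the name valid_bytes_py3 is illustrative, and it collects byte values rather than chars):

```python
def valid_bytes_py3(encoding):
    """Return the set of single byte values that decode under `encoding`."""
    result = set()
    for byte in range(256):
        try:
            bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(byte)
    return result

# latin-1 accepts every byte; ascii only the low half.
assert len(valid_bytes_py3('latin-1')) == 256
assert len(valid_bytes_py3('ascii')) == 128
```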
2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)
2a. the following function is a quick generator of all two-char
sequences from its string argument. It can be used both for the production
of the pre-calculated data and for the analysis of a given string in the
'wild_guess' function.
import itertools

def str_window(text):
    return itertools.imap(
        text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
    )
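For reference, a Python 3 equivalent of the windowing function (the name str_window_py3 is illustrative) can be written as a plain generator expression, since imap and __getslice__ are gone:

```python
def str_window_py3(text):
    """Yield every two-character window of `text` (Python 3 sketch)."""
    return (text[i:i + 2] for i in range(len(text) - 1))

# list(str_window_py3("abcd")) == ['ab', 'bc', 'cd']
```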
So for every encoding and 'representative' text, a bag of two-char
sequences and their frequencies is calculated. {frequencies[encoding] =
dict(key: two-chars, value: count)}
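The frequency bag described above can be sketched with collections.Counter (Python 3; bigram_bag is an illustrative name, not from the original post):

```python
from collections import Counter

def bigram_bag(text):
    """Frequency bag of all two-char sequences of `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# bigram_bag("banana") == {'ba': 1, 'an': 2, 'na': 2}
```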
2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.
2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as python code in a helper
module to be imported from codecs.py for the wild_guess function
(reproduce the helper module every time some 'representative' text is
added or modified).
3. write the wild_guess function
3a. the function 'wild_guess' would first construct a set from its
argument:
sample_set = set(argument)
and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.
3b. pass the argument through the str_window function, and construct a
set of all two-char sequences
3c. from all sets from step 2c, find the one whose intersection with the
set from 3b is largest as a ratio of len(intersection)/len(encoding_set),
and suggest the relevant encoding.
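Steps 3a-3c might be sketched as follows (Python 3; valid_sets and signature_sets stand in for the precomputed data from steps 1a and 2c, and all names here are illustrative, not from the original proposal):

```python
def wild_guess(sample, valid_sets, signature_sets):
    """Guess an encoding for `sample`.

    `valid_sets` maps encoding -> set of valid chars (step 1a);
    `signature_sets` maps encoding -> set of 'representative'
    two-char sequences (step 2c).  Both are assumed precomputed.
    """
    sample_set = set(sample)
    # 3a: exclude codecs whose valid set does not cover the sample.
    candidates = [enc for enc, valid in valid_sets.items()
                  if sample_set <= valid]
    # 3b: the set of all two-char windows of the sample.
    windows = {sample[i:i + 2] for i in range(len(sample) - 1)}
    # 3c: pick the candidate with the largest overlap ratio.
    def score(enc):
        signature = signature_sets[enc]
        return len(windows & signature) / len(signature) if signature else 0.0
    return max(candidates, key=score, default=None)
```

With toy data, an encoding whose valid set fails to cover the sample is dropped in 3a, and ties among survivors are broken by the 3c ratio.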
What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)
PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Answers
Christos TZOTZIOY Georgiou wrote:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows: .... What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)
The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:
<http://nltk.sf.net/>
In particular, check out the probability tutorial.
On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
<j.***********@verizon.dot.net> might have written:
Christos TZOTZIOY Georgiou wrote: <snip>
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:
...
< snip>
<snip>
[Jon] The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:
<http://nltk.sf.net/>
In particular, check out the probability tutorial.
Thanks for the hint, and I am browsing the documentation now. However,
I'd like to create something that would not be dependent on external
python libraries, so that anyone interested would just download a small
module that would do the job, hopefully well.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
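The counting heuristic described above might be sketched like this (Python 3; the function name and the candidate encoding list are illustrative, not from the original post):

```python
def simple_guess(data, encodings=('utf-8', 'latin-1', 'cp1252')):
    """Guess the encoding of byte string `data`: decode with each
    candidate, count alphabetic and whitespace characters, apply a
    large penalty on UnicodeDecodeError, take the best score."""
    def score(encoding):
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            return -10 ** 6  # large penalty for undecodable input
        return sum(1 for c in text if c.isalpha() or c.isspace())
    return max(encodings, key=score)
```

Ties go to the earlier entry in the candidate list, so putting stricter codecs like utf-8 first is a sensible default.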
--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science