Get a list of all the encodings Python can encode to

Question

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?

I'm trying to do this because a user has some text that is not encoded correctly: it contains funny characters. I know the unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.

i.e. something like this:

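import itertools

# encodinglist() is a placeholder for the list of codec names being asked about
for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
  try:
    unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
  except Exception:
    pass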

Answer

Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer.

aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time by using set(aliases.values()) instead of aliases.keys().

BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).

>>> from encodings.aliases import aliases
>>> def find(q):
...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
...
>>> find('1252') # multiple aliases
[('1252', 'cp1252'), ('windows_1252', 'cp1252')]
>>> find('856') # no codepage 856 in aliases
[]
>>> find('koi8') # no koi8_u in aliases
[('cskoi8r', 'koi8_r')]
>>> 'x'.decode('cp856') # but cp856 is a valid codec
u'x'
>>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
u'x'
>>>
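
One way to also pick up the codecs that have no alias is to walk the modules in the standard library's encodings package and keep the names that codecs.lookup accepts. The following is a minimal sketch of that idea, not part of the original answer. It is written for Python 3 (the session above is Python 2), and the function name all_codec_names is made up for illustration.

```python
import codecs
import encodings
import pkgutil
from encodings.aliases import aliases


def all_codec_names():
    """Return sorted module-style names of every codec codecs.lookup accepts."""
    # Candidates: alias targets plus every module shipped in the encodings package.
    candidates = set(aliases.values())
    for _finder, name, _is_pkg in pkgutil.iter_modules(encodings.__path__):
        candidates.add(name)

    usable = []
    for name in sorted(candidates):
        try:
            codecs.lookup(name)
        except (LookupError, ImportError):
            # Not a codec (e.g. the aliases module itself) or not available
            # on this platform, so leave it out.
            continue
        usable.append(name)
    return usable
```

Calling all_codec_names() should return names such as cp856 and koi8_u alongside the aliased ones. The list depends on the Python version and platform, and it still includes the non-character codecs.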

It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets, but do some other transformation e.g. zlib, quopri, and base64.
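
A rough sketch of that filtering step, assuming codec names are given as module-style names (for example zlib_codec rather than zlib). The exclusion set below is an assumption drawn from the standard library's binary and text transform codecs. It is not an official list, and the names NON_CHARSET_CODECS and charset_codecs are hypothetical.

```python
# Codecs that transform data rather than map characters to bytes.
# Assumption: this set is illustrative and may need extending.
NON_CHARSET_CODECS = frozenset([
    "base64_codec",
    "bz2_codec",
    "hex_codec",
    "quopri_codec",
    "rot_13",
    "uu_codec",
    "zlib_codec",
])


def charset_codecs(names):
    """Keep only the names that are not known transform codecs, in order."""
    return [name for name in names if name not in NON_CHARSET_CODECS]
```

Depending on the goal, Python-specific codecs such as unicode_escape, idna, and punycode may also be worth excluding.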

Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve? Are you trying to determine which codec to use to decode some incoming bytes, and planning to attempt this with all possible codecs? [note: latin1 will decode anything] Or are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything]
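
If the goal is the one described in the question (explaining which pair of encodings turned a known character into the garbage the user sees), a search over codec pairs could look like the sketch below. It assumes Python 3 and that both the intended text and the observed garbled text are available as strings. The function name find_misreadings and the candidate list are hypothetical.

```python
from itertools import permutations


def find_misreadings(intended, observed, codec_names):
    """Return (written_as, read_as) pairs that turn `intended` into `observed`."""
    matches = []
    for written_as, read_as in permutations(codec_names, 2):
        try:
            # ENcode text to bytes with one codec, DEcode the bytes with another.
            garbled = intended.encode(written_as).decode(read_as)
        except (UnicodeError, LookupError):
            # The character is not representable, the bytes are invalid for
            # the second codec, or a codec name is unknown.
            continue
        if garbled == observed:
            matches.append((written_as, read_as))
    return matches


# Small illustrative candidate list; a real run would use a fuller list.
candidates = ["utf_8", "cp1252", "latin_1", "cp437"]
pairs = find_misreadings("\u00e9", "\u00c3\u00a9", candidates)
```

Several pairs can match the same garbled output, so the result narrows down the possibilities rather than identifying a single cause. Restricting the candidates to character-set codecs keeps the search meaningful.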
