Get a list of all the encodings Python can encode to


Question

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?

The reason I'm trying to do this is because a user has some text that is not encoded correctly. There are funny characters. I know the unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.

i.e. something like this:

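import itertools

# encodinglist() is a placeholder for the list of codec names being asked about.
for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
  try:
    unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
  except Exception:
    pass  # skip pairs that cannot encode or decode this character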

Answer

Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer.

aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time by using set(aliases.values()) instead of aliases.keys().

BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).

>>> from encodings.aliases import aliases
>>> def find(q):
...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
...
>>> find('1252') # multiple aliases
[('1252', 'cp1252'), ('windows_1252', 'cp1252')]
>>> find('856') # no codepage 856 in aliases
[]
>>> find('koi8') # no koi8_u in aliases
[('cskoi8r', 'koi8_r')]
>>> 'x'.decode('cp856') # but cp856 is a valid codec
u'x'
>>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
u'x'
>>>
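
One way to get closer to a full list, offered as a sketch rather than an official API: each standard codec is a module inside the encodings package, so the module names can be enumerated with pkgutil and merged with the alias targets. The helper name all_codec_names is made up for illustration, and codecs.lookup is used to drop entries that are not usable codecs on the current platform.

```python
import codecs
import encodings
import pkgutil
from encodings.aliases import aliases


def all_codec_names():
    """Return sorted codec names that codecs.lookup() accepts (hypothetical helper)."""
    # Each standard codec is a module inside the encodings package,
    # including ones such as cp856 that have no alias entry.
    names = set(
        name
        for _, name, is_pkg in pkgutil.iter_modules(encodings.__path__)
        if not is_pkg
    )
    # Alias targets are merged in as a fallback source of names.
    names.update(aliases.values())
    usable = []
    for name in sorted(names):
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # helper module (e.g. aliases) or a platform-only codec
        usable.append(name)
    return usable


codec_names = all_codec_names()
```

The result still contains non-character codecs such as zlib_codec, so it usually needs a further filter before being used for text.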

It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets, but do some other transformation e.g. zlib, quopri, and base64.
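
A possible way to do that filtering, assuming Python 3 (not the Python 2.6 of the question): there, str.encode() refuses codecs that are not text encodings by raising LookupError, so a round trip of a single ASCII character works as a best-effort test. The function name is_text_encoding is an illustrative choice, not a library function.

```python
def is_text_encoding(name):
    """Best-effort check that `name` converts str to bytes and back (Python 3)."""
    try:
        # Non-text codecs (zlib_codec, base64_codec, ...) and unknown names
        # raise LookupError here; codecs such as 'undefined' raise UnicodeError.
        return "a".encode(name).decode(name) == "a"
    except (LookupError, UnicodeError):
        return False


text_only = [n for n in ("cp1252", "koi8_u", "zlib_codec", "base64_codec")
             if is_text_encoding(n)]
```

On Python 2 this check would not work, because 'a'.encode('zlib') succeeds there; an explicit exclusion list would be needed instead.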

Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve? Are you trying to determine which codec to use to decode some incoming bytes, and planning to attempt this with all possible codecs? [note: latin1 will decode anything] Or are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything]
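
If the goal is the one described in the question (explaining which pair of encodings turned a known character into the garbage the user sees), a small sketch along these lines may be enough. The candidate list and the function name explain_mojibake are assumptions made for illustration; a short list of likely encodings is used instead of every codec.

```python
import itertools

# Hypothetical shortlist; extend it with whatever encodings are plausible.
CANDIDATES = ["utf_8", "cp1252", "latin_1", "cp437", "mac_roman"]


def explain_mojibake(intended, observed, candidates=CANDIDATES):
    """Yield (written_as, read_as) pairs that turn `intended` into `observed`."""
    for written_as, read_as in itertools.permutations(candidates, 2):
        try:
            if intended.encode(written_as).decode(read_as) == observed:
                yield written_as, read_as
        except UnicodeError:
            continue  # this pair cannot represent the text at all


# U+00E9 (e with acute) showing up as the two characters U+00C3 U+00A9.
pairs = list(explain_mojibake(u"\u00e9", u"\u00c3\u00a9"))
# pairs == [('utf_8', 'cp1252'), ('utf_8', 'latin_1')]
```

Both cp1252 and latin_1 are reported as the reading encoding. This is the caveat from the note above in practice: a codec that decodes anything will often produce a match, so more than one answer should be expected.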
