How can I programmatically find the list of codecs known to Python?
Problem description
I know that I can do the following:
>>> import encodings, pprint
>>> pprint.pprint(sorted(encodings.aliases.aliases.values()))
['ascii',
'base64_codec',
'big5',
'big5hkscs',
'bz2_codec',
'cp037',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp424',
'cp437',
'cp500',
'cp775',
'cp850',
'cp852',
'cp855',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp932',
'cp949',
'cp950',
'euc_jis_2004',
'euc_jisx0213',
'euc_jp',
'euc_kr',
'gb18030',
'gb2312',
'gbk',
'hex_codec',
'hp_roman8',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'johab',
'koi8_r',
'latin_1',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'mbcs',
'ptcp154',
'quopri_codec',
'rot_13',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'tactis',
'tis_620',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_7',
'utf_8',
'uu_codec',
'zlib_codec']
I also know for sure that this is not a complete list, since it only includes encodings for which an alias exists (e.g. "cp737" is missing), and at least some pseudo-encodings are missing (e.g. "string_escape").
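A quick check (my illustration, not part of the original question) confirms the point for "cp737": it never shows up in the alias table, yet it is a perfectly usable codec that the on-demand search machinery finds:

```python
import codecs
import encodings.aliases

# 'cp737' has no alias, so aliases.values() does not mention it...
print('cp737' in encodings.aliases.aliases.values())
# ...but codecs.lookup() still finds the codec module on demand:
print(codecs.lookup('cp737').name)
```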
As the title of the question says: how can I programmatically get a list of all codecs/encodings known to Python?
If not programmatically: is there a complete list available online?
I don't think the complete list is stored anywhere in the Python standard library. Instead, encodings are loaded on demand through calls to encodings.search_function(encoding). If you study the code there, it looks like the encoding string is first normalized, and then the encodings package is searched for a submodule whose name matches encoding.
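The normalization step can be observed directly through codecs.lookup(), which accepts many spellings of the same name (a small illustration of the search machinery, not part of the original answer):

```python
import codecs

# Case, spaces and hyphens are all normalized before the search,
# so these spellings resolve to the same codec modules:
print(codecs.lookup('UTF-8').name)    # normalized codec name
print(codecs.lookup('Latin 1').name)  # resolves via the latin_1 module
print(codecs.lookup('latin-1').name)  # same codec, different spelling
```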
The following uses pkgutil to list all the submodules of the encodings package, and then adds them to the names listed in encodings.aliases.aliases.
Unfortunately, encodings.aliases.aliases contains one encoding, tactis, that is not generated by the above, so I generate the complete list by taking the union of the two sets.
import encodings
import os
import pkgutil
modnames = set(modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix=''))
aliases = set(encodings.aliases.aliases.values())
print(modnames-aliases)
# set(['charmap', 'unicode_escape', 'cp1006', 'unicode_internal', 'punycode', 'string_escape', 'aliases', 'palmos', 'mac_centeuro', 'mac_farsi', 'mac_romanian', 'cp856', 'raw_unicode_escape', 'mac_croatian', 'utf_8_sig', 'mac_arabic', 'undefined', 'cp737', 'idna', 'koi8_u', 'cp875', 'cp874', 'iso8859_1'])
print(aliases-modnames)
# set(['tactis'])
codec_names = modnames.union(aliases)
print(codec_names)
# set(['bz2_codec', 'cp1140', 'euc_jp', 'cp932', 'punycode', 'euc_jisx0213', 'aliases', 'hex_codec', 'cp500', 'uu_codec', 'big5hkscs', 'mac_romanian', 'mbcs', 'euc_jis_2004', 'iso2022_jp_3', 'iso2022_jp_2', 'iso2022_jp_1', 'gbk', 'iso2022_jp_2004', 'unicode_internal', 'utf_16_be', 'quopri_codec', 'cp424', 'iso2022_jp', 'mac_iceland', 'raw_unicode_escape', 'hp_roman8', 'iso2022_kr', 'cp875', 'iso8859_6', 'cp1254', 'utf_32_be', 'gb2312', 'cp850', 'shift_jis', 'cp852', 'cp855', 'iso8859_3', 'cp857', 'cp856', 'cp775', 'unicode_escape', 'cp1026', 'mac_latin2', 'utf_32', 'mac_cyrillic', 'base64_codec', 'ptcp154', 'palmos', 'mac_centeuro', 'euc_kr', 'hz', 'utf_8', 'utf_32_le', 'mac_greek', 'utf_7', 'mac_turkish', 'utf_8_sig', 'mac_arabic', 'tactis', 'cp949', 'zlib_codec', 'big5', 'iso8859_9', 'iso8859_8', 'iso8859_5', 'iso8859_4', 'iso8859_7', 'cp874', 'iso8859_1', 'utf_16_le', 'iso8859_2', 'charmap', 'gb18030', 'cp1006', 'shift_jis_2004', 'mac_roman', 'ascii', 'string_escape', 'iso8859_15', 'iso8859_14', 'tis_620', 'iso8859_16', 'iso8859_11', 'iso8859_10', 'iso8859_13', 'cp950', 'utf_16', 'cp869', 'mac_farsi', 'rot_13', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'shift_jisx0213', 'johab', 'mac_croatian', 'cp1255', 'latin_1', 'cp1257', 'cp1256', 'cp1251', 'cp1250', 'cp1253', 'cp1252', 'cp437', 'cp1258', 'undefined', 'cp737', 'koi8_r', 'cp037', 'koi8_u', 'iso2022_jp_ext', 'idna'])
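One caveat worth adding (my note, not from the original answer): the union also contains names such as 'aliases', which is a submodule of the encodings package but not a codec. A sketch that filters the candidates down to names codecs.lookup() actually accepts:

```python
import codecs
import encodings
import os
import pkgutil

# Candidate names: submodules of the encodings package plus alias targets.
modnames = {name for _, name, _ in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')}
candidates = modnames | set(encodings.aliases.aliases.values())

valid = set()
for name in sorted(candidates):
    try:
        codecs.lookup(name)
    except LookupError:
        continue  # e.g. 'aliases' is a submodule, not a codec
    valid.add(name)

print(len(valid), 'working codec names')
```

Platform-specific names such as 'mbcs' simply fail the lookup on other systems and are dropped, so the result reflects the codecs usable on the current interpreter.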