Best way to decode command line inputs to Unicode Python 2.7 scripts


Question

All my scripts use Unicode literals throughout, with

from __future__ import unicode_literals

but this creates a problem when there is the potential for functions being called with bytestrings, and I'm wondering what the best approach is for handling this and producing clear helpful errors.

I gather that one common approach, which I've adopted, is to simply make this clear when it occurs, with something like

def my_func(somearg):
    """The 'somearg' argument must be Unicode."""
    # Check the actual parameter name (the original tested an undefined 'arg').
    if not isinstance(somearg, unicode):
        raise TypeError("Parameter 'somearg' should be Unicode")
    # ...

for all arguments that need to be Unicode (and might be bytestrings). However even if I do this, I encounter problems with my argparse command line script if supplied parameters correspond to such arguments, and I wonder what the best approach here is. It seems that I can simply check the encoding of such arguments, and decode them using that encoding, with, for example

if __name__ == '__main__':
    parser = argparse.ArgumentParser(...)
    parser.add_argument('somearg', ...)
    # ...

    args = parser.parse_args()
    some_arg = args.somearg
    if not isinstance(some_arg, unicode):
        some_arg = some_arg.decode(sys.getfilesystemencoding())

    #...
    my_func(some_arg, ...)

Is this combination of approaches a common design pattern for Unicode modules that may receive bytestring inputs? Specifically,

  • can I reliably decode command line arguments in this way, and
  • will sys.getfilesystemencoding() give me the correct encoding for command line arguments; or
  • does argparse provide some builtin facility for accomplishing this that I've missed?

Solution

I don't think sys.getfilesystemencoding() will necessarily give the right encoding for the shell. That depends on the shell (and can be customised by the shell, independently of the filesystem). The filesystem encoding only concerns how non-ASCII filenames are stored.

Instead, you should probably be looking at sys.stdin.encoding, which will give you the encoding for standard input.
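One caveat worth sketching: on Python 2, sys.stdin.encoding can be None when standard input is not a terminal (for example when input is piped), so a fallback may be needed. The helper below is a hypothetical illustration, not part of the original answer; the name pick_arg_encoding, the fallback to locale.getpreferredencoding(), and the 'utf-8' default are all assumptions.

```python
import locale
import sys


def pick_arg_encoding(stream=None, default='utf-8'):
    """Return an encoding name to try for command line arguments.

    Hypothetical helper: prefers the stream's encoding, then the locale's
    preferred encoding, then `default` (an assumed last resort).
    """
    if stream is None:
        stream = sys.stdin
    # The attribute may be missing or None, e.g. for piped stdin on Python 2.
    encoding = getattr(stream, 'encoding', None)
    if not encoding:
        encoding = locale.getpreferredencoding()
    return encoding or default
```

Whether the locale is the right fallback depends on how the shell is configured, so treat the result as a best guess rather than a guarantee.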

Additionally, you might consider using the type keyword argument when you add an argument:

import sys
import argparse as ap

def foo(str_, encoding=sys.stdin.encoding):
    return str_.decode(encoding)

parser = ap.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=foo)
args = parser.parse_args()

print repr(args)

Demo:

$ python spam.py abc hello
usage: spam.py [-h] my_int my_arg
spam.py: error: argument my_int: invalid int value: 'abc'
$ python spam.py 123 hello
Namespace(my_arg=u'hello', my_int=123)
$ python spam.py 123 ollǝɥ
Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)
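The converter above raises an uncaught UnicodeDecodeError on bytes that do not match the encoding. A slightly more defensive variant, as a sketch only, could report the failure through argparse's own error path. The names unicode_arg and build_parser are hypothetical; argparse.ArgumentTypeError is the standard exception that argparse turns into a usage error.

```python
import argparse

text_type = type(u'')  # unicode on Python 2, str on Python 3


def unicode_arg(encoding):
    """Build an argparse `type=` callable that decodes byte strings."""
    def convert(value):
        if isinstance(value, text_type):
            return value  # already text, nothing to decode
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            # argparse reports this as a normal "invalid argument" error.
            raise argparse.ArgumentTypeError(
                'cannot decode %r as %s' % (value, encoding))
    return convert


def build_parser(encoding):
    """Same two positional arguments as the example above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('my_int', type=int)
    parser.add_argument('my_arg', type=unicode_arg(encoding))
    return parser
```

Passing the encoding in explicitly, rather than as a default argument, avoids freezing sys.stdin.encoding at function definition time.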

If you have to work with non-ASCII data a lot, I would highly recommend upgrading to Python 3. Everything is a lot easier there; for example, parsed arguments will already be Unicode on Python 3.


Since there is conflicting information about the command line argument encoding around, I decided to test it by changing my shell encoding to latin-1 whilst leaving the file system encoding as utf-8. For my tests I use the c-cedilla character which has a different encoding in these two:

>>> u'Ç'.encode('ISO8859-1')
'\xc7'
>>> u'Ç'.encode('utf-8')
'\xc3\x87'

Now I create an example script:

#!/usr/bin/python2.7
import argparse as ap
import sys

print 'sys.stdin.encoding is ', sys.stdin.encoding
print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()

def encoded(s):
    print 'encoded', repr(s)
    return s

def decoded_filesystemencoding(s):
    try:
        s = s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        s = 'failed!'
    return s

def decoded_stdinputencoding(s):
    try:
        s = s.decode(sys.stdin.encoding)
    except UnicodeDecodeError:
        s = 'failed!'
    return s

parser = ap.ArgumentParser()
parser.add_argument('first', type=encoded)
parser.add_argument('second', type=decoded_filesystemencoding)
parser.add_argument('third', type=decoded_stdinputencoding)
args = parser.parse_args()

print repr(args)

Then I change my shell encoding to ISO/IEC 8859-1 and call the script:

wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
sys.stdin.encoding is  ISO8859-1
sys.getfilesystemencoding() is utf-8
encoded '\xc7'
Namespace(first='\xc7', second='failed!', third=u'\xc7')

As you can see, the command line arguments were encoded in latin-1, and so the second command line argument (using sys.getfilesystemencoding) fails to decode. The third command line argument (using sys.stdin.encoding) decodes correctly.
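The same effect can be reproduced without touching the shell settings, by decoding the raw byte directly under each codec. This is only an illustrative sketch; try_decodings is a hypothetical helper name and is not part of the test script above.

```python
def try_decodings(raw, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The single byte the latin-1 shell passed for the c-cedilla character.
outcome = try_decodings(b'\xc7', ['utf-8', 'ISO8859-1'])
```

Here the 'utf-8' entry is None, because 0xC7 on its own is an incomplete UTF-8 sequence, while the 'ISO8859-1' entry is u'\xc7'.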
