Read Unicode characters from command-line arguments in Python 2.x on Windows
Question
I want my Python script to be able to read Unicode command line arguments in Windows. But it appears that sys.argv is a string encoded in some local encoding, rather than Unicode. How can I read the command line in full Unicode?
Example code: argv.py
import sys
first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)
On my PC set up for Japanese code page, I get:
C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>
That's Shift-JIS encoded I believe, and it "works" for that filename. But it breaks for filenames with characters that aren't in the Shift-JIS character set—the final "open" call fails:
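The Shift-JIS guess can be checked by decoding the hex dump from the session above by hand. This is only a sketch under one assumption: that the Japanese ANSI code page is cp932 (Microsoft's Shift-JIS variant). It is written to run on both Python 2.6+ and Python 3.

```python
from __future__ import print_function
import binascii

# Hex dump printed by argv.py for the first file name.
raw = binascii.unhexlify(
    "50438145835c83748367905c90bf8f9130382e30392e32342e646f63")

# Assumption: the system ANSI code page is cp932.
name = raw.decode("cp932")
print(repr(name))

# Encoding the decoded name again yields the same bytes,
# so nothing was lost for this particular file name.
roundtrip = name.encode("cp932") == raw
print(roundtrip)

# A character outside the code page cannot be encoded at all.
try:
    u"J\xf6rgen.txt".encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

Python's codec refuses the "ö" outright, whereas in the session above Windows silently substituted a plain "o" before the script ever saw the argument.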
C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'
Note: I'm talking about Python 2.x, not Python 3.0. I've found that Python 3.0 gives sys.argv as proper Unicode. But it's a bit early yet to transition to Python 3.0 (due to lack of third-party library support).
Update:
A few answers have said I should decode according to whatever sys.argv is encoded in. The problem with that is that it's not full Unicode, so some characters are not representable.
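To make that concrete, here is a small sketch using the bytes the script actually received for Jörgen.txt (taken from the hex dump in the session above). It assumes the ANSI code page is cp932; the variable names are only for illustration.

```python
from __future__ import print_function
import binascii

# Bytes that reached sys.argv[1] when the argument was "Jörgen.txt".
received = binascii.unhexlify("4a6f7267656e2e747874")

# Decoding with the code page (assumed to be cp932) works without error...
decoded = received.decode("cp932")

# ...but the result is not the name that was typed: the "ö" was already
# replaced by "o" before Python saw the argument.
intended = u"J\xf6rgen.txt"
print(decoded == intended)  # False
```

No choice of decoder can bring the original character back, because the information is gone by the time sys.argv is filled in.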
Here's the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some not in the system default code page. My Python script doesn't get the right Unicode filenames passed to it via sys.argv in all cases, when the characters aren't representable in the current code page encoding.
There is certainly some Windows API to read the command line with full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.
Recommended answer
Here is a solution that is just what I'm looking for, making a call to the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)
But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:
win32_unicode_argv.py
"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
    http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""

import sys

def win32_unicode_argv():
    """Uses shell32.CommandLineToArgvW to get sys.argv as a list of Unicode
    strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing characters
    that the ANSI code page cannot represent with '?' or a best-fit
    substitute.
    """
    from ctypes import POINTER, byref, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    # Win32 API functions use the stdcall convention, hence windll.
    GetCommandLineW = windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()
Now, the way I use it is simply to do:
import sys
import win32_unicode_argv
and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
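As a rough illustration of the optparse point, here is a sketch that feeds a list of Unicode strings to OptionParser. The parse_args helper, the --output option and the sample_argv list are all made up for this example; in a real script the list would be the sys.argv produced by win32_unicode_argv.

```python
from __future__ import print_function
from optparse import OptionParser

def parse_args(argv):
    """Parse a Unicode argv list (hypothetical helper for illustration)."""
    parser = OptionParser(usage="%prog [options] FILE...")
    # "--output" is an invented option, not something from the recipe.
    parser.add_option("-o", "--output", dest="output", default=None,
                      help="file to write results to")
    # Skip argv[0], the script name, as optparse does by default.
    return parser.parse_args(argv[1:])

# Stand-in for the Unicode sys.argv; the last entry is a Japanese file name.
sample_argv = [u"script.py", u"--output", u"J\xf6rgen.txt",
               u"\u7533\u8acb\u66f8.doc"]

options, paths = parse_args(sample_argv)
print(repr(options.output))
print(repr(paths))
```

The option value and the positional arguments come back as the same Unicode strings that went in, so they can be passed straight to open().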