Python Unicode Encoding
Question
I am using argparse to read in arguments for my Python code. One of those inputs is the title of a file (title), which can contain Unicode characters. I have been using 22少女時代22 as a test string.
I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)
I have been looking around and see that I need my string to be in the form u"foo" to call .encode() on it.
When I run type() on my input from argparse I see:
<type 'str'>
I was hoping to get:
<type 'unicode'>
How can I get it in the right form?
Idea: modify argparse to take in a str but store it as a unicode string u"foo":
parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')
This approach is not working at all. Thoughts?
Some sample code where title is 22少女時代22:
inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title
Recommended answer
It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:
>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'
(At a Python 3 prompt)
Now, I'm guessing that to "convert the string to UTF-8", you used something like
title.encode('utf8')
The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, .encode() can be called directly on a bytestring, in which case Python first decodes it implicitly, assuming it is ASCII-encoded. So what you have is conceptually equivalent to
title.decode('ascii').encode('utf8')
Of course, the decode call fails.
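To make the failure concrete, here is a minimal sketch written for Python 3, where the implicit step has to be spelled out. The variable names are illustrative only; the byte values are the ones shown in the SJIS example above.

```python
# Python 3 sketch: reproduce the ASCII decode that Python 2 performs
# implicitly when .encode() is called on a bytestring.
sjis_bytes = '22少女時代22'.encode('shift_jis')

try:
    # Python 2 does the equivalent of this behind the scenes.
    sjis_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where decoding stopped and which byte caused it.
print(error)
```

The first two bytes are the ASCII digits "22", so decoding only breaks at index 2, on the first byte of the first kanji.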
You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:
title.decode('sjis').encode('utf8')
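The same two-step conversion can be sketched in Python 3 syntax, with the intermediate Unicode string kept in its own variable so each step is visible. The names sjis_bytes, text and utf8_bytes are chosen here for illustration and are not part of the original code.

```python
# Bytes as they would arrive from an SJIS console (taken from the example above).
sjis_bytes = b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

# Step 1: bytes -> text, naming the encoding the bytes are really in.
text = sjis_bytes.decode('shift_jis')

# Step 2: text -> bytes in the target encoding.
utf8_bytes = text.encode('utf-8')

print(text)
print(utf8_bytes)
```

Keeping the decoded text around is usually the more convenient design: the program works with Unicode internally and only encodes at the point where it writes output.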
As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console encoding is a non-Unicode encoding.
So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use
title.decode('cp932').encode('utf8')
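Rather than hard-coding cp932, one option is to look the console encoding up at run time. The sketch below is an assumption about how that could be arranged, not something from the original answer: decode_console_bytes is a hypothetical helper, and the fallback to locale.getpreferredencoding is a guess at a reasonable default when sys.stdin reports no encoding.

```python
import locale
import sys


def decode_console_bytes(raw, encoding=None):
    """Decode bytes typed at the console into text (hypothetical helper)."""
    if encoding is None:
        # sys.stdin.encoding can be missing or None (e.g. when input is piped),
        # so fall back to the locale's preferred encoding in that case.
        encoding = getattr(sys.stdin, 'encoding', None) or locale.getpreferredencoding(False)
    return raw.decode(encoding)


# The encoding is passed explicitly here so the example does not depend on
# the machine it runs on; on the asker's console it would be detected as cp932.
cp932_bytes = b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'
title = decode_console_bytes(cp932_bytes, encoding='cp932')
utf8_title = title.encode('utf-8')
print(title)
```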
You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
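If the console encoding cannot be changed, another way to get UTF-8 on disk is to let the file object do the encoding. This is a sketch under the assumption that title has already been decoded to a Unicode string; the temporary path is made up for the example. io.open accepts an encoding argument in both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Assumed to be already decoded from the console encoding.
title = u'22少女時代22'

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')

# The file object encodes the text to UTF-8 as it writes.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(title)

# Read the raw bytes back to confirm what ended up on disk.
with io.open(path, 'rb') as handle:
    written = handle.read()

print(written)
```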