Python Unicode Encoding

Question

I am using argparse to read in arguments for my python code. One of those inputs is a title of a file [title] which can contain Unicode characters. I have been using 22少女時代22 as a test string.

I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)

From what I have read, my string needs to be in the form u"foo" (a unicode object) before I can call .encode() on it.

When I run type() on my input from argparse I see:

<type 'str'>

I would like to get this instead:

<type 'unicode'>

How can I get it in the right form?

Ideas:

Modify argparse to take in a str but store it as a unicode string u"foo":

parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')

This approach is not working at all. Thoughts?

Some sample code where title is 22少女時代22:

inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title

Recommended answer

It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:

>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

(At Python 3 prompt)

Now, I'm guessing that to "convert the string to UTF-8", you used something like

title.encode('utf8')

The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, a bytestring can be encoded directly, and Python first decodes it on the assumption that it is ASCII-encoded. So what you have is conceptually equivalent to

title.decode('ascii').encode('utf8')

Naturally, that implicit decode call fails, because the bytestring contains non-ASCII bytes.
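The implicit step can be reproduced by writing it out. The following is a minimal sketch for a Python 3 prompt, not the asker's actual program: the names text, raw and error are made up for illustration, and raw merely stands in for the bytes that argparse would hand over on a Shift JIS console.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): perform by hand the ASCII decode that Python 2
# runs silently when .encode() is called on a bytestring.
text = u'22少女時代22'
raw = text.encode('shift_jis')  # stand-in for the bytes received from the console

try:
    raw.decode('ascii')  # the hidden first half of title.encode('utf8') in Python 2
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed and where.
print(error)
```

The position reported by the exception is an index into the bytestring, which is why it points just past the two leading ASCII digits.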

You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:

title.decode('sjis').encode('utf8')
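As a rough illustration of the whole path from legacy bytes to a UTF-8 file, here is a small sketch. The helper name sjis_to_utf8 and the temporary output path are hypothetical and not part of the question; the byte literal is the one shown earlier in this answer.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def sjis_to_utf8(raw, source_encoding='shift_jis'):
    """Decode legacy-encoded bytes to text, then encode the text as UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')


raw_title = b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'  # SJIS bytes from the answer
utf8_title = sjis_to_utf8(raw_title)

# Hypothetical output location; binary mode avoids any further implicit conversion.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')
with io.open(path, 'wb') as handle:
    handle.write(utf8_title)

# Read the file back as UTF-8 text to check the round trip.
with io.open(path, encoding='utf-8') as handle:
    round_tripped = handle.read()
```

Writing in binary mode is a deliberate choice here: the bytes are already UTF-8, so the file object should not attempt another encoding step.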


As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console uses a non-Unicode encoding.

So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use

title.decode('cp932').encode('utf8')
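Rather than hard-coding cp932, one possible approach is to let argparse do the decoding through its type= converter, using whatever encoding the console reports. This is only a sketch: to_text is a hypothetical helper, and the UTF-8 fallback for when sys.stdin.encoding is unavailable is an assumption, not something the question confirms.

```python
# -*- coding: utf-8 -*-
import argparse
import sys


def to_text(value, encoding=None):
    """Return text; decode only when a bytestring was handed over."""
    if isinstance(value, bytes):  # Python 2 argv entries are bytestrings
        # Assumption: fall back to UTF-8 if the console encoding is unknown.
        console = getattr(sys.stdin, 'encoding', None)
        return value.decode(encoding or console or 'utf-8')
    return value  # Python 3 argv entries are already text


parser = argparse.ArgumentParser()
parser.add_argument('title', metavar='T', type=to_text,
                    help='title, decoded to text')

# Passing the list explicitly keeps the sketch independent of the real command line.
args = parser.parse_args([u'22少女時代22'])
title_utf8 = args.title.encode('utf-8')
```

With the converter in place, args.title is always text, so the later .encode('utf-8') call no longer triggers a hidden ASCII decode.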

You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
