What is the difference between the u'' prefix and unicode() in Python?
What is the difference between the u'' prefix and unicode()?
# -*- coding: utf-8 -*-
print u'上午' # this works
print unicode('上午', errors='ignore') # this works but print out nothing
print unicode('上午') # error
For the third print, the error is: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0
If I have a text file containing non-ASCII characters, such as "上午", how do I read it and print it out correctly?
- u'..' is a string literal; its characters are decoded according to the source encoding declaration.
- unicode() is a function that converts another type to a unicode object. You've given it a byte string literal, and it decodes a byte string according to the default ASCII codec.
So you created a byte string object using a different type of literal notation, then tried to convert it to a unicode object, which fails because the default codec for str -> unicode conversions is ASCII.
The two are quite different beasts. If you want to use the latter, you need to give it an explicit codec:
print unicode('上午', 'utf8')
The two are related in the same way that 0xFF and int('0xFF', 0) are related: the former defines an integer of value 255 using hex notation, while the latter uses the int() function to extract an integer from a string.
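The analogy can be checked directly. This is a small sketch rather than part of the original answer, and the variable names are illustrative. It also shows that int() without a base rejects the string, much as unicode() without a codec rejects non-ASCII bytes:

```python
# The literal is evaluated by the parser; int() parses a string at run time.
literal_value = 0xFF
parsed_value = int('0xFF', 0)   # base 0: infer the base from the '0x' prefix

# Without a base, int() assumes base 10 and rejects the string, much as
# unicode() assumes ASCII when no codec is given.
try:
    int('0xFF')
    default_base_failed = False
except ValueError:
    default_base_failed = True
```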
An alternative is to use the str.decode() method:
print '上午'.decode('utf8')
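For the text-file part of the question, the same rule applies: decode the bytes with an explicit codec. The following is a minimal sketch, assuming the file is saved as UTF-8. The file name sample.txt is hypothetical and the file is created here only so the example is self-contained. io.open() is available on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'上午\n')

# Option 1: read the raw bytes, then decode them explicitly.
with open(path, 'rb') as f:
    text_from_bytes = f.read().decode('utf-8')

# Option 2: let io.open() decode while reading.
with io.open(path, encoding='utf-8') as f:
    text_from_io = f.read()
```

Either result is a unicode object, so printing it works as long as the terminal's encoding can represent the characters.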
Don't be tempted to use an error handler (such as 'ignore' or 'replace') unless you know what you are doing. 'ignore' in particular can mask underlying problems, such as having picked the wrong codec.
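The second print in the question illustrates this masking. The sketch below assumes the source file is saved as UTF-8 and uses bytes.decode() instead of unicode() so that it runs on both Python 2 and Python 3. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# In a UTF-8 source file, the Python 2 literal '上午' holds these six bytes.
raw = u'上午'.encode('utf-8')

# Decoding with ASCII fails on the very first byte (0xe4), as in the question.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# 'ignore' silently drops every byte that ASCII cannot decode. All six bytes
# are >= 0x80, so the result is empty and print shows nothing.
ignored = raw.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD for each undecodable byte: no error, no data.
replaced = raw.decode('ascii', 'replace')

# Naming the correct codec recovers the original two characters.
decoded = raw.decode('utf-8')
```

In both handler cases the wrong codec (ASCII) goes unnoticed, and the text is lost instead of an error being raised.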
You may want to read up on Python and Unicode:
- Pragmatic Unicode by Ned Batchelder
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky