Convert UTF-8 to string literals in Python
Question
I have a string in UTF-8 format but am not sure how to convert it to its corresponding character literal. For example, I have the string:
My string is: 'Entre\xc3\xa9'
Example one:
This code:
u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')
Gives the result: u'Entre\xe9'
If I then continue by printing this:
print u'Entre\xe9'
I get the result: Entreé
This is great and close to what I need. The problem is that I can't put 'Entre\xc3\xa9' in a variable and pass it through the same steps, because this now breaks. Any tips for getting this working?
Example:
a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b
I would like the result of "c" to be:
Entreé
Recommended answer
The u'' syntax only works for string literals, i.e. when defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
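As a minimal sketch of that round trip: the names raw and text are illustrative, and the explicit b'' and u'' prefixes are an assumption made here so that the snippet behaves the same on Python 2.7 and on Python 3.3+, where the types are called bytes and str instead of str and unicode.

```python
# Byte string holding UTF-8 encoded data (the two bytes C3 A9 encode "é").
raw = b'Entre\xc3\xa9'

# Decoding with the right codec yields a text (unicode) value.
text = raw.decode('utf-8')
assert text == u'Entre\xe9'

# The byte string is one item longer: "é" takes two bytes in UTF-8
# but is a single code point once decoded.
assert len(raw) == 7
assert len(text) == 6

# Encoding goes the other way and restores the original bytes.
assert text.encode('utf-8') == raw
```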
Note that when displaying a unicode object, Python represents it using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
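One way to check that claim without copying and pasting is to feed the representation to ast.literal_eval, which evaluates a string containing a Python literal. This is only a sketch; the exact text of the representation differs between Python versions (Python 2 shows u'Entre\xe9', while Python 3 drops the u prefix), so the snippet compares values rather than the representation itself.

```python
import ast

text = b'Entre\xc3\xa9'.decode('utf-8')

# repr() produces the literal syntax that the interpreter displays.
shown = repr(text)

# Evaluating that literal again gives an object with the same value.
rebuilt = ast.literal_eval(shown)
assert rebuilt == text
```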
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
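If the value arrives in a variable and may or may not already be decoded, the same step can be wrapped in a small function. The helper below, to_text, is hypothetical and not part of the original answer. It assumes the input is either UTF-8 encoded bytes or already a text value, and the b'' prefix is used so that it runs on both Python 2.7 and Python 3.

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through."""
    # On Python 2, bytes is an alias for str, so this matches byte strings
    # on both major versions.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


a = b'Entre\xc3\xa9'
c = to_text(a)
assert c == u'Entre\xe9'

# Calling it again on the decoded value is a no-op.
assert to_text(c) == c
```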
Your first example created a Mojibake: a Unicode string containing Latin-1 code points that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake) and then decode from UTF-8.
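The repair works because Latin-1 maps each code point from U+0000 to U+00FF to the byte with the same value, so encoding recovers the original bytes exactly. A short sketch, with illustrative variable names and written to run on Python 2.7 and Python 3.3+:

```python
# Mojibake: each UTF-8 byte ended up as its own code point (U+00C3, U+00A9).
mojibake = u'Entre\xc3\xa9'
assert len(mojibake) == 7

# Encoding to Latin-1 turns those code points back into the raw bytes...
raw = mojibake.encode('latin-1')
assert raw == b'Entre\xc3\xa9'

# ...which can then be decoded with the codec that was intended.
repaired = raw.decode('utf-8')
assert repaired == u'Entre\xe9'
assert len(repaired) == 6

# The mojibake itself is what decoding the bytes with the wrong codec gives.
assert b'Entre\xc3\xa9'.decode('latin-1') == mojibake
```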
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
- Pragmatic Unicode by Ned Batchelder