Convert UTF-8 to string literals in Python

Question

I have a string in UTF-8 format, but I'm not sure how to convert it to its corresponding character literal. For example, I have the string:

My string is: 'Entre\xc3\xa9'

Example 1:

This code:

u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')

returns: u'Entre\xe9'

If I then continue by printing this:

print u'Entre\xe9'

I get the result: Entreé

This is great and close to what I need. The problem is that I can't store 'Entre\xc3\xa9' in a variable and pass it through the same steps, because that breaks. Any tips for getting this working?

Example:

a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b

I would like the result of "c" to be:

Entreé

Recommended answer

The u'' syntax only works for string literals, that is, for values written directly in source code. Using it creates a unicode object, but it is not the only way to create one.

You cannot make a unicode value from a byte string by putting a u in front of it. If you call str.decode() with the right encoding, however, you get a unicode value. Conversely, you can encode unicode objects to byte strings with unicode.encode().
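
As a rough illustration of that decode/encode round trip: the answer is written for Python 2, where str is a byte string and unicode is text. The sketch below assumes Python 3 instead, where the same two roles are played by bytes and str. The variable names are illustrative only.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
raw = b'Entre\xc3\xa9'          # UTF-8 encoded bytes

text = raw.decode('utf-8')      # bytes -> text
assert text == 'Entre\xe9'      # U+00E9 is the character "é"
assert len(raw) == 7 and len(text) == 6

# Encoding goes the other way and restores the original bytes.
assert text.encode('utf-8') == raw

# Prefixing a "u" is plain string concatenation, not a conversion.
assert 'u' + text == 'uEntre\xe9'
```

The last assertion shows why the 'u' + ... attempt in the question cannot work: it only adds a letter to the front of the value.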

Note that when displaying a unicode object, Python represents it using the Unicode string literal syntax again (hence the u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
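
The round trip through the displayed representation can be checked with a small sketch. It assumes Python 3, where ascii() produces an escaped literal comparable to what Python 2's repr() shows for a unicode object (without the u prefix), and it uses ast.literal_eval as a stand-in for pasting the text back into the interpreter.

```python
import ast

text = 'Entre\xe9'

# ascii() escapes non-ASCII characters, much like Python 2's repr()
# of a unicode object (minus the u prefix).
shown = ascii(text)
print(shown)  # prints 'Entre\xe9' (including the quotes)

# Evaluating the displayed literal yields an equal string.
restored = ast.literal_eval(shown)
assert restored == text
```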

Your a value is defined using a byte string literal, so you only need to decode:

a = 'Entre\xc3\xa9'
b = a.decode('utf8')

Your first example created a Mojibake: a Unicode string whose Latin-1 code points actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake) and then decode from UTF-8.
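
To make the Mojibake explanation concrete, here is a minimal sketch, again assuming Python 3 (bytes and str rather than Python 2's str and unicode). It first produces the garbled string by decoding UTF-8 bytes with the wrong codec, then repairs it with the same encode-then-decode step used in the question's first example.

```python
raw = b'Entre\xc3\xa9'              # the real data: UTF-8 bytes

# Decoding with the wrong codec "succeeds" but yields Mojibake:
# each byte becomes the Latin-1 character with the same number.
garbled = raw.decode('latin-1')
assert garbled == 'Entre\xc3\xa9'   # displays as "EntreÃ©"

# Encoding back to Latin-1 recovers the original bytes,
# which can then be decoded with the correct codec.
fixed = garbled.encode('latin-1').decode('utf-8')
assert fixed == 'Entre\xe9'         # displays as "Entreé"
```

The repair works because Latin-1 maps each of the 256 byte values to the code point with the same number, so encoding to Latin-1 loses nothing here.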

You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder
