Python ascii utf unicode

Question

When I parse this XML with p = xml.parsers.expat.ParserCreate():

<name>Fortuna D&#252;sseldorf</name>

The character data event handler receives u'\xfc'.

How can u'\xfc' be turned into u'ü'?

This is the main question in this post; the rest just shows further (ranting) thoughts about it.

Isn't Python's unicode broken, since u'\xfc' should yield u'ü' and nothing else? u'\xfc' is already a unicode string, so converting it to unicode again doesn't work! Converting it to ASCII doesn't work either.

The only thing I found that works is this (it cannot be intended, right?):

exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')

Replacing 8859 with utf-8 fails! What is the point of that?

Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing how to do the conversions that one (especially the hundreds of people who ask similar questions here) actually uses in real-world practice.

Unicode is no magic. Why do so many people here have issues?

The underlying problem of unicode conversion is dirt simple:

One bidirectional lookup table '\xFC' <-> u'ü'

unicode( 'Fortuna D\xfcsseldorf' ) 

Why do the creators of Python think it is better to raise an error instead of simply producing u'Fortuna Düsseldorf'?

Also, why did they make it not reversible?

>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'


Recommended answer

You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII-friendly. Echoing a value in the interpreter shows you the result of calling repr() on that value.

In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worrying about how other systems might handle non-ASCII codepoints. As such, the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
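The point that the handler already holds the right value can be checked with a small sketch. It is written for Python 3 (an assumption; the question uses Python 2 syntax), and the helper name parse_name is hypothetical. It collects the character data that expat delivers and compares it with both spellings of the string:

```python
import ast
import xml.parsers.expat


def parse_name(xml_text):
    """Collect all character data delivered by expat and join it."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver the text in several pieces, so gather them all.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return "".join(chunks)


name = parse_name("<name>Fortuna D&#252;sseldorf</name>")

# The escaped spelling and the literal glyph denote the same string value.
same_value = (name == "Fortuna D\xfcsseldorf" == "Fortuna Düsseldorf")

# Pasting the ASCII-safe representation back into Python recreates the value.
round_trip = ast.literal_eval(ascii(name))

print(same_value, round_trip == name)
```

Both comparisons should hold, which suggests that nothing needs to be converted: only the way the value is displayed differs.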

As such, ü has been replaced by \xfc, because FC is the hexadecimal value of its Unicode codepoint, U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
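As a quick illustration (Python 3 assumed; the variable names are only for this sketch), the standard unicodedata module can confirm which codepoint hides behind the escape, and that it matches the &#252; reference in the XML:

```python
import unicodedata

ch = "\xfc"

# The numeric codepoint; 252 decimal is FC hexadecimal.
code_point = ord(ch)

# Build the conventional "U+XXXX NAME" label for the character.
label = "U+%04X %s" % (code_point, unicodedata.name(ch))

print(code_point, label)
```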

If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:

>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf

If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:

>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
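The encoding step is also reversible, as long as the same codec is used in both directions. The following is a minimal Python 3 sketch (an assumption, since the transcripts above are Python 2), and it assumes that the '8859' in the question refers to ISO 8859-1. It also hints at why swapping in utf-8 failed there: a lone \xfc byte is not valid UTF-8.

```python
text = "Fortuna D\xfcsseldorf"

# Two different byte encodings of the same text.
utf8_bytes = text.encode("utf-8")         # ü becomes two bytes: C3 BC
latin1_bytes = text.encode("iso-8859-1")  # ü becomes one byte: FC

# Decoding with the matching codec gives the original text back.
utf8_ok = utf8_bytes.decode("utf-8") == text
latin1_ok = latin1_bytes.decode("iso-8859-1") == text

# Decoding ISO 8859-1 bytes as UTF-8 is rejected instead of guessed at.
try:
    latin1_bytes.decode("utf-8")
    utf8_rejects_latin1 = False
except UnicodeDecodeError:
    utf8_rejects_latin1 = True

print(utf8_bytes, latin1_bytes, utf8_ok, latin1_ok, utf8_rejects_latin1)
```

There is no single lookup table between bytes and text; each codec defines its own, which is why the codec has to be named explicitly.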

The alternative is to upgrade to Python 3; there, repr() only escapes codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.). The new ascii() function still gives you the Python 2 repr() behaviour.
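A short Python 3 sketch of that difference (the variable names are illustrative only):

```python
text = "Fortuna D\xfcsseldorf"

# Python 3 repr() keeps printable non-ASCII characters as they are.
py3_repr = repr(text)

# ascii() escapes them, like repr() did in Python 2 (minus the u prefix).
py2_style = ascii(text)

print(py3_repr)
print(py2_style)
```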
