Python: different lengths for the same unicode string

Problem description

I found something really weird about unicode. As I understand it, u"" + "string" produces a unicode object, so why do the two strings below have different lengths?

print len(u''+'New York\u200b')
14
print type(u''+'New York\u200b')
<type 'unicode'>
print len(u'New York\u200b')
9
print type(u'New York\u200b')
<type 'unicode'>

I also tried to get rid of the \u200b, which I believe is a unicode character:

text = u'New York\u200b'
print text.encode('ascii', errors='ignore')
New York
text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b

This also gives different results, and I am really confused! By the way, I am using Python 2.7; is it time to switch to 3.3? Thanks in advance!

Solution

>>> (u''+'New York\u200b').encode('utf-8')
'New York\\u200b'

As you can see, since 'New York\u200b' is not a unicode string, the \u escape has no special meaning and is taken literally, i.e. as the sequence of ASCII characters \ u 2 0 0 b, so the string has length 14. Prepending u'' only converts the string to unicode; it does not cause its contents to be re-interpreted. Putting the u directly before the literal makes Python treat \u200b as an escape, and therefore as a single character, so that string has length 9.
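The difference between the two literals can be reproduced in a small script. This is only a sketch: the variable names are made up for illustration, and the backslash is doubled in the byte string so that the literal spelling is explicit on both Python 2.7 and Python 3. It also shows one possible way to re-interpret the escape afterwards, using the standard 'unicode_escape' codec.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.

# A byte string holding 14 ASCII characters: "New York" followed by
# backslash, u, 2, 0, 0, b. The doubled backslash spells this out explicitly.
raw = b'New York\\u200b'

# A unicode literal: here \u200b is an escape for a single character.
text = u'New York\u200b'

# Decoding as ASCII (what u'' + 'str' does implicitly on Python 2)
# keeps all 14 characters unchanged.
as_ascii = raw.decode('ascii')

# The 'unicode_escape' codec re-interprets the backslash escape after the fact.
reinterpreted = raw.decode('unicode_escape')

print(len(as_ascii))          # 14
print(len(text))              # 9
print(reinterpreted == text)  # True
```

Whether re-interpreting escapes like this is appropriate depends on where the string came from; if the literal is under your control, writing it as u'...' in the first place is simpler.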

In your second example:

text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b

Here .encode does not modify the characters in the string; it only converts from unicode to str. All 14 characters are already ASCII, so errors='ignore' has nothing to drop.
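For the original goal of getting rid of the \u200b, a minimal sketch might look like the following. It assumes the string really contains the single character U+200B (that is, it was written as a unicode literal); the variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.

text = u'New York\u200b'        # 9 characters, the last one is U+200B

# Option 1: remove only that one character and stay in unicode.
cleaned = text.replace(u'\u200b', u'')

# Option 2: encode to ASCII and drop everything that is not ASCII.
ascii_only = text.encode('ascii', 'ignore')

# For comparison: a string that contains the six literal characters \u200b.
# Every character is ASCII, so 'ignore' drops nothing.
literal = u'New York\\u200b'
literal_encoded = literal.encode('ascii', 'ignore')

print(cleaned == u'New York')      # True
print(ascii_only == b'New York')   # True
print(len(literal_encoded))        # 14
```

Option 1 is the narrower fix, since option 2 would also silently discard any other non-ASCII characters in the text.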

It's probably clearer if you print the contents of the two strings:

>>> print(u'New York\u200b')  # note: \u200b interpreted as unicode character
New York
>>> print(b'New York\u200b'.decode('ascii'))
New York\u200b

Or, if you prefer a character that is actually visible when printed, try code point 9731 (U+2603):

>>> print(u'New York\u2603')
New York☃
>>> print(b'New York\u2603'.decode('ascii'))
New York\u2603
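If it is unclear which character an escape stands for, the standard unicodedata module can be used to look it up. The snippet below is just a sketch of that check for the two code points used above; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.
import unicodedata

snowman = u'\u2603'
zero_width = u'\u200b'

print(ord(snowman))                  # 9731
print(unicodedata.name(snowman))     # SNOWMAN
print(unicodedata.name(zero_width))  # ZERO WIDTH SPACE
print(len(zero_width))               # 1
```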
