Python returning the wrong length of string when using special characters

Question

I have a string ë́aúlt whose length I want to get and which I want to manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or, I guess, ë is in position 0 and ´ is in position 1.

Is there any way in Python to have a character like ë́ counted as a single character?

I'm using UTF-8 encoding for the actual code and for the web page it is output to.

Edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a Native American language), and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (the letter itself and the surrounding letters) and other characteristics, such as accents and other diacritic markings.

Recommended answer

UTF-8 is a Unicode encoding that uses more than one byte for special (non-ASCII) characters. If you don't want the length of the encoded string, simply decode it and call len() on the unicode object (not the str object!).

Here are some examples (Python 2):

>>> # creates a str literal (UTF-8 encoded, if that encoding was
>>> # declared at the beginning of the file):
>>> len('ë́aúlt') 
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt') 
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt') 
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8')) 
6

Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):

>>> test = u'ë́aúlt'
>>> print test[0]
ë
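
Note that test[0] prints ë without the acute accent: the combining acute (U+0301) is a separate code point at index 1, which is why len() reports 6 rather than 5. If each letter should count once together with its accents, one option is to attach combining marks to the preceding base character. The following is a minimal sketch for Python 3, not part of the original answer; the helper name group_combining is made up for illustration, and it is a simplification of Unicode's full grapheme cluster rules (UAX #29), which cover more cases than combining marks.

```python
import unicodedata


def group_combining(text):
    """Split text into base characters, each with its trailing combining marks."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The string from the question, written with escapes to make the code points explicit:
# U+00EB (ë), U+0301 (combining acute), a, U+00FA (ú), l, t
word = "\u00eb\u0301a\u00falt"
clusters = group_combining(word)

print(len(word))      # 6 code points
print(len(clusters))  # 5 user-visible letters
```

With the letters grouped this way, clusters[0] is the whole ë́ and clusters[1] is a, so position-based rewrite rules can index into the list instead of the raw string.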

If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding all input you receive. After the work is done, you can encode the result as UTF-8 again. If you stick to this principle, you will never see your server crash because of an internal UnicodeDecodeError that you might otherwise get ;)
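
The decode-early, encode-late principle can be sketched as follows. This is an illustrative Python 3 example under assumptions: the function name handle_request and the upper-casing step are placeholders for whatever work the application really does.

```python
def handle_request(raw_bytes):
    """Hypothetical handler: bytes in, bytes out, text in between."""
    # Decode once, at the boundary where data enters the program.
    text = raw_bytes.decode("utf-8")
    # All internal work happens on text (str in Python 3, unicode in Python 2).
    result = text.upper()
    # Encode once, at the boundary where data leaves the program.
    return result.encode("utf-8")


incoming = b"\xc3\xab\xcc\x81a\xc3\xbalt"  # the UTF-8 bytes from the examples above
outgoing = handle_request(incoming)
print(outgoing.decode("utf-8"))
```

Keeping the conversions at the edges means no function in the middle has to guess whether it was handed bytes or text.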

PS: Please note that the str and unicode data types changed significantly in Python 3. In Python 3 there are only Unicode strings and plain byte strings, which can no longer be mixed. That should help avoid common pitfalls with Unicode handling...
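
For comparison, here is a short Python 3 sketch of the same measurements. In Python 3, str holds code points and bytes holds the encoded form, and combining the two raises a TypeError instead of triggering an implicit decode. The variable names are illustrative only.

```python
text = "\u00eb\u0301a\u00falt"  # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: the UTF-8 encoded form

print(len(text))  # 6
print(len(data))  # 9

try:
    text + data  # mixing str and bytes is not allowed in Python 3
except TypeError as exc:
    mixed_error = exc
    print("cannot mix str and bytes:", exc)
```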

Regards, Christoph
