Python returning the wrong length of string when using special characters
Question
I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any way in Python to have a character like ë́ counted as a single character?
I'm using UTF-8 encoding for the actual code and for the web page it is being output to.
Edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a Native American language), and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (the letter itself and the surrounding letters) and other characteristics, such as accents and other diacritic markings.
Recommended answer
UTF-8 is a Unicode encoding that uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and call len() on the unicode object (and not on the str object!).
Here are some examples:
>>> # creates a str literal (UTF-8 encoded, if that encoding was
>>> # declared at the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
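Note that test[0] prints ë without the acute accent: the accent is a separate combining code point at index 1. If each letter should be handled together with its diacritics, one possible approach is to group every combining mark with the character before it. The following is a minimal Python 3 sketch using the standard unicodedata module; split_clusters is a hypothetical helper name, and the grouping is a simplification that does not implement full Unicode grapheme cluster segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration; a simplification, not full
    Unicode grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u00eb\u0301a\u00falt"  # ë + combining acute, a, ú, l, t
clusters = split_clusters(word)

print(len(word))      # prints: 6
print(len(clusters))  # prints: 5
```

With this list, clusters[0] holds ë together with its acute accent, so position-based rewrite rules can index letters rather than code points.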
If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as UTF-8. If you keep to this principle, you will never see your server crash because of an internal UnicodeDecodeError you might otherwise get ;)
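The decode-early, encode-late principle can be sketched as follows in Python 3. This is an illustration under assumptions, not the answer's own code: handle_request is a hypothetical function name, and upper-casing stands in for whatever text processing the application actually does.

```python
def handle_request(raw_input: bytes) -> bytes:
    """Hypothetical handler: decode at the boundary, work on text, encode on output."""
    text = raw_input.decode("utf-8")   # bytes -> str as soon as input arrives
    result = text.upper()              # all internal work happens on str
    return result.encode("utf-8")      # str -> bytes only when leaving the program


response = handle_request(b"\xc3\xab\xcc\x81a\xc3\xbalt")
print(response.decode("utf-8"))
```

Keeping the conversions at the edges means the internal code never has to guess whether a value is encoded or not.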
PS: Please note that the str and unicode data types have changed significantly in Python 3. In Python 3 there are only Unicode strings and plain byte strings, which can no longer be mixed. That should help to avoid common pitfalls with Unicode handling...
Regards, Christoph