Python returning the wrong length of string when using special characters

Problem description

I have a string ë́aúlt whose length I want to get and which I want to manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

Is there any way in Python to have a character like ë́ counted as 1?

I'm using UTF-8 encoding for the actual code and the web page it is output to.

edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (the letter itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.

Solution

UTF-8 is a Unicode encoding which uses more than one byte for each non-ASCII character. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not the str object!).

Here are some examples (Python 2):

>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified at the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
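
For readers on Python 3, the same comparison might look roughly like the sketch below. It assumes Python 3, where str already holds Unicode code points and bytes holds the encoded data; the variable names text and encoded are only illustrative.

```python
# Python 3 sketch (an assumption: the original examples are Python 2).
# The string is written with escapes so the code points are unambiguous:
# U+00EB (e with diaeresis), U+0301 (combining acute), a, U+00FA, l, t.
text = "\u00eb\u0301a\u00falt"

# Encoding to UTF-8 gives the byte string from the examples above.
encoded = text.encode("utf-8")

print(len(text))     # number of code points
print(len(encoded))  # number of UTF-8 bytes

# Decoding the bytes gives back the original text.
assert encoded.decode("utf-8") == text
```

Here len(text) counts code points and len(encoded) counts bytes, which is the same 6 versus 9 split shown in the Python 2 session.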

Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):

>>> test = u'ë́aúlt'
>>> print test[0]
ë
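
Note that test[0] is only ë: the acute accent is a separate combining code point (U+0301) at index 1, which is why the unicode length is 6 rather than 5. One possible way to count ë́ as a single unit is to attach each combining mark to the character before it. The sketch below assumes Python 3 and uses the standard unicodedata.combining() function; the helper name split_clusters is hypothetical, and this is an approximation rather than full Unicode grapheme segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: a simple approximation, not a full implementation
    of Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u00eb\u0301a\u00falt"  # the string from the question
clusters = split_clusters(word)
print(len(clusters))  # 5
```

With a list like this, position-based rewrite rules can index whole letters (clusters[0] is ë plus its accent) instead of raw code points. Normalizing to NFC alone may not be enough, since not every base and mark combination has a precomposed code point.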

If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding every input you receive. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you avoid the internal UnicodeDecodeErrors that might otherwise crash your server ;)
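
The decode-early, encode-late pattern described above could be sketched as follows. This assumes Python 3 and UTF-8 input; the function name handle_request and the upper-casing step are hypothetical stand-ins for real application logic.

```python
def handle_request(raw_input: bytes) -> bytes:
    """Hypothetical handler: decode at the input boundary, encode at output."""
    # Bytes become text as soon as they enter the program.
    text = raw_input.decode("utf-8")

    # All internal work happens on Unicode text, never on raw bytes.
    result = text.upper()

    # Text becomes bytes again only when it leaves the program.
    return result.encode("utf-8")


response = handle_request(b"\xc3\xab\xcc\x81a\xc3\xbalt")
print(len(response))  # length in bytes of the encoded result
```

Keeping the conversions at the edges means the code in the middle never has to guess which encoding a value is in.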

PS: Please note that the str and unicode data types have changed significantly in Python 3. In Python 3 there are only Unicode strings (str) and plain byte strings (bytes), which can no longer be mixed. That should help to avoid common pitfalls with Unicode handling...

Regards, Christoph
