How to get a reliable unicode character count in Python?
Problem description
Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it, you get '\U0001d10c' (length 1). I'm trying to count the number of unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to '\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode again, but is there a more straightforward/efficient way?
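To illustrate the discrepancy the question describes, here is a minimal sketch in Python 3 syntax (Python 3 strings behave like a UCS-4/wide build of Python 2, so the same two representations can be written directly):

```python
surrogate_form = '\ud834\udd0c'   # the character as a UTF-16 surrogate pair
normalized = '\U0001d10c'         # the same character as a single code point

# The two spellings denote the same character but have different lengths,
# which is exactly the before/after-datastore mismatch described above.
print(len(surrogate_form))  # 2
print(len(normalized))      # 1
```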
I know I can just encode it to UTF-8 and then decode again
Yes, that's the usual idiom to fix up the problem when you have "UTF-16 surrogates in UCS-4 string" input. But as Mechanical snail said, this input is malformed, and you should preferably fix whatever produced it.
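The round-trip idiom can be sketched as follows. Note one caveat: on Python 3 the UTF-8 codec refuses to encode lone surrogates, so a UTF-16 round-trip with errors='surrogatepass' plays the role that encode('utf-8')/decode('utf-8') did on a Python 2 UCS-4 build:

```python
s = '\ud834\udd0c'  # malformed input: a surrogate pair left in a wide string

# Round-trip through UTF-16: the encoder passes the surrogates through
# unchanged, and the decoder recombines the pair into one code point.
fixed = s.encode('utf-16', 'surrogatepass').decode('utf-16')

print(fixed == '\U0001d10c')  # True
print(len(fixed))             # 1
```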
is there a more straightforward/efficient way?
Well... you could do it manually with a regex, like:
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr((ord(m.group(1)) - 0xD800 << 10) + ord(m.group(2)) - 0xDC00 + 0x10000),
    s
)
Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
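For completeness, here is a sketch of the same regex approach ported to Python 3, where unichr is spelled chr (the helper name merge_surrogates is my own, not from the answer):

```python
import re

# Match a high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF).
_SURROGATE_PAIR = re.compile('([\ud800-\udbff])([\udc00-\udfff])')

def merge_surrogates(s):
    # Replace each surrogate pair with the single astral code point it encodes:
    # code point = (high - 0xD800) << 10) + (low - 0xDC00) + 0x10000.
    return _SURROGATE_PAIR.sub(
        lambda m: chr(((ord(m.group(1)) - 0xD800) << 10)
                      + ord(m.group(2)) - 0xDC00 + 0x10000),
        s)
```

Strings without surrogates pass through unchanged, so the function is safe to apply to all incoming text before counting its length.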