How to get a reliable unicode character count in Python?
Problem description
Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it, you get '\U0001d10c' (length 1). I'm trying to count the number of unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to '\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode again, but is there a more straightforward/efficient way?
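To illustrate the discrepancy the question describes, here is a minimal sketch in Python 3 syntax (Python 3 strings behave like a UCS-4/wide build of Python 2, so the same two representations can be written directly):

```python
surrogate_form = '\ud834\udd0c'   # the character as a UTF-16 surrogate pair
normalized = '\U0001d10c'         # the same character as a single code point

# The two spellings denote the same character but have different lengths,
# which is exactly the before/after-datastore mismatch described above.
print(len(surrogate_form))  # 2
print(len(normalized))      # 1
```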
I know I can just encode it to UTF-8 and then decode again
Yes, that's the usual idiom to fix up the problem when you have "UTF-16 surrogates in UCS-4 string" input. But as Mechanical snail said, this input is malformed, and you should preferably fix whatever produced it.
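The round-trip idiom can be sketched as follows. Note one caveat: on Python 3 the UTF-8 codec refuses to encode lone surrogates, so a UTF-16 round-trip with errors='surrogatepass' plays the role that encode('utf-8')/decode('utf-8') did on a Python 2 UCS-4 build:

```python
s = '\ud834\udd0c'  # malformed input: a surrogate pair left in a wide string

# Round-trip through UTF-16: the encoder passes the surrogates through
# unchanged, and the decoder recombines the pair into one code point.
fixed = s.encode('utf-16', 'surrogatepass').decode('utf-16')

print(fixed == '\U0001d10c')  # True
print(len(fixed))             # 1
```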
is there a more straightforward/efficient way?
Well... you could do it manually with a regex, like:
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr((ord(m.group(1)) - 0xD800 << 10) + ord(m.group(2)) - 0xDC00 + 0x10000),
    s
)
Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
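For completeness, here is a sketch of the same regex approach ported to Python 3, where unichr is spelled chr (the helper name merge_surrogates is my own, not from the answer):

```python
import re

# Match a high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF).
_SURROGATE_PAIR = re.compile('([\ud800-\udbff])([\udc00-\udfff])')

def merge_surrogates(s):
    # Replace each surrogate pair with the single astral code point it encodes:
    # code point = (high - 0xD800) << 10) + (low - 0xDC00) + 0x10000.
    return _SURROGATE_PAIR.sub(
        lambda m: chr(((ord(m.group(1)) - 0xD800) << 10)
                      + ord(m.group(2)) - 0xDC00 + 0x10000),
        s)
```

Strings without surrogates pass through unchanged, so the function is safe to apply to all incoming text before counting its length.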