How can I check a Python unicode string to see that it *actually* is proper Unicode?
Question
So I have this page:
http://hub.iis.sinica.edu.tw/cytoHubba/
Apparently it's all kinds of messed up: it gets decoded properly, but when I try to save it in postgres I get:
DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf
The database clams up after that and refuses to do anything without a rollback, which will be a bit hard to issue (long story). Is there a way for me to check if this will happen before it hits the database? source.encode("utf-8") works without a hitch, so I'm not sure what's going on...
Recommended answer
There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, this bug is even in OS X's iconv (but not the glibc one).
Here's what's happening:
Python 2.x does not recognize UTF-8-encoded surrogates [1] as invalid, and that is what your byte sequence is: 0xED 0xBD 0xBF encodes the lone low surrogate U+DF7F.
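To see why the bytes from the error message count as a surrogate, the three-byte sequence can be decoded by hand. This is only an illustrative sketch; the variable names are made up for the example and the arithmetic follows the three-byte layout described in [1].

```python
# Manually decode the three-byte UTF-8 sequence from the error message.
raw = bytearray(b'\xed\xbd\xbf')  # bytearray yields ints on Python 2 and 3

# A three-byte sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx,
# so strip the marker bits and join the remaining 4 + 6 + 6 payload bits.
code_point = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)

# RFC 3629 forbids encoding anything in the surrogate range U+D800..U+DFFF.
is_surrogate = 0xD800 <= code_point <= 0xDFFF

print('U+%04X surrogate=%s' % (code_point, is_surrogate))
# prints: U+DF7F surrogate=True
```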
This should be all that's needed:
foo.decode('utf8').encode('utf8')
But thanks to that bug, which they're not fixing, it doesn't catch encoded surrogates.
Try this in Python 2.x and then in 3.x:
b'\xed\xbd\xbf'.decode('utf8')
The latter (correctly) raises an error. They aren't fixing it in the 2.x branch either. See [2] and [3] for more info.
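Since the round trip cannot be relied on in 2.x, one possible workaround is to scan the decoded string for surrogate code points before handing it to the database. The sketch below is an assumption of how that could look, not part of any library: find_surrogates and is_safe_for_utf8 are hypothetical helper names. Note that on a narrow Python 2 build, characters outside the BMP are stored as surrogate pairs, so this check would also flag those legitimate characters there.

```python
def find_surrogates(text):
    """Return (index, code point) for every surrogate found in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]


def is_safe_for_utf8(text):
    """True if text contains no surrogate code points (hypothetical helper)."""
    return not find_surrogates(text)


# U+DF7F is the code point that the bytes 0xED 0xBD 0xBF decode to.
suspect = u'ok \udf7f'
print(find_surrogates(suspect))          # [(3, 57215)]
print(is_safe_for_utf8(u'plain text'))   # True
```

If the check fails, the caller can reject or clean the string before the INSERT, which avoids leaving the transaction in a state that needs a rollback.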
[1] http://tools.ietf.org/html/rfc3629#section-4
[2] http://bugs.python.org/issue9133
[3] http://bugs.python.org/issue8271#msg102209