How can I check a Python unicode string to see that it *actually* is proper Unicode?


Question

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up: it gets decoded without complaint, but when I try to save it in Postgres I get:

DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf

The database clams up after that and refuses to do anything without a rollback, which will be a bit hard to issue (long story). Is there a way for me to check whether this will happen before it hits the database? source.encode("utf-8") works without a hitch, so I'm not sure what's going on...

Recommended answer

There is a bug in Python 2.x that is fixed only in Python 3.x. In fact, the same bug is present in OS X's iconv (but not in glibc's).

Here is what's happening:

Python 2.x does not reject UTF-8-encoded surrogates [1] as invalid, and that is what your byte sequence is: 0xED 0xBD 0xBF is the three-byte encoding of U+DF7F, a surrogate code point, which RFC 3629 forbids in UTF-8.
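To see why those bytes are a surrogate, the three-byte UTF-8 layout (1110xxxx 10yyyyyy 10zzzzzz) can be unpacked by hand. This is only an illustrative sketch; the function names are made up for the example and are not part of any library.

```python
def decode_three_byte_sequence(b0, b1, b2):
    """Unpack the payload bits of a three-byte UTF-8 sequence.

    Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    No validation is done here; this only shows the arithmetic.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point):
    """True if the code point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


# The bytes from the Postgres error message: 0xedbdbf
code_point = decode_three_byte_sequence(0xED, 0xBD, 0xBF)
print(hex(code_point))           # 0xdf7f
print(is_surrogate(code_point))  # True
```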

This should be all that's needed:

foo.decode('utf8').encode('utf8')

But thanks to that bug, which they're not fixing in 2.x, this round trip doesn't catch surrogates.
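Since the round trip cannot be relied on, one possible workaround is to scan the decoded text for surrogate code points before it reaches the database. The sketch below is an assumption on my part rather than anything from the bug reports, and the helper names are hypothetical. It assumes Python 3 or a wide Python 2 build; on a narrow Python 2 build, characters outside the BMP are stored as surrogate pairs, so this simple check would flag them too.

```python
import re

# Any code point in the surrogate range U+D800..U+DFFF.
_SURROGATE_RE = re.compile(u'[\ud800-\udfff]')


def has_surrogates(text):
    """Return True if the text contains any surrogate code point."""
    return _SURROGATE_RE.search(text) is not None


def replace_surrogates(text, replacement=u'\ufffd'):
    """Replace every surrogate code point (default: U+FFFD)."""
    return _SURROGATE_RE.sub(replacement, text)


suspect = u'abc\udf7fdef'
if has_surrogates(suspect):
    suspect = replace_surrogates(suspect)
```

Whether to replace the offending characters or reject the whole page is a judgment call that depends on how much you trust the source.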

Try this in Python 2.x and then in 3.x:

b'\xed\xbd\xbf'.decode('utf8')

It will raise an error (correctly) in the latter. They aren't fixing it in the 2.x branch either. See [2] and [3] for more info.
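If the raw bytes are still available and the code runs on Python 3, the strict decoder can serve as the pre-database check. A minimal sketch, with a hypothetical helper name; on Python 2.x the same function would wrongly accept the bytes above, for the reason just described.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8 (Python 3 behaviour)."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


bad = b'\xed\xbd\xbf'      # encoded surrogate from the error message
good = b'\xe4\xb8\xad'     # U+4E2D, an ordinary three-byte character

print(is_valid_utf8(bad))
print(is_valid_utf8(good))
```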

[1] http://tools.ietf.org/html/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209
