How can I check a Python unicode string to see that it *actually* is proper Unicode?

Views: 136 | Published: 2020/5/29 22:10:45 | Tags: python postgresql unicode


Problem description

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up: it gets decoded properly, but when I try to save it in postgres I get:

DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf

The database clams up after that and refuses to do anything without a rollback, which will be a bit hard to issue (long story). Is there a way for me to check whether this will happen before it hits the database? source.encode("utf-8") works without a hitch, so I'm not sure what's going on...

Recommended answer

There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, this bug is even in OS X's iconv (but not the glibc one).

Here's what's happening:

Python 2.x does not reject UTF-8-encoded surrogates [1] as invalid (which is what your byte sequence is).
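To see why 0xedbdbf from the error message counts as a surrogate, the three bytes can be unpacked by hand. This is only a minimal sketch of the standard three-byte UTF-8 layout; the helper name `decode_three_byte` is made up for illustration.

```python
def decode_three_byte(b0, b1, b2):
    """Unpack a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point.

    Hypothetical helper for illustration; it does no validation of its own.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(0xED, 0xBD, 0xBF)
print(hex(code_point))                 # 0xdf7f
print(0xD800 <= code_point <= 0xDFFF)  # True: inside the surrogate range
```

The result, U+DF7F, falls in the range U+D800 to U+DFFF that RFC 3629 excludes from UTF-8, which matches PostgreSQL's complaint.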

This should be all that's needed:

foo.decode('utf8').encode('utf8')

But thanks to that bug, which they're not fixing, it doesn't catch surrogates.
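Since the round trip can't be relied on, one possible workaround is to scan the already-decoded text for surrogate code points before it goes to the database. This is a rough sketch, not part of the original answer: `contains_surrogates` is a made-up name, and the code is written for Python 3. On a narrow Python 2 build, legitimate characters outside the BMP are stored as surrogate pairs and would be flagged too, so treat it as a starting point only.

```python
import re

# Any code point in the surrogate range U+D800..U+DFFF.
_SURROGATE_RE = re.compile(u'[\ud800-\udfff]')


def contains_surrogates(text):
    """Return True if the decoded text holds any surrogate code point.

    Hypothetical helper; assumes `text` is a Python 3 str.
    """
    return _SURROGATE_RE.search(text) is not None


print(contains_surrogates(u'plain text'))     # False
print(contains_surrogates(u'broken \udf7f'))  # True
```

A string that fails this check could be rejected or cleaned up before the INSERT, so the transaction never reaches the state that needs a rollback.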

Try this in Python 2.x and then in 3.x:

b'\xed\xbd\xbf'.decode('utf8')

It will throw an error (correctly) in the latter. They aren't fixing it in the 2.x branch either. See [2] and [3] for more info.
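On Python 3, where the strict decoder behaves as described, the check can be wrapped in a small helper. This is a sketch under that assumption; `is_valid_utf8` is a hypothetical name and it expects raw bytes, not an already-decoded string.

```python
def is_valid_utf8(data):
    """Return True if `data` (bytes) decodes as strict UTF-8 on Python 3.

    Hypothetical helper; relies on Python 3 rejecting encoded surrogates.
    """
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b'\xed\xbd\xbf'))              # False
print(is_valid_utf8(u'caf\u00e9'.encode('utf-8')))  # True
```

Running the check before the data reaches the database means the bad sequence never gets to PostgreSQL in the first place.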

[1] http://tools.ietf.org/html/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209
