Warning raised by inserting 4-byte unicode to mysql

Views: 116  Published: 2020/5/14 20:34:56  Tags: python mysql regex astral-plane


Question

See the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))

The string '\xF0\x9F\x91\x8A' is actually the 4-byte UTF-8 encoding of a single Unicode character, u'\U0001f44a'. MySQL's character set is utf-8, but inserting a 4-byte character truncates the inserted string. I googled the problem and found that MySQL before 5.5.3 doesn't support 4-byte Unicode, and unfortunately mine is 5.5.224. I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte characters in Python. I tried to use a regular expression but failed. Any help?

Recommended answer

If MySQL cannot handle UTF-8 sequences of 4 bytes or more, then you'll have to filter out all Unicode characters at or above code point \U00010000; UTF-8 encodes code points below that threshold in 3 bytes or fewer.
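The 3-byte boundary can be checked directly. The following is a minimal sketch, assuming Python 3.3+ (or a wide Python 2 build); the sample characters and variable names are chosen for illustration only.

```python
# Encoded UTF-8 length of code points on either side of the U+10000 boundary.
samples = [u'\u0041', u'\u00e9', u'\uffff', u'\U00010000', u'\U0001f62a']
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)

# The bytes quoted in the warning decode to one character above U+FFFF.
warned = b'\xF0\x9F\x91\x8A'.decode('utf-8')
print(ord(warned) >= 0x10000)
```

Everything up to u'\uffff' fits in 3 bytes or fewer, which is why filtering from \U00010000 upward is enough.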

You could use a regular expression for that:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
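As a rough sketch of how this pattern might be applied before the insert in the question, the substitution can be wrapped in a small helper. The names strip_astral, item and params are hypothetical, and the item values are made up for illustration; Python 3.3+ or a wide Python 2 build is assumed.

```python
import re

# Matches every code point outside the Basic Multilingual Plane.
HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')


def strip_astral(text):
    """Return text with all characters above U+FFFF removed."""
    return HIGHPOINTS.sub(u'', text)


# Hypothetical item shaped like the one in the question's pipeline.
item = {
    'topic': u'faces',
    'url': u'http://example.com/',
    'content': u'so tired \U0001f62a zzz',
}

# Clean each field; the tuple could then be passed to cursor.execute(sql, params).
params = tuple(strip_astral(item[key]) for key in ('topic', 'url', 'content'))
print(params)
```

Compiling the pattern once at module level avoids rebuilding it for every item.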

Alternatively, you could use the .translate() method with a mapping table that only contains None values:

>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '

However, creating the translation table will use a lot of memory and take some time to generate; it is probably not worth the effort, as the regular expression approach is more efficient.
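If .translate() is still preferred, one possible way around the memory cost is to skip the precomputed dictionary and pass a small object that answers lookups on demand. This is a different technique from the table above, not part of the original answer, and the class name DropAstral is hypothetical. It relies on .translate() accepting any object with __getitem__ and treating a LookupError as "leave the character unchanged". Python 3 is assumed, and this may well be slower than the regular expression.

```python
class DropAstral(object):
    """Mapping-like object for .translate() that deletes code points >= U+10000."""

    def __getitem__(self, codepoint):
        if codepoint >= 0x10000:
            # None tells .translate() to delete the character.
            return None
        # LookupError tells .translate() to keep the character as it is.
        raise LookupError(codepoint)


example = u'Some example text with a sleepy face: \U0001f62a'
cleaned = example.translate(DropAstral())
print(repr(cleaned))
```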

This all presumes you are using a UCS-4 compiled Python. If your Python was compiled with UCS-2 support (a narrow build), you can only use code points up to '\U0000ffff' in regular expression character classes, so the pattern above will not work as written; on such builds, characters above that range are represented as surrogate pairs.
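For code that might run on either kind of build, one hedged option is to try the full-range pattern first and fall back to matching surrogate pairs if it fails to compile. This is a sketch rather than a tested recipe for every interpreter; on Python 3.3+ the first branch is always taken.

```python
import re

try:
    # Wide (UCS-4) builds and Python 3.3+ accept the full range directly.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow (UCS-2) builds: match a high surrogate followed by a low surrogate.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

example = u'Some example text with a sleepy face: \U0001f62a'
cleaned = highpoints.sub(u'', example)
print(repr(cleaned))
```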

I note that as of MySQL 5.5.3 the newly added utf8mb4 character set does support the full Unicode range.
