How can I filter Emoji characters from my input so I can save in MySQL <5.5?

Question

I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:

/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
  return self.cursor.execute(query, args)
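
As a quick check of the "four-byte" reading of that warning, the bytes it quotes can be decoded and measured directly. This is only a sketch: the variable names are illustrative, and the comparison relies on MySQL's pre-5.5 `utf8` character set storing at most three bytes per character.

```python
# The four bytes quoted in the MySQL warning above.
raw = b'\xF0\x9F\x98\xAD'

# Decoded as UTF-8 they form a single character above U+FFFF
# (on a narrow Python 2 build it would show up as a surrogate pair).
text = raw.decode('utf-8')

# Size of that character when encoded back to UTF-8.
emoji_bytes = len(text.encode('utf-8'))

# Size of the proposed replacement, WHITE MEDIUM SMALL SQUARE (U+25FD).
replacement_bytes = len(u'\u25FD'.encode('utf-8'))
```

Here `emoji_bytes` is 4 and `replacement_bytes` is 3, so the replacement character fits within the three-byte limit while the Emoji does not.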

I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.

My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!

FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.

Thanks for reading.

Recommended answer

So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.

  • Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
  • Warning raised by inserting 4-byte unicode to mysql

Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):

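import re

try:
    # Wide (UCS-4) build: one character class covers every code point above U+FFFF.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow (UCS-2) build: the range above fails to compile, so match surrogate pairs instead.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# Sample input; in practice mytext is any unicode string that may contain 4-byte chars.
mytext = u'\U0001F62D I...'
# Replace each matched character with WHITE MEDIUM SMALL SQUARE (U+25FD).
mytext = highpoints.sub(u'\u25FD', mytext)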

The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
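
To make the "could be anything" point concrete, the same pattern can be wrapped in a small helper that takes the replacement as a parameter. This is a sketch rather than part of the original answer: the name `replace_four_byte_chars` and its signature are made up for illustration.

```python
import re

try:
    # Wide build (and Python 3): match any code point above U+FFFF.
    _HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow Python 2 build: match a high surrogate followed by a low surrogate.
    _HIGHPOINTS = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def replace_four_byte_chars(text, replacement=u'\u25FD'):
    """Return text with every character above U+FFFF replaced.

    Hypothetical helper; `replacement` defaults to U+25FD but may be any
    unicode string, including the empty string to drop the characters.
    """
    return _HIGHPOINTS.sub(replacement, text)


# Example: two Emoji replaced, everything else left alone.
cleaned = replace_four_byte_chars(u'\U0001F62D I love this \U0001F600')
# Example: drop the characters instead of substituting a placeholder.
stripped = replace_four_byte_chars(u'a\U0001F62Db', u'')
```

In a Django app, one place such a helper might be called is on the tweet text just before the model is saved, though where exactly to hook it in depends on the project.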

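For those unfamiliar with UCS, like me: UCS-2 and UCS-4 are ways of storing Unicode code points in two-byte or four-byte units, and a given build of Python uses one or the other internally. A UCS-2 (narrow) build can only hold code points up to U+FFFF in a single character and represents anything higher as a surrogate pair, while a UCS-4 (wide) build holds every code point up to U+10FFFF directly. That is why the code above needs two patterns.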
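
A short way to see which kind of build is in use, for anyone who wants to check. The variable names are illustrative only.

```python
import sys

# sys.maxunicode is 0xFFFF (65535) on a narrow build and
# 0x10FFFF (1114111) on a wide build and on Python 3.3+.
is_wide_build = sys.maxunicode > 0xFFFF

# A character above U+FFFF has length 1 on a wide build and
# length 2 (a surrogate pair) on a narrow build.
emoji_length = len(u'\U0001F62D')
```

The value 1114111 reported in the question corresponds to the wide case, so only the first pattern in the `try` block is ever used there.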

With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.

Hope this helps anyone else in the same situation!
