Remove Unicode emoji using re in Python
Question
I tried to remove the emoji from a Unicode tweet text and print the result in Python 2.7 using:
myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+',re.UNICODE)
print myre.sub('', text)
but almost all the characters are removed from the text. I have checked several answers from other posts; unfortunately, none of them work here. Did I do anything wrong in re.compile()?
Here is an example output in which nearly all the characters were removed:
" ' //./" ! # # # …
Recommended answer
You are not using the correct notation for non-BMP Unicode code points; you want to use \U0001FFFF, with a capital U and 8 hex digits:
myre = re.compile(u'['
u'\U0001F300-\U0001F5FF'
u'\U0001F600-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u26FF\u2700-\u27BF]+',
re.UNICODE)
This can be simplified to:
myre = re.compile(u'['
u'\U0001F300-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u26FF\u2700-\u27BF]+',
re.UNICODE)
as your first two ranges are adjacent.
Your version was specifying (with added spaces for readability):
[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+
That's because the \uxxxx escape sequence always takes exactly 4 hex digits, not 5.
The largest of those ranges is 0-\u1F6F (from the digit 0 through to Ὧ), which covers a very large swathe of the Unicode standard, including every ASCII digit and letter.
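The misparse can be reproduced directly. The following is a minimal sketch that assumes Python 3 (where every str is Unicode and a single non-BMP code point has length 1); the names misparsed, correct, broken_re, and strip_with_broken_pattern are illustrative and not part of the original question.

```python
import re

# Assumption: Python 3. '\u1F300' is read as the 4-digit escape '\u1F30'
# followed by a literal '0', so it is two characters long.
misparsed = '\u1F300'
# The 8-digit form with a capital U is a single code point.
correct = '\U0001F300'

# The pattern from the question, unchanged. Its character class ends up
# containing ranges such as 0-\u1F6F instead of the intended emoji blocks.
broken_re = re.compile(
    '[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+'
)

def strip_with_broken_pattern(text):
    """Apply the broken pattern to show how much ordinary text it removes."""
    return broken_re.sub('', text)

print(len(misparsed), len(correct))
print(repr(strip_with_broken_pattern('Hello, world! 123')))
```

Only characters below the digit 0 in code point order (spaces and some punctuation) survive, which matches the example output shown in the question.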
The corrected expression works, provided you use a wide (UCS-4) Python build:
>>> import re
>>> myre = re.compile(u'['
... u'\U0001F300-\U0001F64F'
... u'\U0001F680-\U0001F6FF'
... u'\u2600-\u26FF\u2700-\u27BF]+',
... re.UNICODE)
>>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: '
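Whether an interpreter is a wide or narrow build can be checked through sys.maxunicode. This is a small sketch; the helper name is_wide_build is hypothetical. On Python 3.3 and later the distinction no longer exists and sys.maxunicode is always 0x10FFFF, so the check only matters on Python 2 and early Python 3 releases.

```python
import sys

def is_wide_build():
    """Return True when str/unicode can hold code points above U+FFFF directly."""
    # Narrow (UCS-2) builds report 0xFFFF; wide (UCS-4) builds report 0x10FFFF.
    return sys.maxunicode > 0xFFFF

print(is_wide_build())
```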
The UCS-2 equivalent is:
myre = re.compile(u'('
u'\ud83c[\udf00-\udfff]|'
u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
u'[\u2600-\u26FF\u2700-\u27BF])+',
re.UNICODE)
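The surrogate values in the UCS-2 pattern follow from the standard UTF-16 encoding of each range boundary. The sketch below shows that arithmetic; the function name to_surrogate_pair is an illustrative helper, not something from the original answer.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Boundaries of the emoji ranges used in the answer.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = to_surrogate_pair(cp)
    print('U+%X -> \\u%04x \\u%04x' % (cp, high, low))
```

U+1F300 to U+1F3FF share the high surrogate \ud83c, while U+1F400 to U+1F6FF share \ud83d, which is why the narrow-build pattern needs two alternatives with different low-surrogate classes.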
You can combine the two in your script with an exception handler:
try:
    # Wide UCS-4 build
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)
Of course, the regex is already out of date, as it doesn't cover emoji defined in newer Unicode releases; it appears to cover emoji defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0).
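The gap is easy to observe. This sketch assumes Python 3 (so the wide-build pattern applies) and reuses the corrected character class from above; stale_re and remove_emoji_stale are illustrative names.

```python
import re

# Assumption: Python 3, so non-BMP code points can appear directly in a class.
stale_re = re.compile(
    '['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]+'
)

def remove_emoji_stale(text):
    """Strip only the emoji covered by the fixed ranges in the answer."""
    return stale_re.sub('', text)

# U+1F62A SLEEPY FACE lies inside the ranges and is removed.
print(ascii(remove_emoji_stale('sleepy \U0001F62A')))
# U+1F91D HANDSHAKE (added in Unicode 9.0) lies outside them and is left in place.
print(ascii(remove_emoji_stale('deal \U0001F91D')))
```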
If you need a more up-to-date regex, take one from a package that actively tries to stay current on emoji, such as the emoji package; it specifically supports generating such a regex:
import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)
At the time of writing, the package is up to date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade when there is a new release.