Remove Unicode emoji using re in Python

Problem description

I tried to remove the emoji from a Unicode tweet text and print the result in Python 2.7 using:

myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+',re.UNICODE)
print myre.sub('', text)

But it seems that almost all the characters are removed from the text. I have checked several answers from other posts; unfortunately, none of them work here. Did I do anything wrong in re.compile()?

Here is an example output in which all the characters were removed:

"   '   //./" ! # # # …


Recommended answer

You are not using the correct notation for non-BMP Unicode code points; you want the \U0001FFFF form, with a capital U and 8 hex digits:

myre = re.compile(u'['
    u'\U0001F300-\U0001F5FF'
    u'\U0001F600-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

This can be reduced to:

myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

as your first two ranges are adjacent.

Your version was specifying (with added spaces for readability):

[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+

That's because the \uxxxx escape sequence always takes only 4 hex digits, not 5.

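The largest of those ranges is 0-\u1F6F (so from the digit 0 through to U+1F6F, the Greek Extended letter Ὧ), which covers a very large swathe of the Unicode standard, including the ASCII digits and letters. That is why almost all of your text was removed.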
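
The parsing behaviour described above can be checked directly. The following is a minimal sketch, assuming Python 3 (where every string literal is Unicode, so the u prefix is not needed); the names literal and broken_class are only illustrative.

```python
import re

# '\u1F300' is read as the 4-digit escape '\u1F30' followed by the digit '0'.
literal = '\u1F300'
print(len(literal))   # 2
print(literal[1])     # 0

# The largest range the faulty pattern really contains: '0' through U+1F6F.
broken_class = re.compile('[0-\u1F6F]+')

# Digits and ASCII letters all sit between '0' and U+1F6F, so only
# characters below '0' (space, '!', '#', ...) survive the substitution.
print(repr(broken_class.sub('', 'Hello world! #tag 2015')))
```

Only the spaces and punctuation are left, which matches the kind of output shown in the question.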

The corrected expression works, provided you use a UCS-4 wide Python executable:

>>> import re
>>> myre = re.compile(u'['
...     u'\U0001F300-\U0001F64F'
...     u'\U0001F680-\U0001F6FF'
...     u'\u2600-\u26FF\u2700-\u27BF]+', 
...     re.UNICODE)
>>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: '
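
For readers on Python 3.3 or later, where the narrow/wide build distinction no longer exists, the same corrected pattern can be wrapped in a small helper. This is a sketch under that assumption, not part of the original answer; the names EMOJI_RE and strip_emoji are hypothetical.

```python
import re

EMOJI_RE = re.compile(
    '['
    '\U0001F300-\U0001F64F'   # symbols, pictographs and emoticons
    '\U0001F680-\U0001F6FF'   # transport and map symbols
    '\u2600-\u26FF'           # miscellaneous symbols
    '\u2700-\u27BF'           # dingbats
    ']+'
)

def strip_emoji(text):
    """Return text with every run of characters in the ranges above removed."""
    return EMOJI_RE.sub('', text)

print(repr(strip_emoji('Some example text with a sleepy face: \U0001f62a')))
```

Text without any character in those ranges is returned unchanged.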

The UCS-2 equivalent is:

myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+', 
    re.UNICODE)
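
The escapes in the UCS-2 pattern are the UTF-16 surrogate pairs for the same code points. As an illustration (the helper name to_surrogates is hypothetical, and the sketch assumes Python 3), the standard UTF-16 arithmetic reproduces them:

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Endpoints of the ranges used in the wide-build pattern.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = to_surrogates(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
```

Code points U+1F300 to U+1F3FF share the high surrogate \ud83c, while U+1F400 onward use \ud83d, which is why the narrow-build pattern needs two separate alternatives.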

You can combine the two into your script with an exception handler:

try:
    # Wide UCS-4 build
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+', 
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+', 
        re.UNICODE)
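
An alternative to catching re.error is to inspect sys.maxunicode, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and on every Python 3.3+ interpreter). This is a sketch of that swapped-in approach rather than the answer's method; it keeps the u prefixes so it should run on both Python 2.7 and Python 3.3+, and is_wide_build is a hypothetical name.

```python
import re
import sys

def is_wide_build():
    """True when a single string element can hold any code point."""
    return sys.maxunicode > 0xFFFF

if is_wide_build():
    # Wide build: non-BMP characters can be used directly in a class.
    myre = re.compile(
        u'[\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
else:
    # Narrow build: match the surrogate pairs explicitly.
    myre = re.compile(
        u'(\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)

print(is_wide_build())
```

The explicit check makes the chosen branch visible, whereas the try/except version relies on the wide pattern failing to compile on a narrow build.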

Of course, the regex is already out of date, as it doesn't cover Emoji defined in newer Unicode releases; it appears to cover Emoji defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0).
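
The gap is easy to see with the handshake character itself. The following is a minimal sketch assuming Python 3; the extended pattern is an illustration only (it adds the Supplemental Symbols and Pictographs block, U+1F900 to U+1F9FF) and is not a complete emoji matcher.

```python
import re

myre = re.compile(
    '[\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]+')

handshake = '\U0001F91D'   # U+1F91D HANDSHAKE, added in Unicode 9.0
sample = 'deal ' + handshake

# The character lies outside every range above, so it is left in place.
print(myre.sub('', sample) == sample)   # True

# Illustration only: one extra block catches this character,
# but other newer emoji would still slip through.
extended = re.compile(
    '[\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\U0001F900-\U0001F9FF'
    '\u2600-\u26FF\u2700-\u27BF]+')
print(repr(extended.sub('', sample)))
```

Maintaining such ranges by hand means revisiting the pattern with every Unicode release.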

If you need a more up-to-date regex, take one from a package, such as emoji, that is actively trying to keep up-to-date on Emoji; it specifically supports generating such a regex:

import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)

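At the time of writing, the package is up to date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade when there is a new release. Note that get_emoji_regexp() was removed in version 2.0 of the package; on those releases, emoji.replace_emoji(text, replace='') does the same job.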
