Remove emoji from multilingual Unicode text

Published: 2022/9/21 18:08:47 Tags: python regex string unicode emoji

Question

I am trying to remove only the emoji from Unicode text. I tried the various methods described in another Stack Overflow post, but none of them removes all emoji/smileys completely. For example:

Solution 1:

def remove_emoji(self, string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

This leaves the 🤝 in place in the following example:

Input: తెలంగాణ రియల్ ఎస్టేట్ 🤝👍
Output: తెలంగాణ రియల్ ఎస్టేట్ 🤝

Another attempt, Solution 2:

def deEmojify(self, inputString):
    returnString = ""
    for character in inputString:
        try:
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            returnString += ''
    return returnString

The result has every non-English character removed:

 Input: 🏣Testరియల్ ఎస్టేట్ A.P&T.S. 🤝🏩🏣👍
 Output: Test  A.P&T.S.

It removes not only all the emoji but also the non-English characters, because of character.encode("ascii"); my non-English input cannot be encoded as ASCII.

Is there a way to correctly remove emoji from international Unicode text?

Recommended answer

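The regex is outdated. It appears to cover only the emoji defined up to Unicode 8.0 (U+1F91D HANDSHAKE was added in Unicode 9.0). The other approach is just a very inefficient way of force-encoding to ASCII, which is rarely what you want when removing only emoji (and which can be done more easily and efficiently with text.encode('ascii', 'ignore').decode('ascii')).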
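To make both points concrete, here is a minimal Python 3 sketch. It rebuilds the character class from Solution 1, checks it against the two emoji from the question, and shows the one-line ASCII filter mentioned above. The names OLD_PATTERN and strip_non_ascii are illustrative only and do not come from the question.

```python
import re

# The ranges from Solution 1, written with explicit \U escapes.
OLD_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)

handshake = "\U0001F91D"  # HANDSHAKE, added in Unicode 9.0
thumbs_up = "\U0001F44D"  # THUMBS UP SIGN

# True only when the whole string is covered by the character class.
old_matches_handshake = OLD_PATTERN.fullmatch(handshake) is not None
old_matches_thumbs_up = OLD_PATTERN.fullmatch(thumbs_up) is not None


def strip_non_ascii(text):
    """Drop every non-ASCII character, which removes far more than emoji."""
    return text.encode("ascii", "ignore").decode("ascii")


ascii_only = strip_non_ascii("Testరియల్ A.P&T.S. \U0001F91D")

print(old_matches_handshake, old_matches_thumbs_up, repr(ascii_only))
```

U+1F91D lies above every range in the class, so it survives, while the ASCII filter discards the Telugu text along with the emoji.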

If you need an up-to-date regex, take it from a package that is actively trying to keep up to date on emoji; it specifically supports generating such a regex:

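import emoji

def remove_emoji(text):
    # get_emoji_regexp() is available in emoji releases before 2.0;
    # later releases offer emoji.replace_emoji(text, replace='') instead.
    return emoji.get_emoji_regexp().sub(u'', text)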

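The package is currently up to date for Unicode 11.0 and has the infrastructure in place to be updated quickly for future releases. All your project has to do is upgrade when a new version comes out.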

Demo using your sample inputs:

>>> print(remove_emoji(u'తెలంగాణ రియల్ ఎస్టేట్ 🤝👍'))
తెలంగాణ రియల్ ఎస్టేట్ 
>>> print(remove_emoji(u'🏣Testరియల్ ఎస్టేట్ A.P&T.S. 🤝🏩🏣👍'))
Testరియల్ ఎస్టేట్ A.P&T.S.

Note that the regex works on Unicode text: on Python 2, make sure you have decoded from str to unicode first; on Python 3, decode from bytes to str first.
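As a small illustration of the decoding step on Python 3, the sketch below assumes the input arrives as UTF-8 bytes. The helper to_text and the single-range pattern are hypothetical stand-ins chosen for the example, not part of the emoji package.

```python
import re


def to_text(data, encoding="utf-8"):
    """Return str unchanged; decode bytes with the given encoding."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Illustrative range only; a real pattern would cover far more code points.
pattern = re.compile("[\U0001F300-\U0001F5FF]+")
raw = "రియల్ \U0001F44D".encode("utf-8")

try:
    pattern.sub("", raw)  # a str pattern cannot be applied to bytes
    bytes_error = None
except TypeError as exc:
    bytes_error = exc

cleaned = pattern.sub("", to_text(raw))
print(repr(cleaned), type(bytes_error).__name__)
```

Decoding once at the boundary keeps every later step working on str, which is what a text regex expects.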

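Emoji are complex beasts these days. The above removes complete, valid emoji. If your text contains "incomplete" emoji components such as skin-tone codepoints (which are only meant to be combined with specific emoji), removing them is much harder. The skin-tone codepoints themselves are easy (just remove those 5 codepoints afterwards), but there is a whole host of combinations built from otherwise innocent characters such as ♀ U+2640 FEMALE SIGN or ♂ U+2642 MALE SIGN, together with variant selectors and U+200D ZERO-WIDTH JOINER. These characters have specific meanings in other contexts too, so you can't simply regex them out, unless you don't mind breaking text that uses Devanagari, Kannada or CJK ideographs, to name just a few examples.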
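The risk with U+200D can be seen in a short sketch. It compares an emoji ZWJ sequence with a Devanagari sequence in which, as far as the Unicode conventions for Indic scripts go, a joiner after the virama asks for a particular rendering of the conjunct. The sample strings and the helper strip_zwj are assumptions made for this illustration.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Emoji sequence: MAN, ZWJ, WOMAN, ZWJ, GIRL (a "family" emoji).
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

# Devanagari: KA, VIRAMA, ZWJ, SSA. Here the joiner affects how the
# conjunct is displayed, so it carries meaning outside of emoji.
devanagari = "\u0915\u094D" + ZWJ + "\u0937"


def strip_zwj(text):
    """Naively delete every zero-width joiner."""
    return text.replace(ZWJ, "")


for label, sample in (("emoji sequence", family), ("Devanagari", devanagari)):
    names = [unicodedata.name(ch) for ch in sample]
    changed = strip_zwj(sample) != sample
    print(label, names, "changed by stripping:", changed)
```

Both strings are altered by the naive removal, which is why the joiner cannot be treated as an emoji-only character.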

That said, the following Unicode 11.0 codepoints are probably safe to remove (based on filtering the Emoji_Component designation in emoji-data):

20E3          ;  (⃣)     combining enclosing keycap
FE0F          ; ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; (🇦..🇿)  regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; (🏻..🏿)  light skin tone..dark skin tone
1F9B0..1F9B3  ; (🦰..🦳) red-haired..white-haired
E0020..E007F  ; (󠀠..󠁿)      tag space..cancel tag

These can be removed by creating a new regex to match them:

import re
try:
    uchr = unichr  # Python 2
    import sys
    if sys.maxunicode == 0xffff:
        # narrow build, define alternative unichr encoding to surrogate pairs
        # as unichr(sys.maxunicode + 1) fails.
        def uchr(codepoint):
            return (
                unichr(codepoint) if codepoint <= sys.maxunicode else
                unichr(codepoint - 0x010000 >> 10 | 0xD800) +
                unichr(codepoint & 0x3FF | 0xDC00)
            )
except NameError:
    uchr = chr  # Python 3

# Unicode 11.0 Emoji Component map (deemed safe to remove)
_removable_emoji_components = (
    (0x20E3, 0xFE0F),             # combining enclosing keycap, VARIATION SELECTOR-16
    range(0x1F1E6, 0x1F1FF + 1),  # regional indicator symbol letter a..regional indicator symbol letter z
    range(0x1F3FB, 0x1F3FF + 1),  # light skin tone..dark skin tone
    range(0x1F9B0, 0x1F9B3 + 1),  # red-haired..white-haired
    range(0xE0020, 0xE007F + 1),  # tag space..cancel tag
)
emoji_components = re.compile(u'({})'.format(u'|'.join([
    re.escape(uchr(c)) for r in _removable_emoji_components for c in r])),
    flags=re.UNICODE)

Then update the remove_emoji() function above to use it:

def remove_emoji(text, remove_components=False):
    cleaned = emoji.get_emoji_regexp().sub(u'', text)
    if remove_components:
        cleaned = emoji_components.sub(u'', cleaned)
    return cleaned
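If only Python 3 needs to be supported, the component pattern can probably be written more compactly as a single character class, since chr() and \U escapes handle astral code points directly. The following is a sketch under that assumption; it covers the same code points as _removable_emoji_components above, and the names COMPONENTS and strip_components are illustrative.

```python
import re

# Python 3 only: same code points as the Emoji_Component list above,
# expressed as one character class instead of a long alternation.
COMPONENTS = re.compile(
    "["
    "\u20E3"                 # combining enclosing keycap
    "\uFE0F"                 # VARIATION SELECTOR-16
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbol letters a..z
    "\U0001F3FB-\U0001F3FF"  # light skin tone..dark skin tone
    "\U0001F9B0-\U0001F9B3"  # red-haired..white-haired
    "\U000E0020-\U000E007F"  # tag space..cancel tag
    "]+"
)


def strip_components(text):
    """Remove leftover emoji component code points from text."""
    return COMPONENTS.sub("", text)


# A stray skin-tone modifier and variation selector left behind in text.
leftover = "ok \U0001F3FD\uFE0F done"
cleaned = strip_components(leftover)
print(repr(cleaned))
```

Non-emoji text such as the Telugu samples from the question passes through unchanged, because none of its code points fall in these ranges.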
