How to find and count emoticons in a string using Python?


Problem description

This topic has been addressed for text based emoticons at link1, link2, link3. However, I would like to do something slightly different than matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following unicode information contains just such emoticons: pdf.

Using a string of English words that also contains any of the emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.

The direction that I was heading down doesn't seem to be the best option and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:

$cat <file containing the strings with emoticons> | ./emo.py

emo.py pseudo script:

import re
import sys

for row in sys.stdin:
    print row.decode('utf-8').encode("ascii","replace")
    #insert regex to find the emoticons
    if match:
       #do some counting using .split(" ")
       #print the counting

The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:

"Smiley emoticon rocks! I like you."

The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.

Solution

First, there is no need to encode here at all. You've got a Unicode string, and the re engine can handle Unicode, so just use it.

A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U escape sequences. So:

import re

s=u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(u'[\U0001f600-\U0001f650]', s))

Or, if the string is big enough that building up the whole findall list seems wasteful:

emoticons = re.finditer(u'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)

Counting words, you can do separately:

wordcount = len(s.split())

If you want to do it all at once, you can use an alternation group:

word_and_emoticon_count = len(re.findall(u'\\w+|[\U0001f600-\U0001f650]', s))
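If you want the two counts separately, so you can compare them as the question asks, here is a minimal Python 3 sketch. The function name count_words_and_emoticons and the two compiled patterns are made up for illustration; the character range is the same one used above, and it assumes a build where non-BMP characters are single code points (any Python 3.3+).

```python
import re

# Same range as above. The emoticon pattern is a plain (non-raw) string so
# Python expands the \U escapes before the re module sees them.
EMOTICON_RE = re.compile('[\U0001f600-\U0001f650]')
WORD_RE = re.compile(r'\w+')


def count_words_and_emoticons(text):
    """Return (word_count, emoticon_count) for a Unicode string."""
    emoticon_count = sum(1 for _ in EMOTICON_RE.finditer(text))
    # \w+ does not match the emoticons, so words glued to an emoticon
    # ("rocks!" + emoticon) are still counted once each.
    word_count = len(WORD_RE.findall(text))
    return word_count, emoticon_count


s = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
words, emoticons = count_words_and_emoticons(s)
print(words, emoticons)
```

Counting words with \w+ instead of s.split() is a design choice here: it ignores punctuation-only tokens and does not depend on whitespace separating an emoticon from its neighbouring word.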


As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds; most CPython Windows builds, for example, are narrow. In narrow builds, characters can only be in the range U+0000 to U+FFFF. There's no way to search for characters beyond that range, but that's OK, because they don't exist to search for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.

Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:

(lead == low_lead and lead != high_lead and low_trail <= trail <= DFFF or
 lead == high_lead and lead != low_lead and DC00 <= trail <= high_trail or
 low_lead < lead < high_lead and DC00 <= trail <= DFFF)

You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
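To make the conversion step concrete, here is a rough sketch of the arithmetic and of the condition above as a Python function. The names to_surrogate_pair and pair_in_range are hypothetical, not from any library. The sketch also adds a branch for the case where both ends of the range share a lead surrogate, which the three-way condition above does not cover on its own.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (lead, trail) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    lead = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    trail = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return lead, trail


def pair_in_range(lead, trail, low, high):
    """Check whether a (lead, trail) pair encodes a code point in [low, high]."""
    low_lead, low_trail = to_surrogate_pair(low)
    high_lead, high_trail = to_surrogate_pair(high)
    if not 0xDC00 <= trail <= 0xDFFF:
        return False  # reject bogus UTF-16: trail is not a trail surrogate
    if low_lead == high_lead:
        # Small range: both ends share the same lead surrogate.
        return lead == low_lead and low_trail <= trail <= high_trail
    return (lead == low_lead and low_trail <= trail
            or lead == high_lead and trail <= high_trail
            or low_lead < lead < high_lead)


print([hex(unit) for unit in to_surrogate_pair(0x1F600)])
```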

If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf] in UTF-16-BE:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])

Of course if your range is small enough that low_lead == high_lead this gets simpler. For example, the original question's range can be searched with:

\ud83d[\ude00-\ude50]
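The patterns above can also be generated rather than written by hand. The following is only a sketch for Python 3.3+, where the re module accepts \uXXXX escapes in patterns. The helpers surrogate_range_pattern and utf16_code_units are hypothetical names; utf16_code_units simulates what a narrow build would see by turning each UTF-16 code unit into its own character.

```python
import re


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (lead, trail) UTF-16 code units."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def surrogate_range_pattern(low, high):
    """Build regex source matching surrogate pairs for code points in [low, high]."""
    low_lead, low_trail = to_surrogate_pair(low)
    high_lead, high_trail = to_surrogate_pair(high)

    def u(unit):
        return '\\u%04x' % unit  # literal backslash-u escape, parsed by re

    if low_lead == high_lead:
        return '%s[%s-%s]' % (u(low_lead), u(low_trail), u(high_trail))
    parts = ['%s[%s-\\udfff]' % (u(low_lead), u(low_trail))]
    if high_lead - low_lead > 1:
        # Leads strictly between the two ends accept any trail surrogate.
        parts.append('[%s-%s][\\udc00-\\udfff]'
                     % (u(low_lead + 1), u(high_lead - 1)))
    parts.append('%s[\\udc00-%s]' % (u(high_lead), u(high_trail)))
    return '|'.join('(%s)' % part for part in parts)


def utf16_code_units(text):
    """Re-express text as one str character per UTF-16 code unit."""
    data = text.encode('utf-16-be')
    return ''.join(chr(int.from_bytes(data[i:i + 2], 'big'))
                   for i in range(0, len(data), 2))


s = "Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
pattern = surrogate_range_pattern(0x1F600, 0x1F650)
print(pattern)
count = len(re.findall(pattern, utf16_code_units(s)))
print(count)
```

Note that the generated middle alternative requires a real trail surrogate instead of the looser "." used in the hand-written example, so it does not accept bogus UTF-16.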

One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:

(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
