Python - Fast count of words in a text from a list of strings, including words that start with a given prefix

Question

I know that similar questions have been asked several times, but my problem is a bit different, and I am looking for a time-efficient solution in Python.

I have a set of words, some of which end with "*" and some of which don't:

words = set(["apple", "cat*", "dog"])

I have to count their total occurrences in a text, where an asterisk stands for any continuation ("cat*" matches every word that starts with "cat"). The search has to be case-insensitive. Consider this example:

text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

I would like to get a final score of 4 (= cat* x 2 + dog + apple). Note that "cat*" has been counted twice, since it also matches the plural "CATS", whereas "apple" has been counted just once, because its plural "apples" is not considered (there is no asterisk at the end).

I have to repeat this operation on a large set of documents, so I need a fast solution. I don't know whether regex or flashtext would be fast enough. Could you help me?

Edit

I forgot to mention that some of my words contain punctuation, for example:

words = set(["apple", "cat*", "dog", ":)", "I've"])

This seems to create additional problems when compiling the regex. Is there some addition to the code you already provided that would work for these two additional words?

Recommended answer

You can do this with regex: build a single pattern out of the set of words, putting word boundaries around each word but leaving the trailing word boundary off words that end with *. Compiling the regex once and reusing it for every document should help performance:

import re

words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

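# One alternation for the whole set: every word gets a leading \b;
# only words without a trailing '*' also get a closing \b.
regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)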
matches = regex.findall(text)
print(len(matches))

Output:

4
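
The edit in the question adds words such as ":)" and "I've", which the pattern above does not cover. Two things get in the way: characters like ")" are regex metacharacters, and \b only matches next to a word character, so it cannot sit in front of ":" when a space precedes it. What follows is a minimal sketch of one possible adjustment, not part of the original answer: escape each word with re.escape, and replace \b with lookarounds that are applied only on the sides where the word itself begins or ends with a word character. The helper names word_to_regex and build_pattern and the two sample documents are hypothetical, and the sketch assumes a non-empty set with no bare "*" entry.

```python
import re

def word_to_regex(word):
    """Translate one entry of the word set into a regex fragment (hypothetical helper)."""
    is_prefix = word.endswith("*")
    stem = word[:-1] if is_prefix else word
    # Guard the left edge only if the stem starts with a word character;
    # a \b in front of ":" would fail whenever a space precedes it.
    left = r"(?<!\w)" if re.match(r"\w", stem) else ""
    # Guard the right edge only for non-prefix words ending with a word character.
    right = r"(?!\w)" if not is_prefix and re.search(r"\w\Z", stem) else ""
    return left + re.escape(stem) + right

def build_pattern(words):
    """Compile one case-insensitive alternation for the whole set (hypothetical helper)."""
    # Longest entries first, so a longer entry is tried before a shorter one.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("|".join(word_to_regex(w) for w in ordered), re.IGNORECASE)

words = {"apple", "cat*", "dog", ":)", "I've"}
pattern = build_pattern(words)  # compile once, reuse for every document

# Made-up documents, for illustration only.
documents = [
    "My cat loves apples :) but I've never eaten an apple.",
    "My dog loves them less than my CATS",
]
counts = [len(pattern.findall(doc)) for doc in documents]
print(counts, sum(counts))  # [4, 2] 6
```

With the original three-word set and the original text, the same helper still gives 4. Whether this is faster than other approaches on a large corpus would have to be measured on the real data.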
