Python - Fast count of words in a text from a list of strings, including words that start with a given prefix
Question
I know that similar questions have been asked several times, but my problem is a bit different, and I am looking for a time-efficient solution in Python.
I have a set of words; some of them end with "*" and others don't:
words = set(["apple", "cat*", "dog"])
I have to count their total occurrences in a text, where anything can follow an asterisk ("cat*" means all the words that start with "cat"). The search has to be case-insensitive. Consider this example:
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
I would like to get a final score of 4 (= cat* x 2 + dog + apple). Note that "cat*" has been counted twice, because its plural also matches, whereas "apple" has been counted just once, because its plural is not considered (it has no asterisk at the end).
I have to repeat this operation on a large set of documents, so I need a fast solution. I don't know whether regex or flashtext would be fast enough. Could you help me?
Edit
I forgot to mention that some of my words contain punctuation, for example:
words = set(["apple", "cat*", "dog", ":)", "I've"])
This seems to create additional problems when compiling the regex. Is there a modification to the code you already provided that would also handle these two additional words?
Recommended answer
You can do this with a regex: build one pattern out of the set of words, putting word boundaries (\b) around each word but leaving the trailing word boundary off words that end with *. Compiling the regex should help performance:
import re
words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)
matches = regex.findall(text)
print(len(matches))
Output:
4
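The pattern above does not cover the words from the edit: ":)" contains a regex metacharacter, and \b only fires between a word character and a non-word character, so it does not behave as intended next to punctuation. A possible variation, offered as a sketch rather than as the answerer's own code, escapes each word with re.escape and swaps \b for the lookarounds (?<!\w) and (?!\w). The helper names compile_word_pattern and count_matches, and the sample sentence, are made up for this illustration.

```python
import re

def compile_word_pattern(words):
    """Build one case-insensitive regex; a trailing '*' marks a prefix match."""
    alternatives = []
    # Longest entries first, so a longer entry is tried before a shorter one
    # that could match at the same position.
    for word in sorted(words, key=len, reverse=True):
        is_prefix = word.endswith("*")
        # Escape metacharacters such as ')' so they are matched literally.
        core = re.escape(word[:-1] if is_prefix else word)
        # (?<!\w) and (?!\w) stand in for \b: "no word character on this side".
        # Prefix entries omit the trailing check so "cat*" also matches "CATS".
        alternatives.append(r"(?<!\w)" + core + ("" if is_prefix else r"(?!\w)"))
    return re.compile("|".join(alternatives), re.IGNORECASE)

def count_matches(pattern, text):
    """Count non-overlapping matches without building a list."""
    return sum(1 for _ in pattern.finditer(text))

words = {"apple", "cat*", "dog", ":)", "I've"}
text = "My cat loves apples, but I've never eaten an apple :) My dog loves them less than my CATS"

pattern = compile_word_pattern(words)  # compile once, reuse for every document
print(count_matches(pattern, text))    # prints 6
```

The six matches are cat, I've, apple, :), dog and CATS; "apples" is still skipped because "apple" has no asterisk. For a large set of documents, the compiled pattern can be built once and passed to count_matches for each text.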