Python: how to determine if a list of words exists in a string

Question

Given a list ["one", "two", "three"], how can I determine whether each word exists in a specified string?

The word list is pretty short (fewer than 20 words in my case), but the number of strings to be searched is huge (400,000 strings per run).

My current implementation uses re to look for matches, but I'm not sure it's the best way.

import re
word_list = ["one", "two", "three"]
# Raw string so that \W reaches the regex engine as written
regex_string = r"(?<=\W)(%s)(?=\W)" % "|".join(word_list)

finder = re.compile(regex_string)
string_to_be_searched = "one two three"

results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
    if word in result_set:
        print("%s in string" % word)

Problems with my solution:

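  1. It searches to the end of the string even though the words may all appear in the first half of the string.
  2. To overcome the limitation of the lookahead assertion (I don't know how to express "the character before the current match should be a non-word character, or the start of the string"), I added an extra space before and after the string to be searched.
  3. Are there other performance issues introduced by the lookahead assertion?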
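
As a possible way around problem 2, here is a minimal sketch that uses the \b word-boundary assertion instead of \W lookarounds. \b also matches at the start and end of the string, so no padding with spaces is needed. The helper name find_words is hypothetical, and wrapping each word in re.escape is an added precaution, not part of the original code. Note that findall still scans the whole string, so this does not address problem 1.

```python
import re

word_list = ["one", "two", "three"]

# \b matches at the start or end of the string as well as next to a
# non-word character, so the input does not need padding with spaces.
# re.escape guards against regex metacharacters inside the words.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, word_list)))

def find_words(text):
    """Return the set of words from word_list that occur in text as whole words."""
    return set(pattern.findall(text))

print(sorted(find_words("one two threesome")))  # ['one', 'two']
```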

Possible simpler implementations:

  1. Just loop through the word list and test if word in string_to_be_searched. But this cannot deal with "threesome" if you are looking for "three".
  2. Use one regular expression search per word. I'm still not sure about the performance, or about the cost of searching the string multiple times.
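
The two options above can be sketched side by side. This is only an illustration under assumed names (substring_matches and per_word_regex_matches are hypothetical helpers), showing the "threesome" false positive of option 1 and the one-scan-per-word shape of option 2. re.search stops at the first match for each word, so it does not necessarily read to the end of the string.

```python
import re

word_list = ["one", "two", "three"]

def substring_matches(words, text):
    """Option 1: plain substring test; also matches inside longer words."""
    return [w for w in words if w in text]

def per_word_regex_matches(words, text):
    """Option 2: one whole-word regex search per word (up to one scan per word)."""
    return [w for w in words if re.search(r"\b%s\b" % re.escape(w), text)]

text = "a threesome of one"
print(substring_matches(word_list, text))       # ['one', 'three']
print(per_word_regex_matches(word_list, text))  # ['one']
```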

Update:

I've accepted Aaron Hall's answer (https://stackoverflow.com/a/21718896/683321) because, according to Peter Gibson's benchmark (https://stackoverflow.com/a/21742190/683321), this simple version has the best performance. If you are interested in this problem, read all the answers for a fuller picture.

Actually, I forgot to mention another constraint in my original problem. A word can be a phrase, for example: word_list = ["one day", "second day"]. Maybe I should ask another question.
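
For the phrase constraint, splitting the string on whitespace no longer works, because a phrase spans several tokens. One possible approach, sketched here with a hypothetical helper named phrases_in_string, is a whole-phrase regex search per entry. It assumes the words of a phrase are separated in the text by exactly the same whitespace as in the list.

```python
import re

def phrases_in_string(phrase_list, a_string):
    """Return the set of phrases that occur in a_string on word boundaries."""
    found = set()
    for phrase in phrase_list:
        # re.escape keeps spaces and any punctuation in the phrase literal
        if re.search(r"\b%s\b" % re.escape(phrase), a_string):
            found.add(phrase)
    return found

phrases = ["one day", "second day"]
print(sorted(phrases_in_string(phrases, "on the second day it rained")))
# ['second day']
```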

Recommended answer

Peter Gibson (below) found this function to be the most performant of the answers here. It suits datasets that fit in memory, because it builds a list of all the words in the string to be searched and then intersects that list with the set of target words:

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

Usage:

my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
    print('One or more words found!')

This prints One or more words found! to stdout.

It also returns the actual words found:

for word in words_in_string(my_word_list, a_string):
    print(word)

This prints the following (sets are unordered, so the order may differ):

three
two
one
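
One caveat with splitting on whitespace is that punctuation stays attached to the tokens, so "one," does not equal "one". A hedged variation, not part of the original answer and not covered by the benchmark, tokenizes on runs of word characters instead. The name words_in_string_tokenized is hypothetical.

```python
import re

def words_in_string_tokenized(word_list, a_string):
    """Like words_in_string, but tokenizes on runs of word characters."""
    return set(word_list).intersection(re.findall(r"\w+", a_string))

words = ["one", "two", "three"]
print(sorted(words_in_string_tokenized(words, "one, two. Threesome!")))
# ['one', 'two']
```

With the plain split() version, the same input yields an empty set, because the tokens are "one,", "two." and "Threesome!".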

For data so large that you can't hold it in memory, the solution given in this answer would be very performant.
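
As a rough illustration of the out-of-memory case, the text can be processed one line at a time so that only a single line is ever split into words. This is a sketch under assumptions and not necessarily the approach of the answer referred to above. The name words_in_stream is hypothetical, and io.StringIO stands in for a large file opened with open().

```python
import io

def words_in_stream(word_list, lines):
    """Return the words from word_list seen in an iterable of text lines."""
    remaining = set(word_list)
    found = set()
    for line in lines:
        hits = remaining.intersection(line.split())
        found |= hits
        remaining -= hits
        if not remaining:  # every word already seen; stop reading early
            break
    return found

# StringIO stands in for a file object; both yield one line per iteration.
stream = io.StringIO("one fish\ntwo fish\nred fish\n")
print(sorted(words_in_stream(["one", "two", "three"], stream)))  # ['one', 'two']
```

Stopping as soon as every word has been seen also avoids reading to the end of the input when the words appear early.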
