Search for any word or combination of words from one string in a list (Python)


Problem description

I have a string (for example: "alpha beta charlie, delta&epsilon foxtrot") and a list (for example ["zero","omega virginia","apple beta charlie"]). Is there a convenient way to iterate through every word and combination of words in the string in order to search for each of them in the list?

Recommended answer

Purpose

You say combinations, but combinations are semantically unordered. What you actually want is the intersection of a target list with all ordered permutations of the words, each joined by spaces.

To begin with, we need to import the libraries we intend to use.

import re
import itertools

Splitting the string

Don't split on characters: you're doing a semantic search for words, exclusive of punctuation and other odd characters. Regular expressions, provided by the re module, are well suited to this. In a raw Python string, r'', we use the pattern for a word boundary, \b, around one or more (+) word characters (\w: letters, digits, and the underscore).

re.findall returns a list of every match.

re_pattern = r'\b\w+\b'
silly_string = 'alpha beta charlie, delta&epsilon foxtrot'
words = re.findall(re_pattern, silly_string)

Here, words is our word list:

>>> print(words)
['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
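
To see why a regular expression is preferable to plain whitespace splitting here, the following minimal sketch compares the two on the same string. The name naive_words is illustrative only and is not part of the original answer.

```python
import re

silly_string = 'alpha beta charlie, delta&epsilon foxtrot'

# Whitespace splitting keeps punctuation attached to the words
# and does not separate 'delta&epsilon'.
naive_words = silly_string.split()

# \b\w+\b extracts only runs of word characters (letters, digits, underscore).
words = re.findall(r'\b\w+\b', silly_string)

print(naive_words)  # ['alpha', 'beta', 'charlie,', 'delta&epsilon', 'foxtrot']
print(words)        # ['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
```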

Creating the Permutations

Continuing, we prefer to manipulate our data with generators, which avoids materializing data before we need it and holding large datasets in memory. The itertools library has functions that suit our needs: itertools.permutations yields every ordered arrangement of the above words at a given length, and itertools.chain links the results for all lengths into a single iterable:

_gen = (itertools.permutations(words, i + 1) for i in range(len(words)))
all_permutations_gen = itertools.chain.from_iterable(_gen)

Materializing all_permutations_gen with list(all_permutations_gen) would give us:

[('alpha',), ('beta',), ('charlie',), ('delta',), ('epsilon',), ('foxtrot',), ('alpha', 'beta'), ('alpha', 'charlie'), ('alpha', 'delta'), ('alpha', 'epsilon'), ('alpha', 'foxtrot'), ('beta', 'alpha'), ('beta', 'charlie'), ('beta', 'delta'), ('beta', 'epsilon'), ('beta', 'foxtrot'), ('charlie', 'alpha'), ('charlie', 'beta'), ('charlie', 'delta'), ('charlie', 'epsilon'), ('charlie', 'foxtrot'), ('delta', 'alpha'), ('delta', 'beta'), ('delta', 'charlie'), ('delta', 'epsilon'), ('delta', 'foxtrot'), ('epsilon', 'alpha'), ('epsilon', 'beta'), ('epsilon', 'charlie'), ('epsilon', 'delta'), ('epsilon', 'foxtrot'), ('foxtrot', 'alpha'), ('foxtrot', 'beta'), ('foxtrot', 'charlie'), ('foxtrot', 'delta'), ('foxtrot', 'epsilon'), ('alpha', 'beta', 'charlie'), ('alpha', 'beta', 'delta'), ...
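
The listing is truncated for a reason: the number of permutations grows factorially with the number of words. As a rough sketch (the names n, expected, and actual are illustrative), the following counts the permutations for the six example words, once with the closed-form n! / (n - k)! summed over every length k, and once by iterating lazily without building a list.

```python
import itertools
import math

words = ['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot']
n = len(words)

# Ordered selections of k positions out of n: n! / (n - k)!
expected = sum(math.factorial(n) // math.factorial(n - k) for k in range(1, n + 1))

# Count lazily, without materializing every tuple in memory at once.
_gen = (itertools.permutations(words, k) for k in range(1, n + 1))
actual = sum(1 for _ in itertools.chain.from_iterable(_gen))

print(expected, actual)  # 1956 1956
```

With only six words this is cheap, but each extra word multiplies the work, so this approach is probably best kept to short strings.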

If we joined each permutation with spaces and materialized the result in a list instead of a set, printing the first 20 items would show us:

>>> print(all_permutations[:20])  # this only works if you build a list instead of a set
['alpha', 'beta', 'charlie', 'delta', 'epsilon', 'foxtrot', 'alpha beta', 'alpha charlie', 'alpha delta', 'alpha epsilon', 'alpha foxtrot', 'beta alpha', 'beta charlie', 'beta delta', 'beta epsilon', 'beta foxtrot', 'charlie alpha', 'charlie beta', 'charlie delta', 'charlie epsilon']

But that would exhaust the generator before we're ready. So instead, we now build the set of all permutations of those words, each joined by spaces:

all_permutations = set(' '.join(i) for i in all_permutations_gen)
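
The exhaustion mentioned above can be seen in a small sketch. It uses a shorter, three-word list to keep the output small; the names first_pass and second_pass are illustrative only.

```python
import itertools

words = ['alpha', 'beta', 'charlie']

gen = itertools.chain.from_iterable(
    itertools.permutations(words, k) for k in range(1, len(words) + 1)
)

first_pass = [' '.join(p) for p in gen]   # consumes the generator
second_pass = [' '.join(p) for p in gen]  # nothing is left to yield

print(len(first_pass))   # 15
print(len(second_pass))  # 0
```

This is why the generator should be consumed exactly once, into whichever container you actually need.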

Checking for Membership of any Permutations in Target List

With this set, we can now search for an intersection with the target list:

>>> target_list = ["zero","omega virginia","apple beta charlie"]
>>> all_permutations.intersection(target_list)
set()

For the examples given, we get the empty set. But if the target list contains a string that is in our set of permutations:

>>> target_list_2 = ["apple beta charlie", "foxtrot alpha beta charlie"]
>>> all_permutations.intersection(target_list_2)
{'foxtrot alpha beta charlie'}
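
Putting the steps together, here is a minimal sketch that wraps the approach in a function, find_matches. It also includes a hypothetical alternative, find_matches_by_counting, which is not part of the original answer: instead of generating permutations, it checks each target directly by counting words. Both function names are assumptions made for illustration.

```python
import itertools
import re
from collections import Counter


def find_matches(text, targets):
    """Approach described above: build every space-joined permutation, then intersect."""
    words = re.findall(r'\b\w+\b', text)
    gen = itertools.chain.from_iterable(
        itertools.permutations(words, k) for k in range(1, len(words) + 1)
    )
    return {' '.join(p) for p in gen}.intersection(targets)


def find_matches_by_counting(text, targets):
    """Hypothetical alternative: test each target directly, without permutations."""
    available = Counter(re.findall(r'\b\w+\b', text))
    matches = set()
    for target in targets:
        tokens = target.split(' ')
        # An empty token means an empty target or a leading, trailing, or doubled
        # space, none of which a space-joined permutation can produce.
        if '' in tokens:
            continue
        # Every token must occur in the text at least as often as in the target.
        needed = Counter(tokens)
        if all(available[token] >= count for token, count in needed.items()):
            matches.add(target)
    return matches


text = 'alpha beta charlie, delta&epsilon foxtrot'
targets = ["apple beta charlie", "foxtrot alpha beta charlie"]
print(find_matches(text, targets))              # {'foxtrot alpha beta charlie'}
print(find_matches_by_counting(text, targets))  # {'foxtrot alpha beta charlie'}
```

The counting variant does work proportional to the size of the target list rather than to the number of permutations, so it may be the more practical choice when the string contains more than a handful of words.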
