Searching a list based on values in another list
Problem description
I have a list of names which I'm trying to pull out of a list of strings. I keep getting false positives such as partial matches. The other caveat is that I'd like it to also grab a last name where applicable.
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly','Christmas is here', 'CHRIS']
desired_output = ['Chris Smith', 'Kimberly', 'CHRIS']
I've tried this code:
[i for e in names for i in target if i.startswith(e)]
This predictably returns Chris Smith, Christmas is here, and Kimberly.
How would I best approach this? Should I use regex, or can it be done with list comprehensions? Performance may be an issue, as the real names list is ~880,000 names long.
(Python 2.7)
EDIT: I've realized that my criteria in this example are unrealistic, given the impossible request of wanting to include Kimberly while excluding Christmas is here. To mitigate this issue, I've found a more complete names list which includes variations (both Kim and Kimberly are included).
Complete guess (again), since I can't see how you can avoid matching 'Christmas is here' given any reasonable criteria:
This'll match any targets that have any word that starts with a word from names...
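import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']
# Keep a target if any word in it starts with one of the names (case-insensitive).
# re.escape guards against names that contain regex metacharacters.
matches = [targ for targ in target if any(re.search(r'\b{}'.format(re.escape(name)), targ, re.I) for name in names)]
print(matches)
# ['Chris Smith', 'Kimberly', 'Christmas is here', 'CHRIS']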
If you change the pattern to r'\b{}\b', you'll get ['Chris Smith', 'CHRIS'], so you lose the Kim match (Kimberly).
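Given the edit to the question (a fuller names list that contains both Kim and Kimberly), whole-word matching may be enough, and with ~880,000 names a set lookup per word is likely to be much cheaper than running one regex per name. The following is only a sketch under those assumptions: the extra 'Kimberly' entry, the word pattern, and the helper name has_known_name are illustrative and not part of the original question or answer.

```python
import re

# Assumed: the fuller names list spells out variants such as 'Kimberly'.
names = ['Chris', 'Jack', 'Kim', 'Kimberly']
target = ['Chris Smith', 'I hijacked this thread', 'Kimberly', 'Christmas is here', 'CHRIS']

# Lowercase the names once so every lookup is a constant-time set membership test.
name_set = set(name.lower() for name in names)

# Assumed tokenization: a word is a run of ASCII letters and apostrophes.
word_re = re.compile(r"[A-Za-z']+")

def has_known_name(text):
    """Return True if any whole word in text is a known name (case-insensitive)."""
    return any(word.lower() in name_set for word in word_re.findall(text))

matches = [targ for targ in target if has_known_name(targ)]
print(matches)
# ['Chris Smith', 'Kimberly', 'CHRIS']
```

This keeps the whole target string, so a last name comes along for free, and the cost per target depends on the number of words in it rather than on the length of the names list. Names with accents or hyphens would need a different word pattern.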
...