Finding multiple substrings in a string without iterating over it multiple times
Question
I need to find if items from a list appear in a string, and then add the items to a different list. This code works:
data = []
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]
for i in _legal:
    if i in line:
        data.append(i)
However, the code iterates over line (which could be long) multiple times: as many times as there are items in _legal (which could be a lot). That's too slow for me, and I'm searching for a way to do it faster. line doesn't have any specific format, so using .split() wouldn't work, as far as I know.
Edit: changed line so that it better represents the problem.
Recommended answer
One way I could think of to improve on this is:
- Get all unique lengths of the words in _legal.
- Build a dictionary (here, a set) of the substrings of line that have those particular lengths, using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), counting each slice as one step; this should be better than brute force when _legal has many items but few distinct lengths.
- Now look for each thing in the dictionary in O(1).
Code:
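line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
# Unique lengths of the words we are looking for
ul = {len(i) for i in _legal}
s = set()
for l in ul:
    # Collect every window of length l; the +1 includes the final window
    s = s.union({line[i:i+l] for i in range(len(line)-l+1)})
# Words from _legal that occur somewhere in line
print(s.intersection(set(_legal)))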
Output (a set, so the order may vary):
{'thing1', 'fj la', 'thing2', 't5', '5'}
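If the results need to go into a list in the same order as _legal, as in the question, the same sliding-window idea can be wrapped in a small helper. This is only a sketch: the name find_substrings is made up for illustration, and skipping empty strings is an assumption about what is wanted.

```python
def find_substrings(line, legal):
    """Return the items of `legal` that occur in `line`, in their original order.

    Hypothetical helper illustrating the sliding-window approach above.
    """
    # Distinct lengths of the candidates (empty strings are skipped).
    lengths = {len(item) for item in legal if item}

    # Every window of `line` that has one of those lengths.
    windows = set()
    for n in lengths:
        # range is empty when n > len(line), so overly long items simply never match
        windows.update(line[i:i + n] for i in range(len(line) - n + 1))

    # Set membership is O(1) on average per candidate.
    return [item for item in legal if item in windows]


line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
data = find_substrings(line, _legal)
print(data)  # ['thing1', 'thing2', 't5', '5', 'fj la']
```

Note that the set of windows can grow to roughly len(line) entries per distinct length, so this trades memory for speed; it pays off mainly when _legal is large and its items share only a few lengths.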