Finding multiple substrings in a string without iterating over it multiple times


Question

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data = []
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]
# Each `in` test scans line again, once per item in _legal
for i in _legal:
    if i in line:
        data.append(i)

However, the code iterates over line (which could be long) multiple times: as many times as there are items in _legal (which could be a lot). That's too slow for me, and I'm looking for a faster way to do it. line doesn't have any specific format, so as far as I know, using .split() wouldn't work. Edit: I changed line so that it better represents the problem.

Recommended answer

One way I can think of to improve this is:

  • Get all the unique lengths of the words in _legal.
  • Build a set of the substrings of line that have those particular lengths, using a sliding window technique. This takes about O( len(line)*num_of_unique_lengths ) window steps, which should be better than brute force when _legal contains many words but only a few distinct lengths.
  • Now look up each thing in that set, in O(1) on average.
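
As a small illustration of the sliding-window step on its own, here is a minimal sketch for a single length. The string text and the length n = 3 are made up for this example and are not part of the original question.

```python
# Illustrative values only: a short string and one window length
text = 'abcde'
n = 3

# Slide a window of width n across text and collect each substring.
# The + 1 makes sure the window ending at the last character is included.
windows = {text[i:i + n] for i in range(len(text) - n + 1)}

print(sorted(windows))  # ['abc', 'bcd', 'cde']
```

Repeating this once per distinct length in _legal gives every substring that could possibly match a legal word.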

Code:

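line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
# Distinct lengths of the words we are looking for
unique_lengths = {len(word) for word in _legal}
# Collect every substring of line that has one of those lengths
substrings = set()
for n in unique_lengths:
    # + 1 so that the window ending at the last character is included
    substrings.update(line[i:i+n] for i in range(len(line) - n + 1))
print(substrings.intersection(_legal))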

Output:

{'thing1', 'fj la', 'thing2', 't5', '5'}
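
The same idea can be wrapped in a function and checked against the brute-force loop from the question. This is only a sketch: the name find_present is hypothetical, and it is a slight variation on the answer's code in that it tests each window against the set of legal words as it goes instead of first storing every window of line.

```python
def find_present(line, legal):
    """Return the items of legal that occur as substrings of line."""
    wanted = set(legal)
    found = set()
    # One pass over line per distinct word length
    for n in {len(word) for word in wanted}:
        # range() is empty when n is longer than line, so nothing is tested
        for i in range(len(line) - n + 1):
            window = line[i:i + n]
            if window in wanted:
                found.add(window)
    return found


line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']

result = find_present(line, legal)
# Brute-force reference: one `in` scan of line per legal word
brute = {word for word in legal if word in line}
print(result == brute)  # True
```

Note that slicing and hashing each window costs time proportional to its length, so the set lookups are not free. Whether this beats the plain `in` loop depends on how many words and how many distinct lengths _legal has, so it is worth timing both on real data.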
