Python: optimal search for substring in list of strings

Question

I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:

listStrings = ['ACDE', 'CDDE', 'BPLL', ...]

listSubstrings = ['ACD', 'BPI', 'KLJ', ...]

The above entries are just examples. len(listStrings) is ~60,000, len(listSubstrings) is ~50,000-300,000, and len(listStrings[i]) is anywhere from 10 to 30,000.

My current Python attempt is:

for i in listSubstrings:
    for j in listStrings:
        if i in j:
            w.write(i + j)  # w is a previously opened output file

Or something along these lines. While this works for my task, it's horribly slow, using one core and taking on the order of 40 minutes to complete the task. Is there a way to speed this up?

I don't believe I can make a dict mapping listStrings to listSubstrings, because duplicate entries may need to be stored on both sides (although I may try this if I can find a way to append a unique tag to each entry, since dicts are so much faster). Similarly, I don't think I can pre-compute all possible substrings. I don't even know whether searching dict keys is faster than searching a list (since dict.get() looks up an exact key rather than sub-inputs). Is searching lists in memory just that slow, relatively speaking?
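
A side note on that last question: membership tests on a dict or set are hash-based and average O(1), while in on a list is a linear scan, so for exact-key lookups dicts really are much faster. Here is a minimal, illustrative sketch of that gap; note that it measures whole-string membership, not the substring search the question actually needs, and the data sizes are made up:

import timeit

# Illustration only: exact-membership lookup, not substring search.
values = [str(n) for n in range(100000)]
as_list = values
as_set = set(values)

# '99999' sits at the end of the list, so the list scan is near worst-case here.
print(timeit.timeit(lambda: '99999' in as_list, number=1000))  # linear scan
print(timeit.timeit(lambda: '99999' in as_set, number=1000))   # hash lookup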

Answer

Maybe you can try chunking one of the two lists (the biggest? intuitively I would cut listStrings) into smaller pieces, then run these searches in parallel across processes (the Pool class from multiprocessing offers a convenient way to do this). I saw a significant speed-up using something like:

from multiprocessing import Pool
from itertools import chain, islice
from random import randint

# The function to be run in parallel:
def my_func(strings):
    return [j + i for i in strings for j in listSubstrings if i.find(j) > -1]

# A small recipe from itertools to chunk an iterable:
def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

# Generate some fake, random values:
listStrings = \
    [''.join([chr(randint(65, 90)) for i in range(randint(1, 500))]) for j in range(10000)]
listSubstrings = \
    [''.join([chr(randint(65, 90)) for i in range(randint(1, 100))]) for j in range(1000)]

# Prepare the searches to be performed (split listStrings into 8 chunks):
prep = list(chunk(listStrings, round(len(listStrings) / 8)))

# Note: on platforms that use the "spawn" start method (Windows, macOS),
# this part should be wrapped in an `if __name__ == '__main__':` guard.
with Pool(4) as mp_pool:
    # Pool.map is a parallel version of map()
    res = mp_pool.map(my_func, prep)

# `res` is a list of lists, so concatenate them to get a flat result list
result = list(chain.from_iterable(res))

Then you can write out the whole result variable at once (instead of writing it line by line):

with open('result_file', 'w') as f:
    f.write('\n'.join(result))

Edit 01/05/18: flatten the result using itertools.chain.from_iterable instead of an ugly workaround using map side effects, following ShadowRanger's advice.
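
For reference, chain.from_iterable lazily flattens exactly one level of nesting, which is just what the list-of-lists returned by Pool.map requires; a tiny standalone illustration (values made up):

from itertools import chain

nested = [['ab', 'cd'], ['ef'], ['gh', 'ij']]
print(list(chain.from_iterable(nested)))  # ['ab', 'cd', 'ef', 'gh', 'ij']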
