高效的字符串相互包含 [英] Efficient strings containing each other
问题描述
我有两组字符串(A
和 B
),我想知道 A 中的所有字符串 a
和 b in B
其中 a
是 b
的子串.
I have two sets of strings (A
and B
), and I want to know all pairs of strings a in A
and b in B
where a
is a substring of b
.
编码的第一步如下:
for a in A:
for b in B:
if a in b:
print (a,b)
但是,我想知道 - 是否有更有效的方法来使用正则表达式(例如,不是检查 if a in b:
,而是检查正则表达式 '.*' + a + '.*':
匹配 'b'.我想也许使用这样的东西会让我缓存 Knuth-Morris-Pratt 失效函数.此外,对内部使用列表理解for b in B:
循环可能会带来相当大的加速(嵌套列表理解可能会更好).
However, I wanted to know-- is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:
, check if the regexp '.*' + a + '.*':
matches 'b'. I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a
. Also, using a list comprehension for the inner for b in B:
loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
我对在算法的渐近运行时实现巨大飞跃(例如使用后缀树或其他任何复杂而聪明的东西)并不是很感兴趣.我更关心常量(我只需要对几对 A
和 B
集执行此操作,我不希望它整周都运行).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A
and B
sets, and I don't want it to run all week).
您是否知道任何技巧或有任何通用建议可以更快地做到这一点?非常感谢您可以分享的任何见解!
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
根据@ninjagecko 和@Sven Marnach 的建议,我建立了一个 10-mers 的快速前缀表:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections
prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
for i in xrange(len(prot_seq)-10):
j = i+10+1
prefix_table[b[i:j]].add(k)
for a in A:
if len(a) >= 10:
for k in prefix_table[a[:10]]:
# check if a is in b
# (missing_edges is necessary, but not sufficient)
if a in B[k]:
print (a,b)
else:
for k in xrange(len(prots_and_seqs)):
# a is too small to use the table; check if
# a is in any b
if a in B[k]:
print (a, b)
推荐答案
当然,您可以轻松地将其编写为列表推导式:
Of course you can easily write this as a list comprehension:
[(a, b) for a in A for b in B if a in b]
这可能会稍微加快循环速度,但不要期望过高.我怀疑使用正则表达式对这个问题有任何帮助.
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
编辑:以下是一些时间:
import itertools
import timeit
import re
import collections
with open("/usr/share/dict/british-english") as f:
A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
B = [s.strip() for s in itertools.islice(f, 23000, 25000)]
def f():
result = []
for a in A:
for b in B:
if a in b:
result.append((a, b))
return result
def g():
return [(a, b) for a in A for b in B if a in b]
def h():
res = [re.compile(re.escape(a)) for a in A]
return [(a, b) for a in res for b in B if a.search(b)]
def ninjagecko():
d = collections.defaultdict(set)
for k, b in enumerate(B):
for i, j in itertools.combinations(range(len(b) + 1), 2):
d[b[i:j]].add(k)
return [(a, B[k]) for a in A for k in d[a]]
print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
结果:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
编辑 2:向计时添加了 ninjagecko 建议的算法的变体.你可以看到它比所有的蛮力方法要好得多.
Edit 2: Added a variant of the alogrithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
编辑 3: 使用集合而不是列表来消除重复项.(我没有更新时间——它们基本上保持不变.)
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
这篇关于高效的字符串相互包含的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!