Non-overlapping pattern matching with gap constraint in Python


Problem description

I want to find the total number of non-overlapping matches of a pattern in a sequence, with a gap constraint of 2.

For example, 2982f 2982l 2981l is a pattern found using some algorithm. I have to count how many times this pattern appears in a sequence such as 2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l … , where the maximum gap constraint is 2.

A gap constraint of 2 means that at most 2 other words are allowed between consecutive words of the pattern 2982f 2982l 2981l. Most importantly, all of these matches must be non-overlapping.

E.g. for the pattern 2982f 2982l 2981l in the sequence 2982f 2982f 2982l 2982l 2981l:


  • 2982f 2982f 2982l 2982l 2981l is a match
  • 2982f 2982l 2982l 2981l is another match

So the pattern appears twice, but I should count it only once because the two matches overlap.

So far, I am storing all the indexes at which the words of the pattern appear.

import collections

pt = '2982f  2982l  2981l'

seq = '2982f  2982f  2982l  2982l  2981l  3111m 3171f  2982f  2982l  2981l  2752l 2982f  2771f  2771l  2982l  2981l  2981l 3211f 3342f 3341l 3411f 3441f 2982f  2731f  2742f  2982l  2822f  2981l 2811f 2982f  3001f 2992f 2992m  2982l  2981l'

# Map each pattern word to the list of positions where it occurs in seq.
pt_split = pt.split()
pt_dic = collections.OrderedDict()
for word in pt_split:
    pt_dic[word] = []

for index, word in enumerate(seq.split()):
    if word in pt_dic:
        pt_dic[word].append(index)

print(pt_dic)

Output:

OrderedDict([('2982f', [0, 1, 7, 11, 22, 29]), ('2982l', [2, 3, 8, 14, 25, 33]), ('2981l', [4, 9, 15, 16, 27, 34])])

Now my idea is to subtract the indexes in a way that lets me extract all the non-overlapping matches while respecting the gap constraint. But I don't understand how to proceed from this point.

Can someone please help with this, or suggest a better solution? It would be really helpful. Thanks.

Recommended answer

This can be solved elegantly with regex. We just have to convert the pattern into a regex and then count how often that regex matches in the input sequence.

For example, given the input pattern = 'A B C' and max_gap = 2, we want to create a regex like

A(arbitrary_word){,2}?B(arbitrary_word){,2}?C

Matching arbitrary words separated by spaces can be done with (?:\S+\s+), so we get:

import re

def count_matches(pattern, seq, max_gap):
    parts = map(re.escape, pattern.split())
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    regex = r'\b{}\b'.format(sep.join(parts))
    return sum(1 for _ in re.finditer(regex, seq))
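To see what regex the function above ends up searching for, it can help to build the pattern string on its own and print it. The sketch below assumes Python 3; build_gap_regex is a hypothetical helper name introduced here for illustration, and it writes the quantifier as {0,N} rather than {,N}, which Python's re module treats the same way.

```python
import re

def build_gap_regex(pattern, max_gap):
    # Hypothetical helper (not part of the answer): returns the regex string
    # that allows at most max_gap extra words between consecutive pattern words.
    parts = [re.escape(word) for word in pattern.split()]
    # \s+ separates two words; (?:\S+\s+){0,N}? lazily skips up to N extra words.
    sep = r'\s+(?:\S+\s+){{0,{}}}?'.format(max_gap)
    return r'\b{}\b'.format(sep.join(parts))

print(build_gap_regex('A B C', 2))
# \bA\s+(?:\S+\s+){0,2}?B\s+(?:\S+\s+){0,2}?C\b
```

Because re.finditer resumes scanning after the end of each match, the matches it reports never overlap, which is what makes counting them sufficient here.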






Test runs:

count_matches('2982f  2982l  2981l', '2982f  2982f  2982l  2982l  2981l', 2)
# result: 1

count_matches('A B C', 'A B D E C B A B A B C', 2)
# result: 2
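As a cross-check that does not rely on regex, the same count can be computed directly on the list of words, which is closer to the index-based idea in the question. This is only a sketch under assumptions: count_gap_matches is a hypothetical name, and the function scans left to right, tries the smallest gap first, backtracks when a choice leads nowhere, and resumes after the end of each match so that matches cannot overlap.

```python
def count_gap_matches(pattern, seq, max_gap):
    # Hypothetical token-based alternative to the regex approach.
    pat = pattern.split()
    tokens = seq.split()
    if not pat:
        return 0

    def match_end(pos, k):
        # tokens[pos] has been matched to pat[k]. Return the index of the
        # last token of a complete match, or None if there is none.
        if k == len(pat) - 1:
            return pos
        # The next pattern word may sit 0..max_gap tokens further on.
        for nxt in range(pos + 1, min(pos + max_gap + 2, len(tokens))):
            if tokens[nxt] == pat[k + 1]:
                end = match_end(nxt, k + 1)
                if end is not None:
                    return end
        return None

    count = 0
    i = 0
    while i < len(tokens):
        end = match_end(i, 0) if tokens[i] == pat[0] else None
        if end is None:
            i += 1
        else:
            count += 1
            i = end + 1  # continue after the match, so matches never overlap
    return count

print(count_gap_matches('A B C', 'A B D E C B A B A B C', 2))
# 2
```

On the full sequence from the question this sketch counts 4 matches; the last 2982f is not counted because three other words separate it from the next 2982l.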
