最短重复子串 [英] Shortest Repeating Sub-String

查看:63
本文介绍了最短重复子串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一种有效的方法来提取最短的重复子字符串.例如:

input1 = 'dabcdbcdbcdd'输出 1 = 'bcd'input2 = 'cbabababac'输出 2 = 'ba'

如果您有任何与问题相关的答案或信息,我将不胜感激.

另外,在这篇文章中,人们建议我们可以使用像

re=^(.*?)\1+$

找到字符串中最小的重复模式.但是这样的表达式在 Python 中不起作用并且总是返回一个不匹配(我是 Python 新手,也许我错过了什么?).

---跟进---

这里的标准是寻找长度大于1且总长度最长的最短非重叠模式.

解决方案

此模式的快速修复可能是

(.+?)\1+

您的正则表达式失败,因为它将重复字符串锚定到行的开头和结尾,只允许像 abcabcabc 这样的字符串,但不允许 xabcabcabcx.另外,重复字符串的最小长度应该是 1,而不是 0(或者任何字符串都会匹配),因此 .+? 而不是 .*?.

在 Python 中:

<预><代码>>>>进口重新>>>r = re.compile(r"(.+?)\1+")>>>r.findall("cbabababac")['巴']>>>r.findall("dabcdbcdbcdd")['bcd']

但请注意,此正则表达式只会找到不重叠的重复匹配项,因此在最后一个示例中,将找不到解决方案 d 尽管这是最短的重复字符串.或者看这个例子:这里找不到 abcd 因为第一个 abcdabc 部分已经在第一场比赛中用完了):

<预><代码>>>>r.findall("abcabcdabcd")['abc']

此外,它可能会返回多个匹配项,因此您需要在第二步中找到最短的匹配项:

<预><代码>>>>r.findall("abcdabcdabcabc")['abcd', 'abc']

更好的解决方案:

要允许引擎也找到重叠的匹配项,请使用

(.+?)(?=\1)

这会找到一些字符串两次或更多次,如果它们重复的次数足够多,但它肯定会找到所有可能的重复子字符串:

<预><代码>>>>r = re.compile(r"(.+?)(?=\1)")>>>r.findall("dabcdbcdbcdd")['bcd', 'bcd', 'd']

因此,您应该按长度对结果进行排序并返回最短的:

<预><代码>>>>min(r.findall("dabcdbcdbcdd") 或 [""], key=len)'d'

或 [""](感谢 J. F. Sebastian!)确保在根本没有匹配的情况下不会触发 ValueError.

I am looking for an efficient way to extract the shortest repeating substring. For example:

input1 = 'dabcdbcdbcdd'
ouput1 = 'bcd'

input2 = 'cbabababac'
output2 = 'ba'

I would appreciate any answer or information related to the problem.

Also, in this post, people suggest that we can use the regular expression like

re=^(.*?)\1+$

to find the smallest repeating pattern in the string. But such expression does not work in Python and always return me a non-match (I am new to Python and perhaps I miss something?).

--- follow up ---

Here the criterion is to look for shortest non-overlap pattern whose length is greater than one and has the longest overall length.

解决方案

A quick fix for this pattern could be

(.+?)\1+

Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.

In Python:

>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']

But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d will not be found although that is the shortest repeating string. Or see this example: here it can't find abcd because the abc part of the first abcd has been used up in the first match):

>>> r.findall("abcabcdabcd")
['abc']

Also, it may return several matches, so you'd need to find the shortest one in a second step:

>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']

Better solution:

To allow the engine to also find overlapping matches, use

(.+?)(?=\1)

This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:

>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']

Therefore, you should sort the results by length and return the shortest one:

>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'

The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.

这篇关于最短重复子串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆