最短重复子串 [英] Shortest Repeating Sub-String
问题描述
我正在寻找一种有效的方法来提取最短的重复子字符串.例如:
input1 = 'dabcdbcdbcdd'输出 1 = 'bcd'input2 = 'cbabababac'输出 2 = 'ba'
如果您有任何与问题相关的答案或信息,我将不胜感激.
另外,在这篇文章中,人们建议我们可以使用像
re=^(.*?)\1+$
找到字符串中最小的重复模式.但是这样的表达式在 Python 中不起作用并且总是返回一个不匹配(我是 Python 新手,也许我错过了什么?).
---跟进---
这里的标准是寻找长度大于1且总长度最长的最短非重叠模式.
此模式的快速修复可能是
(.+?)\1+
您的正则表达式失败,因为它将重复字符串锚定到行的开头和结尾,只允许像 abcabcabc
这样的字符串,但不允许 xabcabcabcx
.另外,重复字符串的最小长度应该是 1,而不是 0(或者任何字符串都会匹配),因此 .+?
而不是 .*?
.
在 Python 中:
<预><代码>>>>进口重新>>>r = re.compile(r"(.+?)\1+")>>>r.findall("cbabababac")['巴']>>>r.findall("dabcdbcdbcdd")['bcd']但请注意,此正则表达式只会找到不重叠的重复匹配项,因此在最后一个示例中,将找不到解决方案 d
尽管这是最短的重复字符串.或者看这个例子:这里找不到 abcd
因为第一个 abcd
的 abc
部分已经在第一场比赛中用完了):
此外,它可能会返回多个匹配项,因此您需要在第二步中找到最短的匹配项:
<预><代码>>>>r.findall("abcdabcdabcabc")['abcd', 'abc']更好的解决方案:
要允许引擎也找到重叠的匹配项,请使用
(.+?)(?=\1)
这会找到一些字符串两次或更多次,如果它们重复的次数足够多,但它肯定会找到所有可能的重复子字符串:
<预><代码>>>>r = re.compile(r"(.+?)(?=\1)")>>>r.findall("dabcdbcdbcdd")['bcd', 'bcd', 'd']因此,您应该按长度对结果进行排序并返回最短的:
<预><代码>>>>min(r.findall("dabcdbcdbcdd") 或 [""], key=len)'d' 或 [""]
(感谢 J. F. Sebastian!)确保在根本没有匹配的情况下不会触发 ValueError
.
I am looking for an efficient way to extract the shortest repeating substring. For example:
input1 = 'dabcdbcdbcdd'
ouput1 = 'bcd'
input2 = 'cbabababac'
output2 = 'ba'
I would appreciate any answer or information related to the problem.
Also, in this post, people suggest that we can use the regular expression like
re=^(.*?)\1+$
to find the smallest repeating pattern in the string. But such expression does not work in Python and always return me a non-match (I am new to Python and perhaps I miss something?).
--- follow up ---
Here the criterion is to look for shortest non-overlap pattern whose length is greater than one and has the longest overall length.
A quick fix for this pattern could be
(.+?)\1+
Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc
but not xabcabcabcx
. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+?
instead of .*?
.
In Python:
>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']
But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d
will not be found although that is the shortest repeating string. Or see this example: here it can't find abcd
because the abc
part of the first abcd
has been used up in the first match):
>>> r.findall("abcabcdabcd")
['abc']
Also, it may return several matches, so you'd need to find the shortest one in a second step:
>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']
Better solution:
To allow the engine to also find overlapping matches, use
(.+?)(?=\1)
This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:
>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
Therefore, you should sort the results by length and return the shortest one:
>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'
The or [""]
(thanks to J. F. Sebastian!) ensures that no ValueError
is triggered if there's no match at all.
这篇关于最短重复子串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!