这是算法直线? [英] Is this algorithm linear?
问题描述
灵感来自这两个问题:<一href="http://stackoverflow.com/questions/8525692/string-manipulation-calculate-the-similarity-of-a-string-with-its-suffixes">String操作:计算&QUOT;相似的字符串,其后缀&QUOT的; 和<一href="http://stackoverflow.com/questions/8552101/program-execution-varies-as-the-i-p-size-increases-beyond-5-in-c">Program执行变化是超越5的I / P尺寸的增加在C中,我想出了下面的算法。
Inspired by these two questions: String manipulation: calculate the "similarity of a string with its suffixes" and Program execution varies as the I/P size increases beyond 5 in C, I came up with the below algorithm.
该问题将是
- 在它是正确的,或者有我在的推理犯了一个错误?
- 什么是算法的最坏情况下的复杂性?
上下文的位在前。对于两个字符串,定义他们的相似性的两个最长的共同preFIX的长度。字符串的总自相似的取值的是取值的其所有后缀的相似性的总和。因此,例如,总的自相似性的 abacab 的是6 + 0 + 1 + 0 + 2 + 0 = 9和的一的总自相似重复ñ
次是 N *(N + 1)/ 2
。
A bit of context first. For two strings, define their similarity as the length of the longest common prefix of the two. The total self-similarity of a string s is the sum of the similarities of s with all of its suffixes. So for example, the total self-similarity of abacab is 6 + 0 + 1 + 0 + 2 + 0 = 9 and the total self-similarity of a repeated n
times is n*(n+1)/2
.
的算法的说明:的算法是根据高德纳 - 莫里斯 - 普拉特字符串搜索算法,在那的边框的字符串的prefixes玩中心作用。
Description of the algorithm: The algorithm is based on the Knuth-Morris-Pratt string searching algorithm, in that the borders of the string's prefixes play the central role.
要概括:一个的边框的字符串的取值的是一个适当的子串的的取值 B 的是同时一个preFIX和后缀的取值的。
To recapitulate: a border of a string s is a proper substring b of s which is simultaneously a prefix and a suffix of s.
备注:如果的 B 和 C 的是取值的用的 B 的比的ç短<边界/ EM>,然后 B 的也是边界的 C 的,相反,每一个边框的 C 的也是的 s的边界的。
Remark: If b and c are borders of s with b shorter than c, then b is also a border of c, and conversely, every border of c is also a border of s.
让取值的是长度的字符串的 N 和 P 的是一个preFIX的取值的带长度的我的。我们把边框的 B 的宽度的 K 的的 P 的*不*扩展如果我==ñ
或 S [I]!= S [K]
,否则它是可扩展(长度 K + 1
$ P $的PFIX的取值的是那么的取值 I + 1 preFIX >)
Let s be a string of length n and p be a prefix of s with length i. We call a border b with width k of p *non-extensible* if either i == n
or s[i] != s[k]
, otherwise it's extensible (the length k+1
prefix of s is then a border of the length i+1
prefix of s).
现在,如果的取值的最长的共同preFIX,并开始与后缀s [I],I&GT; 0
,有长的 K 的,长度的 K 的$ P $中的取值的是一个不可扩展边界PFIX长度的 I + K 的$ P $中的取值的PFIX。这是一个边界,因为它是一种常见的preFIX的取值和 S [我... N-1]
,如果它是可扩展的,它不会是最长的共同preFIX。
Now, if the longest common prefix of s and the suffix starting with s[i], i > 0
, has length k, the length k prefix of s is a non-extensible border of the length i+k prefix of s. It is a border because it's a common prefix of s and s[i .. n-1]
, and if it were extensible, it wouldn't be the longest common prefix.
相反,长度每个非扩展边界(长度的 K 的)的我的$ P $的PFIX的取值的是最长的共同中的取值的preFIX,并开始与后缀s [IK]
。
Conversely, every non-extensible border (of length k) of the length i prefix of s is the longest common prefix of s and the suffix starting with s[i-k]
.
因此,我们可以通过累加长度所有非扩展边界的长度计算的取值的总的自相似性的我的的prefixes的小号的 1&LT; = I&LT; = N
。要做到这一点
So we can calculate the total self-similarity of s by summing the lengths of all non-extensible borders of the length i prefixes of s, 1 <= i <= n
. To do that
- 计算prefixes的标准KMP preprocessing一步的最宽边框的宽度。
- 计算prefixes的最宽不可伸长的边界的宽度。
- 对于每一个的我的
1&LT; = I&LT; = N
,如果P = S [0。 I-1]
有一个非空的不可扩展的边界,让我们的 B 的是最广泛的这些,加上宽度的 B 的和对于所有非空边界的 C 的 B 的,如果是的 P 的,加上它的长度不可扩展的边界。 - 添加长度的 N 的的的取值的,因为这是不包括在上面的。
- Calculate the width of the widest borders of the prefixes by the standard KMP preprocessing step.
- Calculate the width of the widest non-extensible borders of the prefixes.
- For each i,
1 <= i <= n
, ifp = s[0 .. i-1]
has a non-empty non-extensible border, let b be the widest of these, add the width of b and for all non-empty borders c of b, if it is a non-extensible border of p, add its length. - Add the length n of s, since that isn't covered by the above.
code(C):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/*
* Overflow and NULL checks omitted to not clutter the algorithm.
*/
int similarity(char *text){
int *borders, *ne_borders, len = strlen(text), i, j, sim;
borders = malloc((len+1)*sizeof(*borders));
ne_borders = malloc((len+1)*sizeof(*ne_borders));
i = 0;
j = -1;
borders[i] = j;
ne_borders[i] = j;
/*
* Find the length of the widest borders of prefixes of text,
* standard KMP way, O(len).
*/
while(i < len){
while(j >= 0 && text[i] != text[j]){
j = borders[j];
}
++i, ++j;
borders[i] = j;
}
/*
* For each prefix, find the length of its widest non-extensible
* border, this part is also O(len).
*/
for(i = 1; i <= len; ++i){
j = borders[i];
/*
* If the widest border of the i-prefix has width j and is
* extensible (text[i] == text[j]), the widest non-extensible
* border of the i-prefix is the widest non-extensible border
* of the j-prefix.
*/
if (text[i] == text[j]){
j = ne_borders[j];
}
ne_borders[i] = j;
}
/* The longest common prefix of text and text is text. */
sim = len;
for(i = len; i > 0; --i){
/*
* If a longest common prefix of text and one of its suffixes
* ends right before text[i], it is a non-extensible border of
* the i-prefix of text, and conversely, every non-extensible
* border of the i-prefix is a longest common prefix of text
* and one of its suffixes.
*
* So, if the i-prefix has any non-extensible border, we must
* sum the lengths of all these. Starting from the widest
* non-extensible border, we must check all of its non-empty
* borders for extendibility.
*
* Can this introduce nonlinearity? How many extensible borders
* shorter than the widest non-extensible border can a prefix have?
*/
if ((j = ne_borders[i]) > 0){
sim += j;
while(j > 0){
j = borders[j];
if (text[i] != text[j]){
sim += j;
}
}
}
}
free(borders);
free(ne_borders);
return sim;
}
/* The naive algorithm for comparison */
int common_prefix(char *text, char *suffix){
int c = 0;
while(*suffix && *suffix++ == *text++) ++c;
return c;
}
int naive_similarity(char *text){
int len = (int)strlen(text);
int i, sim = 0;
for(i = 0; i < len; ++i){
sim += common_prefix(text,text+i);
}
return sim;
}
int main(int argc, char *argv[]){
int i;
for(i = 1; i < argc; ++i){
printf("%d\n",similarity(argv[i]));
}
for(i = 1; i < argc; ++i){
printf("%d\n",naive_similarity(argv[i]));
}
return EXIT_SUCCESS;
}
所以,这是正确的?我会很吃惊,如果不是,但我已经错了。
So, is this correct? I'd be rather surprised if not, but I've been wrong before.
什么是该算法的最坏情况下的复杂?
What is the worst case complexity of the algorithm?
我认为这是为O(n),但我还没有找到一个证明,可扩展边界的数preFIX能已经包含在它的最广泛的非可扩展边界为界(或者更确切地说,是总这种情况发生的数量为O(n))。
I think it's O(n), but I haven't yet found a proof that the number of extensible borders a prefix can have contained in its widest non-extensible border is bounded (or rather, that the total number of such occurrences is O(n)).
我最感兴趣的是锋利的界限,但如果你能证明它是如为O(n * log n)的或为O(n ^(1 + X))的小型 X
,这已经很好了。 (这显然是在最坏的情况平方,所以回答这是为O(n ^ 2)如果辅之以二次或接近二次行为的一个例子是唯一感兴趣的。)
I'm most interested in sharp bounds, but if you can prove that it's e.g. O(n*log n) or O(n^(1+x)) for small x
, that's already good. (It's obviously at worst quadratic, so an answer of "It's O(n^2)" is only interesting if accompanied by an example for quadratic or near-quadratic behaviour.)
推荐答案
这看起来像一个非常不错的主意,但可悲的是,我相信最坏的情况下的行为是O(n ^ 2)。
This looks like a really neat idea, but sadly I believe the worst case behaviour is O(n^2).
下面是我尝试一个反例。 (我不是一个数学家,所以请原谅我使用Python,而不是方程式前preSS我的想法!)
Here is my attempt at a counterexample. (I'm not a mathematician so please forgive my use of Python instead of equations to express my ideas!)
考虑字符串4K + 1符号
Consider the string with 4K+1 symbols
s = 'a'*K+'X'+'a'*3*K
这将有
borders[1:] = range(K)*2+[K]*(2*K+1)
ne_borders[1:] = [-1]*(K-1)+[K-1]+[-1]*K+[K]*(2*K+1)
需要注意的是:
Note that:
1)ne_borders [Ⅰ]将等于K中第i(2K + 1)的值。
1) ne_borders[i] will equal K for (2K+1) values of i.
2)为0℃= J&其中; = K,边框[J] = J-1
2) for 0<=j<=K, borders[j]=j-1
3)你的算法决赛圈将进入其中j内循环== K中的我2K + 1的值
3) the final loop in your algorithm will go into the inner loop with j==K for 2K+1 values of i
4)内循环将迭代K次,减少J即可0
4) the inner loop will iterate K times to reduce j to 0
5),这将导致该算法需要多N * N / 8的操作做长度为N的最坏情况字符串
5) This results in the algorithm needing more than N*N/8 operations to do a worst case string of length N.
例如,在K = 4它绕着内环39倍
For example, for K=4 it goes round the inner loop 39 times
s = 'aaaaXaaaaaaaaaaaa'
borders[1:] = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
ne_borders[1:] = [-1, -1, -1, 3, -1, -1, -1, -1, 4, 4, 4, 4, 4, 4, 4, 4, 4]
有关K = 2248它绕着内环10111503次!
For K=2,248 it goes round the inner loop 10,111,503 times!
或许有办法解决的算法对于这种情况?
Perhaps there is a way to fix the algorithm for this case?
这篇关于这是算法直线?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!