拉宾卡普字符串匹配算法 [英] Rabin Karp string matching algorithm

查看:215
本文介绍了拉宾卡普字符串匹配算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经看到了网站上的论坛,这个拉宾卡普字符串匹配算法,我很感兴趣,想实现它,但我想知道是否有人能告诉我为什么变量ULONG Q和ulong中D是100007和256分别为:S'
没有这些价值观随身携带什么意义呢?



 静态无效的主要(字串[] args)
{
串A =字符串包含的模式。
字符串B =模式;
ULONG的sIgA = 0;
ULONG SIGB = 0;
ULONG Q = 100007;
ULONG D = 256;
的for(int i = 0; I< b.length个;我++)
{
=分泌型IgA(sIgA的* D +(ULONG)A [i])%Q;
SIGB =(SIGB * D +(ULONG)B〔I])%Q;
}
如果(sIgA的== SIGB)
{
Console.WriteLine(的String.Format(>> {0}<< {1}, A.Substring(0,b.length个),A.Substring(b.length个)));
的回报;
}
ULONG POW = 1;
为(INT K = 1; K< = b.length个 - 1; k ++)
= POW(战俘* D)%Q;

为(INT J = 1; J< =则为a.length - b.length个; J ++)
{
=分泌型IgA(sIgA的+ Q - 战俘*(ULONG) A [的J - 1]%Q)%Q;
=分泌型IgA(sIgA的* D +(ULONG)A [J + b.length个 - 1])%Q;
如果(sIgA的== SIGB)
{
如果(A.Substring(J,b.length个)== B)
{
Console.WriteLine(串.Format({0}>> {1}<< {2},A.Substring(0,J),
A.Substring(J,b.length个),
A.Substring(J + b.length个)));
的回报;
}
}
}
Console.WriteLine(不可复制!);
}


解决方案

关于幻数保罗的回答是相当清楚的。



至于代码而言,拉宾卡普的主要想法是执行字符串和模式的滑动部分之间的哈希值进行比较。



哈希不能对整个子每次来计算,否则,计算复杂度会二次为O(n ^ 2)而不是线性 O(N)



因此,的滚动散列的函数被施加,如在每次迭代仅需要一个字符来更新子串的散列值。



那么,让我们来注释代码:

 的for(int i = 0; I< b.length个;我++)
{
=分泌型IgA(sIgA的* D +(ULONG)A [i])%Q;
SIGB =(SIGB * D +(ULONG)B〔I])%Q;
}
如果(sIgA的== SIGB)
{
Console.WriteLine(的String.Format(>> {0}<< {1}, A.Substring(0,b.length个),A.Substring(b.length个)));
的回报;
}



^ 这片计算模式的哈希值 B SIGB )和的初始子的哈希码A B 的长度相同。
其实它并不完全正确,因为散列可以collide¹,因此,有必要修改if语句:如果(sIgA的== SIGB&放大器;&安培; A.Substring(0,b.length个)== b)

  ULONG POW = 1。 
为(INT K = 1; K< = b.length个 - 1; k ++)
= POW(战俘* D)%Q;



^ 这里的计算 POW 这是需要进行滚动哈希

 为(INT J = 1;Ĵ < =则为a.length  -  b.length个; J ++)
{
=分泌型IgA(sIgA的+ Q - 战俘*(ULONG)一个[J - 1]%Q)%Q;
=分泌型IgA(sIgA的* D +(ULONG)A [J + b.length个 - 1])%Q;
如果(sIgA的== SIGB)
{
如果(A.Substring(J,b.length个)== B)
{
Console.WriteLine(串.Format({0}>> {1}<< {2},A.Substring(0,J),
A.Substring(J,b.length个),
A.Substring(J + b.length个)));
的回报;
}
}
}



^ 最后,剩余的字符串(即从第二个字符结束),扫描更新的子字符串的散列值,并与B(在开始计算)的哈希值进行比较。



如果两个散列相等,则子和图案都compared¹,如果他们实际上是等于返回的消息。






散列值可以碰撞;因此,如果两个字符串有不同的hash值,他们的绝对不同,但如果两个散列都是平等的,他们的可以等于或没有。


I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S? What significance do these values carry with them?

static void Main(string[] args)
{
    string A = "String that contains a pattern.";
    string B = "pattern";
    ulong siga = 0;
    ulong sigb = 0;
    ulong Q = 100007;
    ulong D = 256;
    for (int i = 0; i < B.Length; i++)
    {
        siga = (siga * D + (ulong)A[i]) % Q;
        sigb = (sigb * D + (ulong)B[i]) % Q;
    }
    if (siga == sigb)
    {
        Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
        return;
    }
    ulong pow = 1;
    for (int k = 1; k <= B.Length - 1; k++)
        pow = (pow * D) % Q;

    for (int j = 1; j <= A.Length - B.Length; j++)
    {
        siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
        siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
        if (siga == sigb)
        {
            if (A.Substring(j, B.Length) == B)
            {
                Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                    A.Substring(j, B.Length),
                                                                    A.Substring(j + B.Length)));
                return;
            }
        }
    }
    Console.WriteLine("Not copied!");
}

解决方案

About the magic numbers Paul's answer is pretty clear.

As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.

The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2) instead of linear O(n).

Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.

So, let's comment your code:

for (int i = 0; i < B.Length; i++)
{
    siga = (siga * D + (ulong)A[i]) % Q;
    sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
    Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
    return;
}

^ This piece computes the hash of pattern B (sigb), and the hashcode of the initial substring of A of the same length of B. Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B).

ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
    pow = (pow * D) % Q;

^ Here's computed pow that is necessary to perform the rolling hash.

for (int j = 1; j <= A.Length - B.Length; j++)
{
    siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
    siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
    if (siga == sigb)
    {
        if (A.Substring(j, B.Length) == B)
        {
            Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                A.Substring(j, B.Length),
                                                                A.Substring(j + B.Length)));
            return;
        }
    }
}

^ Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).

If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.


¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.

这篇关于拉宾卡普字符串匹配算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆