Fast implementation of Rolling hash


Problem description

I need a Rolling hash to search for patterns in a file. (I am trying to use the Rabin-Karp string search algorithm).

I understand how a good hash works and how a good rolling hash should work, but I am unable to figure out how to efficiently implement the divide (or inverse multiplication) when rolling the hash. I also read that rsync uses a rolling version of adler32, but that doesn't look like a random enough hash.

Ideally, it would be great if you could point me to an optimized C/C++ implementation, but any pointers in the right direction will help.

Solution

Cipher's "prime base" idea should work decently - though the solution he posted looks a bit sketchy.

I don't think there's any need for inverse multiplication in this method. Here's my solution:

Say the string we currently have hashed is "abc", and we want to append "d" and remove "a".

Just like Cipher, my basic hash algorithm will be:

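unsigned hash(const std::string& s)
{
    unsigned long long ret = 0; //64-bit, so ret*PRIME_BASE can't wrap before the mod
    for (std::size_t i = 0; i < s.size(); i++)
    {
        ret *= PRIME_BASE; //shift over by one
        ret += (unsigned char)s[i]; //add the current char (never negative)
        ret %= PRIME_MOD; //don't overflow
    }
    return (unsigned)ret;
}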

Now, to implement sliding (writing [i] for the i-th character and n for the window length):

hash1 = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]

We'd like to add something at the end and remove the first value, so

hash2 = [1]*base^(n-1) + [2]*base^(n-2) + ... + [n]

First we can add the last letter:

hash2 = (hash1 * PRIME_BASE) + newchar;
=> [0]*base^n + [1]*base^(n-1) + ... + [n-1]*base + [n]

Then simply subtract the first character:

hash2 -= firstchar * pow(base, n);
=> [1]*base^(n-1) + ... + [n]
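The two steps above can be checked on the "abc" to "bcd" example. This is only a minimal sketch: `direct_hash`, `base_power` and `roll` are hypothetical helper names, the constants are assumed to be the ones used in the implementation further down, and everything is reduced modulo `PRIME_MOD` so that `pow(base, n)` becomes a modular power rather than a floating-point call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed constants, matching the values used in the answer's implementation.
const std::uint64_t PRIME_BASE = 257;
const std::uint64_t PRIME_MOD = 1000000007;

// Direct polynomial hash of a whole string (hypothetical helper).
std::uint64_t direct_hash(const std::string& s)
{
    std::uint64_t ret = 0;
    for (unsigned char c : s)
        ret = (ret * PRIME_BASE + c) % PRIME_MOD;
    return ret;
}

// base^n mod PRIME_MOD, computed with a simple loop (hypothetical helper).
std::uint64_t base_power(std::size_t n)
{
    std::uint64_t power = 1;
    for (std::size_t i = 0; i < n; i++)
        power = (power * PRIME_BASE) % PRIME_MOD;
    return power;
}

// Slide a window of length n one position to the right:
// append newchar, then subtract firstchar * base^n (passed in as power_n).
std::uint64_t roll(std::uint64_t hash1, unsigned char firstchar,
                   unsigned char newchar, std::uint64_t power_n)
{
    assert(hash1 < PRIME_MOD); // the input hash must already be reduced
    std::uint64_t hash2 = (hash1 * PRIME_BASE + newchar) % PRIME_MOD;
    std::uint64_t drop = (power_n * firstchar) % PRIME_MOD;
    // Add PRIME_MOD before subtracting so the unsigned value never wraps.
    return (hash2 + PRIME_MOD - drop) % PRIME_MOD;
}
```

With these helpers, `roll(direct_hash("abc"), 'a', 'd', base_power(3))` should give the same value as `direct_hash("bcd")`, with no division or modular inverse involved.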

An important note: you have to be careful about overflow. You can choose to just let it overflow unsigned int, but I think it's much more prone to collision (but also faster!)
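For comparison, the "just let it overflow" variant might look roughly like the sketch below. It assumes a 32-bit `unsigned int` (written here as `std::uint32_t`), where unsigned arithmetic wraps modulo 2^32 by definition, so no `%` is needed at all. The names `wrap_hash`, `wrap_power` and `wrap_roll` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed base, same value as PRIME_BASE in the answer.
const std::uint32_t WRAP_BASE = 257;

// Polynomial hash where every operation silently wraps modulo 2^32.
std::uint32_t wrap_hash(const std::string& s)
{
    std::uint32_t ret = 0;
    for (unsigned char c : s)
        ret = ret * WRAP_BASE + c;
    return ret;
}

// base^n modulo 2^32.
std::uint32_t wrap_power(std::size_t n)
{
    std::uint32_t power = 1;
    for (std::size_t i = 0; i < n; i++)
        power *= WRAP_BASE;
    return power;
}

// Append newchar and drop firstchar; the subtraction also wraps modulo 2^32,
// so there is no negative case to repair.
std::uint32_t wrap_roll(std::uint32_t hash1, unsigned char firstchar,
                        unsigned char newchar, std::uint32_t power_n)
{
    return hash1 * WRAP_BASE + newchar - power_n * firstchar;
}
```

The update is a couple of multiplications and additions per character, which is where the speed comes from; the trade-off against collisions is the one described in the note above.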

Here's my implementation:

#include <cstddef>
#include <iostream>
#include <string>

//no "using namespace std;" here: it would make the name hash clash with std::hash

const unsigned PRIME_BASE = 257;
const unsigned PRIME_MOD = 1000000007;

unsigned hash(const std::string& s)
{
    long long ret = 0;
    for (std::size_t i = 0; i < s.size(); i++)
    {
        //cast so that chars above 127 are never negative
        ret = ret*PRIME_BASE + (unsigned char)s[i];
        ret %= PRIME_MOD; //don't overflow
    }
    return (unsigned)ret;
}

int rabin_karp(const std::string& needle, const std::string& haystack)
{
    const std::size_t n = needle.size();
    if (n == 0)
        return 0; //an empty needle matches at the start, like std::string::find

    //I'm using long longs to avoid overflow
    long long hash1 = hash(needle);
    long long hash2 = 0;

    //power = PRIME_BASE^n mod PRIME_MOD
    //you could use exponentiation by squaring for extra speed
    long long power = 1;
    for (std::size_t i = 0; i < n; i++)
        power = (power * PRIME_BASE) % PRIME_MOD;

    for (std::size_t i = 0; i < haystack.size(); i++)
    {
        //add the last letter
        hash2 = hash2*PRIME_BASE + (unsigned char)haystack[i];
        hash2 %= PRIME_MOD;

        //remove the first character, if needed
        if (i >= n)
        {
            hash2 -= power * (unsigned char)haystack[i-n] % PRIME_MOD;
            if (hash2 < 0) //negative can be made positive with mod
                hash2 += PRIME_MOD;
        }

        //match? equal hashes can still be a collision, so compare the text too
        if (i >= n-1 && hash1 == hash2 && haystack.compare(i-(n-1), n, needle) == 0)
            return (int)(i - (n-1));
    }

    return -1;
}

int main()
{
    std::cout << rabin_karp("waldo", "willy werther warhol wendy --> waldo <--") << std::endl;
}
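The comment about exponentiation by squaring could be realised along these lines. This is a sketch rather than part of the original answer: `mod_pow` is a hypothetical helper, and it assumes the modulus fits in 32 bits (true for 1000000007) so that every intermediate product fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Computes (base^exp) mod m by repeated squaring, in O(log exp) multiplications.
// Assumption: 0 < m <= 2^32 - 1, so base*base and result*base fit in 64 bits.
std::uint64_t mod_pow(std::uint64_t base, std::uint64_t exp, std::uint64_t m)
{
    assert(m > 0 && m <= 0xFFFFFFFFull);
    std::uint64_t result = 1 % m; // 1 % m handles the corner case m == 1
    base %= m;
    while (exp > 0)
    {
        if (exp & 1)                    // this bit of the exponent is set,
            result = (result * base) % m; // so multiply the current square in
        base = (base * base) % m;       // square for the next bit
        exp >>= 1;
    }
    return result;
}
```

In `rabin_karp`, the loop that builds `power` would then become a single call such as `mod_pow(PRIME_BASE, needle.size(), PRIME_MOD)`. The loop only runs once per search, so the gain is probably noticeable only for long needles.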
