Python：Rabin-Karp算法哈希 [英] Python: Rabin-Karp algorithm hashing

查看：153 发布时间：2020/6/3 20:07:07 python string algorithm pattern-matching

本文介绍了Python：Rabin-Karp算法哈希的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在实现Rabin-Karp算法的乐趣。我碰到了这个伪代码：

I am implementing Rabin-Karp algorithm for fun. I came across this pseudocode:

    RABIN -KARP -MATCHER (T, P, d, q)
    1 n = T.length
    2 m = P.length
    3 h = d^(m-1) mod q
    4 p=0
    5 t= 0
    6 for i = 1 to m
    / preprocessing
    /
    7 p = (dp + P [i]) mod q
    8 t = (dt + T [i]) mod q
    9 for s = 0 to n-m
    / matching
    /
    10     if p == t
    11         if P [1... m] == T [s + 1...s + m]
    12             print "Pattern occurs with shift" s
    13     if s < n-m
    14         t  = (d(t-T[s + 1]h) + T [s + m + 1]) mod q

我在Python 2.7中实现如下：

I implemented in Python 2.7 like so:

def Rabin_Karp_Matcher(text, pattern, d, q):
    n = len(text)
    m = len(pattern)
    h = pow(d,m-1)%q
    p = 0
    t =0
    result = []
    for i in range(m): # preprocessing
        p = (d*p+ord(pattern[i]))%q
        t = (d*t+ord(text[i]))%q
    for s in range(n-m):
        if p == t: # check character by character
            match = True
            for i in range(m):
                if pattern[i] != text[s+i]:
                    match = False
                    break
            if match:
                result = result + [s]
        if s < n-m:
                t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here
    return result

其中的结果是一个包含模式文本索引的列表。

where result is a list containing the index in text of pattern.

我的代码未能在3141592653589793
中找到26，我怀疑这与伪代码第14行定义的哈希码有关。有人可以帮忙吗？

My code is failing to find the 26 in 3141592653589793 and I suspect it has something to do with my hashcode defined by line 14 of the pseudocode. Can anyone please help with this.

我传入了以下参数：

P = 26
T = 3141592653589793
d = 257
q = 11

P = "26" T = "3141592653589793" d = 257 q = 11

P和T必须是字符的字符串/数组

P and T must be strings/arrays of chars

推荐答案

这是您代码的有效版本：

Here is a working version of your code:

def Rabin_Karp_Matcher(text, pattern, d, q):
    n = len(text)
    m = len(pattern)
    h = pow(d,m-1)%q
    p = 0
    t = 0
    result = []
    for i in range(m): # preprocessing
        p = (d*p+ord(pattern[i]))%q
        t = (d*t+ord(text[i]))%q
    for s in range(n-m+1): # note the +1
        if p == t: # check character by character
            match = True
            for i in range(m):
                if pattern[i] != text[s+i]:
                    match = False
                    break
            if match:
                result = result + [s]
        if s < n-m:
            t = (t-h*ord(text[s]))%q # remove letter s
            t = (t*d+ord(text[s+m]))%q # add letter s+m
            t = (t+q)%q # make sure that t >= 0
    return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))

输出为：

[6]
[0, 1, 2, 3]

第一步，检查 text [0..m] ==模式。在第二步中，您要检查 text [1..m + 1] ==模式。因此，您从哈希中删除了 text [0] （目前将其与您预先计算的 h 相乘）： t =（th * ord（text [s]））％q 。然后，在其中添加 text [m] ： t =（t * d + ord（text [s + m]））％q 。在下一步中，删除 text [1] 并添加 text [m + 1] ，依此类推。 t =（t + q）％q 这行是因为模数 q 的负数模产生的余数在该范围内（-q; 0] ，我们希望它在 [0; q）范围内。

On the first step, you check whether text[0..m] == pattern. On the second step, you want to check whether text[1..m+1] == pattern. Thus you remove text[0] from the hash (at the moment it is multiplied by your precomputed h): t = (t-h*ord(text[s]))%q. And then, add text[m] to it: t = (t*d+ord(text[s+m]))%q. On the next step, you remove text[1] and add text[m+1], and so on. The t = (t+q)%q line is there because a negative number modulo q yields remainder in the range (-q; 0], and we want it to be in the range [0; q).

请注意，您要检查总共 n-m + 1 个子字符串，而不是 nm ，因此对于范围在（n-m + 1）内的s，正确的循环是。通过第二个示例进行检查（在 xxxxx中找到 xx）。


Note that you want to check a total of n-m+1 substrings, not n-m, hence the correct loop is for s in range(n-m+1). Checked by the second example (finding "xx" in "xxxxx").
还应注意：
 
  如果 m <，行 h = pow（d，m-1）％q 可能太慢/ code>大。最好在每个 m-2 乘法之后对 q 取模。


The line h = pow(d,m-1)%q may be too slow if m is large. It is better to take the result modulo q after each of the m-2 multiplications.
在最坏的情况下，该算法仍为O（nm）。使用 text = a * 100000 和 pattern = a * 50000 ，它将找到50001个位置文本的子字符串与模式匹配，它将逐字符检查它们。如果您希望代码在这种极端情况下可以快速运行，则应跳过逐个字符的比较，并找到一种处理误报的方法（即哈希值相等而字符串不相等）。选择较大的素数 q 可能有助于将误报的可能性降低到可接受的水平。
This algorithm is still O(nm) in the worst case. With text="a"*100000 and pattern="a"*50000, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime number q may help reduce the probability of a false positive to an acceptable level.

                        这篇关于Python：Rabin-Karp算法哈希的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

Python：Rabin-Karp算法哈希 [英] Python: Rabin-Karp algorithm hashing

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

Python：Rabin-Karp算法哈希 [英] Python: Rabin-Karp algorithm hashing

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭