Python:Rabin-Karp算法哈希 [英] Python: Rabin-Karp algorithm hashing
问题描述
我正在实现Rabin-Karp算法的乐趣。我碰到了这个伪代码:
I am implementing Rabin-Karp algorithm for fun. I came across this pseudocode:
RABIN -KARP -MATCHER (T, P, d, q)
1 n = T.length
2 m = P.length
3 h = d^(m-1) mod q
4 p=0
5 t= 0
6 for i = 1 to m
/ preprocessing
/
7 p = (dp + P [i]) mod q
8 t = (dt + T [i]) mod q
9 for s = 0 to n-m
/ matching
/
10 if p == t
11 if P [1... m] == T [s + 1...s + m]
12 print "Pattern occurs with shift" s
13 if s < n-m
14 t = (d(t-T[s + 1]h) + T [s + m + 1]) mod q
我在Python 2.7中实现如下:
I implemented in Python 2.7 like so:
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t =0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m):
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here
return result
其中的结果是一个包含模式文本索引的列表。
where result is a list containing the index in text of pattern.
我的代码未能在3141592653589793
中找到26,我怀疑这与伪代码第14行定义的哈希码有关。有人可以帮忙吗?
My code is failing to find the 26 in 3141592653589793 and I suspect it has something to do with my hashcode defined by line 14 of the pseudocode. Can anyone please help with this.
我传入了以下参数:
P = 26
T = 3141592653589793
d = 257
q = 11
P = "26" T = "3141592653589793" d = 257 q = 11
P和T必须是字符的字符串/数组
P and T must be strings/arrays of chars
推荐答案
这是您代码的有效版本:
Here is a working version of your code:
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t = 0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m+1): # note the +1
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (t-h*ord(text[s]))%q # remove letter s
t = (t*d+ord(text[s+m]))%q # add letter s+m
t = (t+q)%q # make sure that t >= 0
return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))
输出为:
[6]
[0, 1, 2, 3]
第一步,检查 text [0..m] ==模式
。在第二步中,您要检查 text [1..m + 1] ==模式
。因此,您从哈希中删除了 text [0]
(目前将其与您预先计算的 h
相乘): t =(th * ord(text [s]))%q
。然后,在其中添加 text [m]
: t =(t * d + ord(text [s + m]))%q
。在下一步中,删除 text [1]
并添加 text [m + 1]
,依此类推。 t =(t + q)%q
这行是因为模数 q
的负数模产生的余数在该范围内(-q; 0]
,我们希望它在 [0; q)
范围内。
On the first step, you check whether text[0..m] == pattern
. On the second step, you want to check whether text[1..m+1] == pattern
. Thus you remove text[0]
from the hash (at the moment it is multiplied by your precomputed h
): t = (t-h*ord(text[s]))%q
. And then, add text[m]
to it: t = (t*d+ord(text[s+m]))%q
. On the next step, you remove text[1]
and add text[m+1]
, and so on. The t = (t+q)%q
line is there because a negative number modulo q
yields remainder in the range (-q; 0]
, and we want it to be in the range [0; q)
.
请注意,您要检查总共 n-m + 1
个子字符串,而不是 nm
,因此对于范围在(n-m + 1)内的s,正确的循环是。通过第二个示例进行检查(在 xxxxx中找到 xx)。
Note that you want to check a total of n-m+1
substrings, not n-m
, hence the correct loop is for s in range(n-m+1)
. Checked by the second example (finding "xx" in "xxxxx").
还应注意:
-
如果
m <,行
h = pow(d,m-1)%q
可能太慢/ code>大。最好在每个m-2
乘法之后对q
取模。
The line
h = pow(d,m-1)%q
may be too slow ifm
is large. It is better to take the result moduloq
after each of them-2
multiplications.
在最坏的情况下,该算法仍为O(nm)。使用 text = a * 100000
和 pattern = a * 50000
,它将找到50001个位置文本的子字符串与模式匹配,它将逐字符检查它们。如果您希望代码在这种极端情况下可以快速运行,则应跳过逐个字符的比较,并找到一种处理误报的方法(即哈希值相等而字符串不相等)。选择较大的素数 q
可能有助于将误报的可能性降低到可接受的水平。
This algorithm is still O(nm) in the worst case. With text="a"*100000
and pattern="a"*50000
, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime number q
may help reduce the probability of a false positive to an acceptable level.
这篇关于Python:Rabin-Karp算法哈希的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!