使用拉宾,卡普搜索多个字符串中的模式 [英] Using Rabin-Karp to search for multiple patterns in a string
问题描述
据对拉宾,卡普字符串匹配算法的百科词条,它可以被用来寻找几个在一个字符串的同时不同的图案,同时仍保持线性的复杂性。很显然,这是很容易做时,所有的模式都是一样的长度,但我还是不明白,我们怎么能对图案同时与不同长度进行搜索时,preserve O(n)的复杂性。是否有人可以提供一些线索这光?
According to the wikipedia entry on Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns with differing length simultaneously. Can someone please shed some light on this?
编辑(2011年12月):
Edit (December 2011):
维基百科的文章进行了更新,不再声称匹配O(n)的不同长度的多种模式。
The wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).
推荐答案
我不知道这是否是正确的答案,但无论如何:
在构建的散列值,我们可以为您在该组字符串散列的匹配。阿卡,在电流的哈希值。散列函数/ code通常被实现为一个循环,而循环中,我们可以将我们的快速查找。
当然,我们必须跌倒 M
有从琴弦组的最大字符串长度。
更新:维基百科,
I'm not sure if this is the correct answer, but anyway:
While constructing the hash value, we can check for a match in the set of string hashes. Aka, the current hash value. The hash function/code is usually implemented as a loop and inside that loop we can insert our quick look up.
Of course, we must pick m
to have the maximum string length from the set of strings.
Update: From Wikipedia,
[...]
for i from 1 to n-m+1
if hs ∈ hsubs
if s[i..i+m-1] = a substring with hash hs
return i
hs := hash(s[i+1..i+m]) // <---- calculating current hash
[...]
我们计算的电流的哈希值 M
步骤。在每一步有一个的临时的散列值,我们可以看看(O(1)复杂性)的散列集。所有的哈希值将具有相同的大小,即32位。
We calculate current hash in m
steps. On each step there is a temporary hash value that we can look up ( O(1) complexity ) in the set of hashes. All hashes will have the same size, ie 32 bit.
更新2:的摊销(平均)O(n)的时间复杂度?
上面我说的是 M
必须具有最大字符串长度。事实证明,我们可以利用相反。
随着<一href="http://en.wikipedia.org/wiki/Rabin-Karp%5Fstring%5Fsearch%5Falgorithm#Use%5Fof%5Fhashing%5Ffor%5Fshifting%5Fsubstring%5Fsearch"相对=nofollow>散列,用于移动字符串搜索和固定 M
尺寸就可以达到O(n)的复杂性。
如果我们有不同的长度,我们可以设置 M
到最小字符串长度。此外,在散列集,我们不散列与整个字符串,但与第一m个字符的关联。
现在,虽然搜索的文本,我们检查当前哈希散列集,我们检查了匹配相关联的弦。
这种技术将增加误报但平均有O(n)的时间复杂度。
Update 2: an amortized (average) O(n) time complexity ?
Above I said that m
must have the maximum string length. It turns out that we can exploit the opposite.
With hashing for shifting substring search and a fixed m
size we can achieve O(n) complexity.
If we have variable length strings we can set m
to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with the first m-characters of it.
Now, while searching the text we check if the current hash is in the hash set and we examine the associated strings for a match.
This technique will increase the false alarms but on average it has O(n) time complexity.
这篇关于使用拉宾,卡普搜索多个字符串中的模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!