最常见的长X的子 [英] Most common substring of length X

查看:201
本文介绍了最常见的长X的子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个字符串s,我想搜索的长X中最常出现在S中的子字符串。重叠子是不允许的。

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.

例如,如果s =aoaoa和X = 3,该算法应该找到AOA(出现2次以s)。

For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s).

请问一个算法存在,这是否在O(n)的时间?

Does an algorithm exist that does this in O(n) time?

推荐答案

您可以使用做到这一点滚动哈希在O(n)时间(假设好的哈希分布)。一个简单的滚动哈希值将是字符串中的字符的异或,您可以使用仅2异或的previous串哈希增量计算它。 (参见更好的滚动哈希比XOR的维基百科条目。)计算的N-X + 1使用滚动散列在O(n)的时间串的哈希值。如果没有冲突,答案是明确的 - 如果碰撞发生,你需要做更多的工作。我的大脑好痛试图找出是否可以在O(n)时间都可以解决。

You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the string, you can compute it incrementally from the previous substring hash using just 2 xors. (See the Wikipedia entry for better rolling hashes than xor.) Compute the hash of your n-x+1 substrings using the rolling hash in O(n) time. If there were no collisions, the answer is clear - if collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.

更新:

下面是一个随机O(n)的算法。您可以通过扫描哈希表查找O(n)时间上散列(保持简单,假定没有任何关系)。找到一个X长度字符串与哈希(备存纪录在哈希表,或者只是重做滚动哈希)。然后使用 O(n)的字符串搜索算法找到该字符串的所有出现在秒。如果你发现你在哈希表中记录的事件数相同,就大功告成了。

Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.

如果不是,那意味着你有一个散列冲突。选择一个新的随机哈希函数,然后再试一次。如果散列函数具有的log(n)1比特,并且两两独立[习题(八(多个)==小时(吨))≤; 1/2 ^ {N + 1}如果s =吨],则概率,最频繁的X长度的子串以s散列带冲撞的<!=正其他长度x的子s如果为1/2。因此,如果有冲突,选择一个新的随机哈希函数,然后重试,你需要尝试的只有一个常数,你成功了。

If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) < 1/2^{n+1} if s != t], then the probability that the most frequent x-length substring in s hash a collision with the <=n other length x substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry, you will need only a constant number of tries before you succeed.

现在,我们只需要一个随机两两独立的滚动哈希算法。

Now we only need a randomized pairwise independent rolling hash algorithm.

UPDATE2:

其实,你需要2log(n)的散列位,以避免所有的(N选2)的碰撞,因为任何的碰撞可能隐藏了正确的答案。尽管如此可行的,它看起来像散列由一般多项式除法应该做的把戏。

Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.

这篇关于最常见的长X的子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆