Finding long repeated substrings in a massive string


Problem description


I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for.

I have a really really long string (hundreds of megabytes). I have about 1 GB of RAM.

This is why building a suffix trie with count data is too space-inefficient to work for me. To quote Wikipedia's article on suffix trees:

storing a string's suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

And those were Wikipedia's comments on the suffix tree, not the trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some Wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )

Solution

An effective way to do this is to create an index of the substrings and sort them. This is an O(n lg n) operation.

BWT compression does this step, so it's a well-understood problem, and there are radix and suffix sort implementations (claimed O(n)) and such to make it as efficient as possible. It still takes a long time, perhaps several seconds for large texts.

If you want to use utility code, C++'s std::stable_sort() performs much better than std::sort() for natural language (and is much faster than C's qsort(), but for different reasons).

Then visiting each item to see the length of its common substring with its neighbours is O(n).
