特里与后缀树与后缀数组 [英] Trie vs. suffix tree vs. suffix array

查看:196
本文介绍了特里与后缀树与后缀数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

哪些结构提供了最好的性能测试结果;特里(preFIX树),后缀树或后缀数组?是否有其他类似的结构?这些是什么结构,良好的Java实现?

Which structure provides the best performance results; trie (prefix tree), suffix tree or suffix array? Are there other similar structures? What are good Java implementations of these structures?

编辑:在这种情况下,我想打一个大字典的名称以及大量自然语言文本之间的字符串匹配,从而识别文本词典的名字

in this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

推荐答案

该线索是这样发现的第一个数据结构。

The TRIE was the first data structure of this kind discovered.

后缀树是在TRIE一个改进(它有后缀链接允许线性误差搜索,后缀树修剪线索的不必要的分支,因此它并不需要尽可能多的空间)。

The suffix tree is a improvement over the TRIE ( it has suffix links which allow linear error search, the suffix tree trims unnecessary branches of the TRIE therefore it does not require as much space ).

后缀阵列是基于后缀树一个精简的数据结构(没有后缀链接(慢错误匹配),但模式匹配是非常快的)。

The suffix array is a stripped down data structure based on the suffix tree (no suffix links (slow error matches), yet pattern matching is very fast).

该线索不是现实世界使用,因为它占用太多空间。

The TRIE is not for real world use because it consumes too much space.

的后缀树比Trie树更轻,更快和用于索引的ADN或优化某些大网络搜索引擎。

The suffix tree is lighter and faster than the Trie and is used to index ADN or optimize some large web search engines.

后缀数组是比后缀树的一些模式搜索速度较慢,但​​占用较少的空间,并且比后缀树更广泛的应用。

The suffix array is slower in some pattern searches than the suffix tree but uses less space, and is more widely used than the Suffix tree.

在同一个家庭的数据结构:

In the same family of data structures:

有其他的实现,CST就是使用后缀阵列和一些额外的数据结构来获得一些后缀树搜索能力后缀树的实现。

There are other implementations, the CST is a implementation of the Suffix tree using a suffix array and some additional data structures to get some of the Suffix tree search capabilities.

该FCST进一步采取它,它实现了一个采样后缀树后缀数组。

The FCST takes it further, it implements a sampled suffix tree with a suffix array.

该DFCST是FCST的动态版本。

The DFCST is a dynamic version of the FCST.

拓展:

这两个重要的因素是空间的使用和操作执行时间。你可能会认为,与现代的机器,这是不相关的,而是指数单人的是,你将需要记忆40千兆字节的DNA(使用pssed一个uncom $ P $和未优化的后缀树)。并建立了这么多的数据需要几天这个指标之一。想象一下,谷歌,它有很多可搜索的数据,他们需要在所有网络数据的大型指数,他们没有一个人每建立一个网页时改变它。他们有一些形式的缓存为的。然而,主要指标可能是静态的。而且每隔几周或让他们收集所有新的Web站点和数据,并建立一个新的索引,取代旧的新的完成时。我不知道他们使用的索引的算法,但它可能是一个后缀阵列在一个分区数据库后缀树属性。

The two important factors are space use and operation execution time. You might think that with modern day machines this is not relevant but to index the DNA of a single human being you would require 40 gigabytes of memory (using a uncompressed and unoptimized suffix tree). And to build one of this indexes over this much data can take days. Imagine Google, it has lots of searchable data, they need a large index over all web data and they do not change it every time someone builds a web page. They have some form of caching for that. However the main index is probably static. And every couple of weeks or so they gather all new web sites and data and build a new index, replacing the old when the new is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.

科委使用8千兆字节,不过后缀树的操作速度严重下降。

The CST uses 8 gigabytes, however the suffix tree operations speed are heavily reduced.

后缀列可以做一些700 MEGAS同为2千兆。然而,你将不会在后缀数组中的DNA遗传找到错误(意思是:搜索与通配符模式得多慢得多)。

The Suffix array can do the same in some 700 megas to 2 Gigas. However you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much much slower).

在FCST(完全融为一体pressed后缀树)可以创建800千兆1.5后缀树。随着对CST一个相当小的速度恶化。

The FCST (fully compressed suffix tree) can create a suffix tree in 800 to 1.5 gigas. With a rather small speed deterioration towards the CST.

该DFCST使用比FCST多20%的空间,并失去速度静态实施FCST的。 (但是一个动态的指标是非常重要的)

The DFCST uses 20% more space than the FCST, and loses speed to the static implementation of the FCST. (however a dynamic index is very important)

有没有很多可行的(空间WISE)的后缀树的实现,因为它是很难使操作速度提升补偿数据结构RAM空间成本。

There are not many viable (space wise) implementations of the suffix tree because it is very hard to make the operations speed boost compensate the data structures ram space cost.

这表示,后缀树有图案的错误匹配非常有趣的搜索结果。该AHO corasick是不一样快(虽然几乎一样快一些operatons,不误匹配)和博耶·摩尔在尘土中离开了。

This said the suffix tree has very interesting search results for pattern matching with errors. The aho corasick is not as fast (though nearly as fast for some operatons, not error matching) and the boyer moore is left in the dust.

这篇关于特里与后缀树与后缀数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆