Trie vs. 后缀树 vs. 后缀数组 [英] Trie vs. suffix tree vs. suffix array

查看:26
本文介绍了Trie vs. 后缀树 vs. 后缀数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

哪种结构提供了最好的性能结果;trie(前缀树)、后缀树还是后缀数组?还有其他类似的结构吗?这些结构有哪些好的 Java 实现?

Which structure provides the best performance results; trie (prefix tree), suffix tree or suffix array? Are there other similar structures? What are good Java implementations of these structures?

在这种情况下,我想在大型名称词典和大型自然语言文本集之间进行字符串匹配,以识别文本词典的名称.

in this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

推荐答案

trie 是第一个发现的此类数据结构.

The trie was the first data structure of this kind discovered.

后缀树是对 trie 的改进(它具有允许线性错误搜索的后缀链接,后缀树修剪了 trie 的不必要分支,因此它不需要那么多空间).

The suffix tree is an improvement over the trie (it has suffix links which allow linear error search, the suffix tree trims unnecessary branches of the trie therefore it does not require as much space).

后缀数组是基于后缀树的精简数据结构(没有后缀链接(错误匹配慢),但模式匹配非常快).

The suffix array is a stripped down data structure based on the suffix tree (no suffix links (slow error matches), yet pattern matching is very fast).

trie 不适合现实世界使用,因为它占用太多空间.

The trie is not for real world use because it consumes too much space.

后缀树比 trie 更轻、更快,用于索引 DNA 或优化一些大型网络搜索引擎.

The suffix tree is lighter and faster than the trie and is used to index DNA or optimize some large web search engines.

后缀数组在某些模式搜索中比后缀树慢,但占用空间更少,比后缀树使用更广泛.

The suffix array is slower in some pattern searches than the suffix tree but uses less space, and is more widely used than the Suffix tree.

在同一个数据结构家族中:

In the same family of data structures:

还有其他的实现,CST是后缀树的一个实现,使用一个后缀数组和一些额外的数据结构来获得一些后缀树的搜索能力.

There are other implementations, the CST is an implementation of the suffix tree using a suffix array and some additional data structures to get some of the suffix tree search capabilities.

FCST 更进一步,它实现了一个带有后缀数组的采样后缀树.

The FCST takes it further, it implements a sampled suffix tree with a suffix array.

DFCST 是 FCST 的动态版本.

The DFCST is a dynamic version of the FCST.

扩展:

两个重要的因素是空间使用和操作执行时间.您可能认为对于现代机器,这无关紧要,但索引一个人的 DNA 需要 40 GB 的内存(使用未压缩和未优化的后缀树).并且在这么多数据上构建其中一个索引可能需要数天时间.想象一下谷歌,它有很多可搜索的数据,他们需要一个涵盖所有网络数据的大索引,并且每次有人构建网页时他们都不会更改它.他们有某种形式的缓存.然而,主索引可能是静态的.每隔几周左右,他们就会收集所有新的网站和数据,并建立一个新的索引,在新的完成后替换旧的.我不知道他们使用哪种算法来索引,但它可能是分区数据库上具有后缀树属性的后缀数组.

The two important factors are space use and operation execution time. You might think that with modern day machines this is not relevant but to index the DNA of a single human being would require 40 gigabytes of memory (using an uncompressed and unoptimized suffix tree). And to build one of this indexes over this much data can take days. Imagine Google, it has lots of searchable data, they need a large index over all web data and they do not change it every time someone builds a web page. They have some form of caching for that. However the main index is probably static. And every couple of weeks or so they gather all new web sites and data and build a new index, replacing the old when the new is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.

CST 使用 8GB,但是后缀树的操作速度大大降低.

The CST uses 8 gigabytes, however the suffix tree operations speed are heavily reduced.

后缀阵列可以在大约 700 兆到 2 千兆的范围内执行相同的操作.但是,您不会在带有后缀数组的 DNA 中发现遗传错误(意思是:搜索带有通配符的模式要慢得多).

The suffix array can do the same in some 700 megas to 2 Gigas. However you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much much slower).

FCST(完全压缩的后缀树)可以创建一个 800 到 1.5 千兆的后缀树.向 CST 的速度恶化相当小.

The FCST (fully compressed suffix tree) can create a suffix tree in 800 to 1.5 gigas. With a rather small speed deterioration towards the CST.

DFCST 使用的空间比 FCST 多 20%,并且速度不如 FCST 的静态实现(但是动态索引非常重要).

The DFCST uses 20% more space than the FCST, and loses speed to the static implementation of the FCST (however a dynamic index is very important).

后缀树的可行(空间明智)实现并不多,因为很难使运算速度提升补偿数据结构 RAM 空间成本.

There are not many viable (space wise) implementations of the suffix tree because it is very hard to make the operations speed boost compensate the data structures RAM space cost.

这就是说,后缀树对于有错误的模式匹配有非常有趣的搜索结果.aho corasick 没有那么快(尽管对于某些操作来说几乎一样快,而不是错误匹配)并且 boyer moore 被抛在了脑后.

This said, the suffix tree has very interesting search results for pattern matching with errors. The aho corasick is not as fast (though nearly as fast for some operations, not error matching) and the boyer moore is left in the dust.

这篇关于Trie vs. 后缀树 vs. 后缀数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆