什么是开头为和/或包含搜索的最快的字符串收集结构/算法 [英] whats the fastest string collection structure/algorithm for startswith and/or contains searches

查看:95
本文介绍了什么是开头为和/或包含搜索的最快的字符串收集结构/算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下情况:
我有很多字符串(平均说250.000+),平均长度为30。
我要做的是在这些字符串中进行很多搜索。大多数情况下,这些将属于StartsWith和Contains类型。

I have the following situation: I have a big collection of strings (lets say 250.000+) of average length of maybe 30. What I have to do is to do many searches within these .. mostly those will be of StartsWith and Contains kind.

该集合在运行时是静态的。这意味着选择集合的初始读取和填充仅完成一次。因此,构建数据结构的性能绝对不重要。内存也不是问题:这也意味着我不介意在需要时使用两个具有相同数据的集合(例如,一个用于startwith,另一个用于contains)。
唯一重要的是搜索的性能,该搜索应返回与搜索条件匹配的所有元素。

The collection is static at runtime. Which means the initial reading and filling of the collection of choice is done only once .. therefore the performance of building the datastructure is absolutely not important. Memory is also not a problem: which also means that I don't mind having two collections with the same data in each if needed (like one for the startswith and another for contains). Only thing that matters is performance of the searches which should return all elements which match the searchcondition.

首先,我遇到了Trie或Radix树。.但是也许还有更好的选择?

For startswith I came upon a Trie or Radix-tree .. but maybe there are even better choices?

对于包含..我还没有一个好主意(除了在列表上运行linq查询之外,使用该数据量不会很快)。

For contains .. I have no good idea yet at all (beside running a linq query on a list which wont be very fast with that amount of data).

谢谢预先每个人!

更新:我忘了一个重要的部分:包含表示集合中没有完全匹配的内容..但我想在集合中找到所有字符串包含给定的搜索字符串

update: I forgot an important part: with Contains i mean no exact matches in the collection .. but i want to find all strings in the collection which contain the given searchstring

推荐答案

构建后缀树将允许您在 O(1)中并行搜索所有字符串。我的书呆子不禁要注意,它实际上是 O(n + m),其中 n 是数字与您的子字符串匹配的字符串,而 m 就是要查询的子字符串的大小。

Building a suffix tree will allow you to do a substring search on all your strings in parallel in O(1). The pedantic in me can't help but note that it's really O(n + m) where n is the number of strings that match your substring and m is the size of the substring being queried.

那么后缀树是什么你问?在其最基本的实现中,它是一个带有特殊插入方法的特里:除了添加字符串外,它还将该字符串的所有可能后缀添加到特里。在此数据结构上,子字符串搜索成为所有可能后缀的前缀搜索。由于您还想进行前缀搜索,因此,您需要在每个插入的字符串和查询子字符串的前面添加一个特殊字符。特殊字符将使您能够区分后缀和完整字符串。

So what's a suffix tree you ask? In its most basic implementation, it's a trie with a fancier insert method: in addition to adding a string, it also adds every possible suffix of that string to the trie. On this data structure, a substring search becomes a prefix search of all possible suffixes. Since you also want to do prefix searches, you'll want to add a special character in front of each inserted string and the query substrings. The special character will allow you to differentiate between a suffix and a full string.

尽管这种后缀树的实现非常简单,但效率也很低( O(n ^ 2)空间和构建时间)。幸运的是,还有其他更有效的实现可以极大地减少空间和时间范围。 Ukkonen的算法之一在答案,并将空格缩小到 O(n)。您可能还想研究后缀数组,它是后缀树的等效但更有效的表示形式

While this implementation of a suffix tree is remarkably simple, it's also very inefficient (O(n^2) space and build time). Fortunately, there are other more efficient implementations which can greatly reduce the space and time bounds. One of these, Ukkonen's algorithm, is very well explained in this SO answer and brings the space bound down to O(n). You may also want to look into suffix arrays which are an equivalent but more efficient representation of suffix trees.

虽然我知道还有很多后缀树的实现(其中一个可能会成为您的用例的最佳选择),但我只是不知道认识他们。建议您在确定实施方案之前,对该主题进行一些研究。

While I know there are many many more implementations of suffix trees out there (one of which would probably hit the sweet spot for your use case) I just don't know them. I recommend you do some research on the subject before you settle on an implementation.

这篇关于什么是开头为和/或包含搜索的最快的字符串收集结构/算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆