Fast algorithm for searching for substrings in a string


Problem description


I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - INSTR:

"BCDEFGH"

And a set of candidate strings - CAND:

"AB", "CDE", "FG", "H", "IJ"

Find any CAND strings that match as substrings within INSTR.

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ").

There could be many thousands of candidate strings (in CAND), but more importantly I will be doing this search many millions of times, so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not interested in architectural solutions, like distributing the search, just the most efficient function/algorithm for doing it locally.

Additionally, the strings in CAND and INSTR will all be relatively small (< 50 chars), i.e. the target string INSTR is NOT long relative to the candidate strings.


Update: I should have mentioned that the set of CAND strings is invariant across all values of INSTR.

Update: I only need to know that there was a match; I don't need to know what the match was.

Final update: I opted to try Aho-Corasick and Rabin-Karp, due to simplicity of implementation. Because I have variable-length patterns, I used a modified Rabin-Karp that hashes only the first n characters of each pattern, where n is the length of the shortest pattern; n was then also the length of my rolling substring search window. For Aho-Corasick I used this

In my test I searched for 1000 patterns in two newspaper articles, averaged across 1000 iterations. Normalised times to complete were:

Aho-Corasick: 1

Rabin-Karp: 1.8

Naive search (check each pattern and use string.contains): 50
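The modified Rabin-Karp described in the final update might look roughly like the sketch below. This is not the code that was benchmarked: the class name `PrefixRabinKarpSketch`, the method `containsAny`, the base 31 and the use of wrapping `int` arithmetic as the hash are all assumptions made for illustration. It hashes the first n characters of every pattern (n being the shortest pattern length), rolls a window of length n over the input, and verifies a full pattern only when the window hash matches a stored prefix hash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: Rabin-Karp over pattern prefixes of length n = shortest pattern length. */
final class PrefixRabinKarpSketch {
    private static final int BASE = 31;  // arbitrary choice; int overflow acts as the modulus
    private final int n;                 // window length = length of the shortest pattern
    private final int highPower;         // BASE^(n-1), used to drop the leading character
    private final Map<Integer, List<char[]>> byPrefixHash = new HashMap<>();

    PrefixRabinKarpSketch(String[] patterns) {
        if (patterns.length == 0) throw new IllegalArgumentException("need at least one pattern");
        int min = Integer.MAX_VALUE;
        for (String p : patterns) min = Math.min(min, p.length());
        if (min == 0) throw new IllegalArgumentException("empty patterns are not supported");
        n = min;
        int pow = 1;
        for (int i = 1; i < n; i++) pow *= BASE;
        highPower = pow;
        // Group patterns by the hash of their first n characters.
        for (String p : patterns) {
            char[] chars = p.toCharArray();
            byPrefixHash.computeIfAbsent(hash(chars, 0), k -> new ArrayList<>()).add(chars);
        }
    }

    /** Polynomial hash of the n characters starting at 'start'. */
    private int hash(char[] s, int start) {
        int h = 0;
        for (int i = start; i < start + n; i++) h = h * BASE + s[i];
        return h;
    }

    /** Returns true as soon as any pattern is found in text. */
    boolean containsAny(char[] text) {
        if (text.length < n) return false;
        int h = hash(text, 0);
        for (int start = 0; ; start++) {
            List<char[]> candidates = byPrefixHash.get(h);
            if (candidates != null) {
                // Hash hit: confirm with a character-by-character comparison.
                for (char[] p : candidates) {
                    if (matchesAt(text, start, p)) return true;
                }
            }
            if (start + n >= text.length) return false; // no further window fits
            // Roll the window one character to the right.
            h = (h - text[start] * highPower) * BASE + text[start + n];
        }
    }

    private static boolean matchesAt(char[] text, int start, char[] p) {
        if (start + p.length > text.length) return false;
        for (int i = 0; i < p.length; i++) {
            if (text[start + i] != p[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        PrefixRabinKarpSketch rk =
                new PrefixRabinKarpSketch(new String[] {"AB", "CDE", "FG", "H", "IJ"});
        System.out.println(rk.containsAny("BCDEFGH".toCharArray()));
    }
}
```

Since CAND is fixed, the constructor runs once and only `containsAny` is called per INSTR. Note that a single very short pattern forces n to be small, which makes prefix hash hits (and therefore verifications) more frequent.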


Some resources describing the algorithms mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Solution

Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.
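To give a feel for Aho-Corasick in the setting of the question (a fixed pattern set, and only a yes/no answer needed), here is a minimal sketch. The class name `AhoCorasickSketch` and the method `containsAny` are hypothetical, and the boxed collections are used only to keep the example short; a tuned version would likely use primitive arrays. The automaton is built once from CAND; each INSTR is then scanned in a single pass.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Minimal Aho-Corasick sketch that only answers "does any pattern occur?". */
final class AhoCorasickSketch {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // trie edges per node
    private final List<Integer> fail = new ArrayList<>();                 // failure link per node
    private final List<Boolean> terminal = new ArrayList<>();             // some pattern ends at this state

    AhoCorasickSketch(String[] patterns) {
        newNode(); // node 0 is the root
        for (String p : patterns) {
            if (p.isEmpty()) continue; // empty patterns are ignored in this sketch
            int node = 0;
            for (char c : p.toCharArray()) {
                Integer child = next.get(node).get(c);
                if (child == null) {
                    child = newNode();
                    next.get(node).put(c, child);
                }
                node = child;
            }
            terminal.set(node, true);
        }
        buildFailureLinks();
    }

    private int newNode() {
        next.add(new HashMap<>());
        fail.add(0);
        terminal.add(false);
        return next.size() - 1;
    }

    /** Breadth-first pass: each node's link points to its longest proper suffix present in the trie. */
    private void buildFailureLinks() {
        Queue<Integer> queue = new ArrayDeque<>();
        for (int child : next.get(0).values()) queue.add(child); // depth-1 nodes fail to the root
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (Map.Entry<Character, Integer> e : next.get(node).entrySet()) {
                char c = e.getKey();
                int child = e.getValue();
                int f = fail.get(node);
                while (f != 0 && !next.get(f).containsKey(c)) f = fail.get(f);
                Integer target = next.get(f).get(c);
                int link = (target != null) ? target : 0;
                fail.set(child, link);
                // A state is terminal if a pattern ends here or at any suffix state.
                terminal.set(child, terminal.get(child) || terminal.get(link));
                queue.add(child);
            }
        }
    }

    /** Scans text once and returns true as soon as any pattern has been seen. */
    boolean containsAny(char[] text) {
        int node = 0;
        for (char c : text) {
            while (node != 0 && !next.get(node).containsKey(c)) node = fail.get(node);
            Integer target = next.get(node).get(c);
            node = (target != null) ? target : 0;
            if (terminal.get(node)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac =
                new AhoCorasickSketch(new String[] {"AB", "CDE", "FG", "H", "IJ"});
        System.out.println(ac.containsAny("BCDEFGH".toCharArray()));
    }
}
```

Because only the existence of a match matters, the sketch folds the usual output links into a single `terminal` flag per state and stops at the first hit.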

If the input is not too large, you don't want to repeat the search many times, and you do not have many patterns, it might be a good idea to simply run a single-pattern algorithm once per pattern. The Wikipedia article on string-searching algorithms lists many algorithms together with their running and preprocessing times.
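For comparison, the "single-pattern algorithm several times" approach can be as simple as the following sketch, which calls the standard `String.contains` once per pattern. The class and method names are hypothetical. This corresponds to the naive baseline in the question's measurements.

```java
/** Sketch: run a single-pattern search once per candidate pattern. */
final class NaiveMultiSearchSketch {
    /** Returns true as soon as any pattern occurs in text. */
    static boolean containsAny(String text, String[] patterns) {
        for (String pattern : patterns) {
            if (text.contains(pattern)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] cand = {"AB", "CDE", "FG", "H", "IJ"};
        System.out.println(containsAny("BCDEFGH", cand));
        System.out.println(containsAny("XYZ", cand));
    }
}
```

The cost grows with the number of patterns for every input string, which is why it tends to fall behind once CAND contains thousands of entries.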

Implementations:

Presentations:
