查找多个字符串匹配的算法 [英] Algorithm to find multiple string matches

查看:30
本文介绍了查找多个字符串匹配的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找有关在大量文本中查找所有匹配项的有效算法的建议.要搜索的术语将包含在一个列表中,并且可以有 1000 多种可能性.搜索词可以是 1 个或多个字词.

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.

显然,我可以多次通过文本与每个搜索词进行比较.效率不高.

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

我考虑过对搜索词进行排序并组合常见的子细分.这样我就可以快速消除大量的术语.语言是 C++,我可以使用 boost.

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.

搜索词的示例可以是财富 500 强公司名称列表.

An example of search terms could be a list of Fortune 500 company names.

想法?

推荐答案

不要重新发明轮子

这个问题已被深入研究.奇怪的是,搜索 ONE 模式/字符串的最佳算法并不能轻易外推到多字符串匹配.

Don´t reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

"grep" 系列以非常有效的方式实现了多字符串搜索.如果您可以将它们用作外部程序,那就去做吧.

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

如果您真的需要实现该算法,我认为最快的方法是重现 agrep 的功能(agrep 在多字符串匹配方面表现出色!).这里是源文件和可执行文件.

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

这里你会找到一篇描述所用算法、理论背景和很多关于字符串匹配的信息和指针.

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

注意:多字符串匹配已经被 Knuth、Boyer、Moore、Baeza-Yates 等人进行了大量研究.如果您需要一个非常快速的算法,请不要犹豫站在他们宽阔的肩膀上.不要重新发明轮子.

A note of caution: multiple-string matching have been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm don't hesitate on standing on their broad shoulders. Don't reinvent the wheel.

这篇关于查找多个字符串匹配的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆