算法来查找多个字符串匹配 [英] Algorithm to find multiple string matches

查看:127
本文介绍了算法来查找多个字符串匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在寻找一个有效的算法寻找所有匹配的大段文字的建议。条款来搜索将被包含在列表中,并且可以具有1000 +可能性。搜索条件可以是1个或多个字。

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.

很显然,我可以通过比对每一个搜索词的文本进行多次传递。不是太有效率。

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

我想过订货的搜索字词,结合公共子段。这样,我可以迅速消除大量的术语。语言是C ++,我可以使用boost。

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.

搜索词的例子可能是世界500强企业名称的列表。

An example of search terms could be a list of Fortune 500 company names.

想法?

推荐答案

这个问题已被深入的研究。奇怪的是,搜索一个模式/串最好的算法,不要轻易推断多串匹配。

Don´t reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

的grep的家庭实现了非常有效的方式多字符串搜索。如果你可以使用它们作为外部程序,做到这一点。

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

如果你真的需要实现的算法,我认为最快的方法是复制什么agrep做(agrep擅长多串匹配!)。 这里是源和可执行文件。

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

在这里, 你会发现一个文件,介绍了常用的算法,理论背景,和大量的信息和关于字符串匹配指针。

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

务必注意:多字符串匹配已经被人们所喜爱克努特,博耶,摩尔,巴埃萨,耶茨,以及其他大量研究。如果你需要一个非常快的算法不犹豫的站在自己宽阔的肩膀。不要重新发明轮子。

A note of caution: multiple-string matching have been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm don't hesitate on standing on their broad shoulders. Don't reinvent the wheel.

这篇关于算法来查找多个字符串匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆