从较小的字符串中的一大组字符串中查找所有匹配项 [英] Finding all matches from a large set of Strings in a smaller String

查看:119
本文介绍了从较小的字符串中的一大组字符串中查找所有匹配项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有大量的单词和短语(词典或词典),其中包含通配符。我需要在一个较小的字符串(目前约150个字符)中找到这些单词和短语的所有实例。

I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).

最初,我想反向运行该操作;这是要检查我的较小字符串中的每个单词是否在Lexicon中存在,可以将其实现为哈希表。问题是我的词典中的某些值不是单词,而很多是通配符(例如substri *)。

Initially, I wanted to run the operation in reverse; that is to check to see if each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of these values in my Lexicon are not single words and many are wildcards (e.g. substri*).

我正在考虑使用 Rabin-Karp算法,但我不确定这是最佳选择。

I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.

什么是执行此操作的有效算法或方法?

示例数据

该词典包含数百个单词,并且可能会扩展。这些词可能以通配符(星号)结尾。以下是一些随机示例:

The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:




  • 已释放*

  • 粗心*

  • 重大损失

  • good
  • bad
  • freed*
  • careless*
  • great loss

我们正在分析的文本(此时)是简短的,非正式的(语法上的)英语陈述。文本的主要示例(同样是在此时)是Twitter Tweet。这些限制为大约140个字符。例如:

The text we are analyzing (at this point) are short, informal (grammar-wise) English statements. The prime example of text (again, at this point in time) would be a Twitter Tweet. These are limited to roughly 140 Characters. For example:

Just got the Google nexus without a contract. Hands down its the best phone 
I've ever had and the only thing that could've followed my N900.

虽然可能需要注意的是,我们对本文进行了非常简单的情感分析;我的关注点不是我们的情感分析技术。我只是将现有解决方案迁移到实时处理系统,并且需要执行一些优化。

While it may be helpful to note that we are performing very simple Sentiment Analysis on this text; our Sentiment Analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.

推荐答案

I认为这是 Aho-Corasick字符串匹配算法,专门用于在单个字符串中查找大量字符串的所有匹配项。它分两个阶段运行:第一阶段创建匹配的自动机(可以预先完成,只需要线性时间),第二阶段使用自动机来查找所有匹配项(只需要线性时间) ,再加上与比赛总数成正比的时间)。该算法还可以调整为也支持通配符搜索。

I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases - a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching as well.

希望这会有所帮助!

这篇关于从较小的字符串中的一大组字符串中查找所有匹配项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆