What's a simple way to efficiently find specific terms or phrases within a short unknown string?

Question

Working on a Twitter feed visualization. I have a big dataset. I only want to use tweet messages that contain specific strings of words.

I currently have this line:

data = data.filter(function(d, i) { return d.text.indexOf('new year')!=-1 ? true : false;});

It returns all the tweets in a Twitter feed that contain the string 'new year'. Works fine! :)

But how do I select multiple strings?

Actually, I want this piece to also return the tweets that contain variations like 'newyear' and/or 'happy new year' and/or 'happy 2013' and/or spelling errors etc.

Hope someone can help me..

Happy 2013!

Recommended answer

This is a pretty classic string-search / string-matching problem. First, some terminology: String matching algorithms usually refer to the search space as the 'text' - in this case, your tweet or tweets; and the 'pattern(s)' - your search terms.

The complexity of most string-matching algorithms is measured in terms of the length of the text, the length of the pattern(s), and the number of matches.

The naive approach is of course nested loops and linear search. Pseudocode:

foreach text (tweet)
    foreach pattern (search term)
        linear search the text for the pattern

That's O(t * p), where t is the total length of all texts and p is the total length of all patterns. You can probably improve considerably on this, especially if either the text or the patterns are fixed over multiple runs, allowing you to do some pre-processing for efficient search. Take a look at Wikipedia's description of string search algorithms for a few possibilities.
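Applied to the filter from the question, the naive approach might look like the sketch below. The term list, the sample tweets, and the helper name containsAny are made up for illustration, and lowercasing the text before searching is an added assumption (the original filter is case-sensitive). It only catches the variations you list explicitly, so it does nothing for spelling errors.

```javascript
// Hypothetical search terms, written in lowercase so the match is case-insensitive.
const patterns = ['new year', 'newyear', 'happy 2013'];

// Returns true if the text contains at least one of the patterns.
// This is the nested loop from the pseudocode: one linear search per pattern.
function containsAny(text, patterns) {
  const haystack = text.toLowerCase();
  return patterns.some(function (p) {
    return haystack.indexOf(p) !== -1;
  });
}

// Made-up sample data with the same shape as in the question (objects with a text field).
const tweets = [
  { text: 'Happy New Year everyone!' },
  { text: 'Just had lunch' },
  { text: 'happy 2013 to all' }
];

// Keep only the tweets that mention at least one term.
const filtered = tweets.filter(function (d) {
  return containsAny(d.text, patterns);
});
```

For a handful of terms and tweet-sized texts this is likely fast enough; the algorithms mentioned in this answer start to matter as the number of patterns grows.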

Your choice of a specific algorithm will probably depend on your memory constraints and the trade-off between pre-processing time and runtime complexity. But I'll throw out a couple things to look at. It sounds like your patterns are probably fixed, and that your text may vary (searching different twitter feeds?), so you might want to look at the Aho-Corasick algorithm. You might find a suffix tree a useful data structure as well. The links from those Wikipedia pages, and a Google search for those terms should help you get started (you might even find implemented code to help, although I don't do JavaScript, so I wouldn't know what to recommend there).
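To give a rough idea of what Aho-Corasick involves, here is a minimal JavaScript sketch, not a tuned or library-grade implementation. The function names buildAutomaton and findMatches and the node layout are my own choices for illustration. The automaton is built once from the fixed patterns, and each text is then scanned in a single pass.

```javascript
// Build an Aho-Corasick automaton from a fixed list of patterns.
// Each node holds its child transitions, a failure link, and the patterns ending there.
function buildAutomaton(patterns) {
  const nodes = [{ next: new Map(), fail: 0, out: [] }]; // index 0 is the root

  // Step 1: insert every pattern into a trie.
  for (const pattern of patterns) {
    let state = 0;
    for (const ch of pattern) {
      if (!nodes[state].next.has(ch)) {
        nodes.push({ next: new Map(), fail: 0, out: [] });
        nodes[state].next.set(ch, nodes.length - 1);
      }
      state = nodes[state].next.get(ch);
    }
    nodes[state].out.push(pattern);
  }

  // Step 2: breadth-first pass to compute failure links.
  // Depth-1 nodes keep their default failure link to the root.
  const queue = Array.from(nodes[0].next.values());
  for (let head = 0; head < queue.length; head++) {
    const current = queue[head];
    for (const [ch, child] of nodes[current].next) {
      // Follow failure links until some state has a transition on ch.
      let f = nodes[current].fail;
      while (f !== 0 && !nodes[f].next.has(ch)) f = nodes[f].fail;
      const target = nodes[f].next.get(ch);
      nodes[child].fail = target !== undefined ? target : 0;
      // Inherit the patterns that end at the failure target.
      nodes[child].out = nodes[child].out.concat(nodes[nodes[child].fail].out);
      queue.push(child);
    }
  }
  return nodes;
}

// Scan the text once and return the set of patterns that occur in it.
function findMatches(nodes, text) {
  const found = new Set();
  let state = 0;
  for (const ch of text) {
    while (state !== 0 && !nodes[state].next.has(ch)) state = nodes[state].fail;
    state = nodes[state].next.has(ch) ? nodes[state].next.get(ch) : 0;
    for (const pattern of nodes[state].out) found.add(pattern);
  }
  return found;
}
```

With an automaton built once, for example buildAutomaton(['new year', 'newyear', 'happy 2013']), a filter callback could keep a tweet when findMatches(automaton, d.text.toLowerCase()).size > 0. As with any exact-match approach, spelling errors would still need separate handling.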
