Efficient search for words in string


Question


I have a dictionary containing a list of words, and I have a URL string. I want to find all the dictionary words contained in the URL after it has been split into tokens using delimiters. Right now, I am testing each word in the dictionary against each token longer than a certain length (using Java's String.contains). For example, I search for words like "ground" in "wunderground" for www.wunderground.com.


I am sure there is a more efficient way of doing that. Any ideas?
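The approach described above could be sketched as follows. This is a minimal illustration, not the asker's actual code; the dictionary words and the minimum-length parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSearch {
    // Naive approach: for each dictionary word at or above a minimum
    // length, test it against the token with String.contains().
    static List<String> findWords(List<String> dictionary, String token, int minLen) {
        List<String> hits = new ArrayList<>();
        for (String word : dictionary) {
            if (word.length() >= minLen && token.contains(word)) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("ground", "under", "cloud");
        // Scans the whole dictionary for every token.
        System.out.println(findWords(dict, "wunderground", 5));  // [ground, under]
    }
}
```

Every token triggers a full pass over the dictionary, which is what makes this approach slow for large dictionaries.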

Answer


If you load your dictionary into a HashMap (or a HashSet, since only membership matters here), you can test each candidate substring ("wunderground", "underground", "nderground", "derground", ..., "wundergroun", ..., "under", ..., "ground", ...) very quickly: each lookup takes O(1) expected time.


To gauge efficiency, figure out how many steps each approach has to perform. We'll estimate the big-O complexity.


Your current algorithm has to loop through the entire dictionary: work proportional to the dictionary size, D entries. For each dictionary word, it calls contains(): work proportional to the length of the URL word, C chars, minus the average dictionary word length, which we'll call 5. So that takes on the order of D * (C - 5) steps, i.e. O(D * (C - 5)), for each word in the URL.


After you build a hash table, the cost of a lookup is independent of the number of entries. Each URL term of C chars has C² substrings. If you prune that to substrings of at least 5 chars, that's (C - 5)² substrings. [Well, technically it's (C - 5) * (C - 4) / 2, but we're figuring asymptotic complexity, which is the big-picture approximation.] So the cost to look them all up in the dictionary is (C - 5)² steps. Again, that's for each word in the URL and independent of the dictionary size.


Let's say your dictionary has 10,000 entries and the average URL term is 10 chars long. Then the old algorithm takes 50,000 steps per URL term while the hash algorithm takes 25 steps per URL term. Make sense?
