使用通配符查找单词的高效数据结构 [英] Efficient data structure for word lookup with wildcards

查看:20
本文介绍了使用通配符查找单词的高效数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要将一系列用户输入的词与一个大的词词典进行匹配(以确保输入的值存在).

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).

所以如果用户输入:

"orange" it should match an entry "orange' in the dictionary.

现在问题是用户还可以输入通配符或一系列通配符,例如 say

Now the catch is that the user can also enter a wildcard or series of wildcard characters like say

"or__ge" which would also match "orange"

关键要求是:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.  

如果单词列表的大小很小,我可以使用包含所有单词的字符串并使用正则表达式.

If the size of the word list was small I could use a string containing all the words and use regular expressions.

然而,鉴于单词列表可能包含数十万个条目,我认为这行不通.

however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.

那么某种树"是解决这个问题的方法吗...?

So is some sort of 'tree' be the way to go for this...?

对此的任何想法或建议将不胜感激!

Any thoughts or suggestions on this would be totally appreciated!

提前致谢,马特

推荐答案

按照 Appel 和 Jacobsen 关于世界上最快的拼字游戏程序的论文(免费副本 在哥伦比亚).为了您的搜索,您将遍历此图并维护一组指针:在一个字母上,您对带有该字母的子项进行确定性转换;在通配符上,您将所有子项添加到集合中.

Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.

效率将与 Thompson 对 grep 的 NFA 解释大致相同(它们是相同的算法).DAWG 结构非常节省空间——远不止存储单词本身.而且很容易实现.

The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.

最坏情况下的成本将是字母表的大小(26?)乘以通配符的数量.但是,除非您的查询以 N 个通配符开始,否则简单的从左到右的搜索在实践中会很有效.我建议禁止查询以过多的通配符开头,否则创建多个 dawg,例如,dawg 表示镜像,dawg 表示向左旋转三个字符,等等.

Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.

匹配任意序列的通配符,例如,______ 总是很昂贵,因为有很多组合的解决方案.Dawg 将非常快速地枚举所有解决方案.

Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.

这篇关于使用通配符查找单词的高效数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆