用通配符进行字查找的高效数据结构 [英] Efficient data structure for word lookup with wildcards

查看:298
本文介绍了用通配符进行字查找的高效数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



所以如果用户输入:

我需要匹配一系列用户输入的单词与一个大词典(以确保输入的值存在) / p>

 橙色它应该匹配字典中的橙色条目
/ pre>

现在,用户还可以输入通配符或一系列通配符,如说

 or__ge,这也匹配橙色

关键要求是:

  *这应该尽可能快

*使用最小的金额的内存来实现它

如果单词列表的大小很小,我可以使用一个包含所有的单词和使用正则表达式。



但是,由于单词列表可能包含可能成千上万的单词,我假设这将无法正常工作。 >

所以有些树是w ay去去这个...?



对此的任何想法或建议将是非常感谢!



提前感谢
Matt

解决方案

将您的单词列表放在DAWG(定向非循环字图)中,如 Appel和Jacobsen关于世界上最快的拼字游戏计划的论文免费副本在哥伦比亚)。对于您的搜索,您将遍历此图维持一组指针:在一封信上,您对该字母的孩子进行确定性过渡;在通配符上,您将所有孩子添加到集合中。



效率与Thompson对于grep的NFA解释大致相同(它们是相同的算法)。 DAWG结构非常具有空间效率,远远超过了仅存储单词本身。而且很容易实现。



最差的成本将是通配符数量的字母(26?)的大小。但除非您的查询开始使用N个通配符,否则一个简单的从左到右的搜索将在实践中运行良好。我建议禁止一个查询从太多的通配符开始,或者创建多个dawgs,例如dawg镜像,dawg左旋三个字符等等。



匹配任意序列的通配符,例如 ______ 总是会变得昂贵,因为组合上有许多解决方案。 dawg将很快列举所有解决方案。


I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).

So if the user entered:

"orange" it should match an entry "orange' in the dictionary.

Now the catch is that the user can also enter a wildcard or series of wildcard characters like say

"or__ge" which would also match "orange"

The key requirements are:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.  

If the size of the word list was small I could use a string containing all the words and use regular expressions.

however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.

So is some sort of 'tree' be the way to go for this...?

Any thoughts or suggestions on this would be totally appreciated!

Thanks in advance, Matt

解决方案

Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.

The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.

Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.

Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.

这篇关于用通配符进行字查找的高效数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆