高效的数据结构的字查找使用通配符 [英] Efficient data structure for word lookup with wildcards

查看:188
本文介绍了高效的数据结构的字查找使用通配符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要一系列用户inputed字匹配的话大词典(确保输入的值存在)。

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).

因此​​,如果用户输入:

So if the user entered:

"orange" it should match an entry "orange' in the dictionary.

现在的缺点是,用户还可以输入通配符或一系列的通配符就好说了。

Now the catch is that the user can also enter a wildcard or series of wildcard characters like say

"or__ge" which would also match "orange"

关键的要求是:

The key requirements are:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.  

如果单词列表的大小,小的时候可以使用一个包含所有单词的字符串和使用正则EX pressions。

If the size of the word list was small I could use a string containing all the words and use regular expressions.

不过鉴于单词列表可能包含潜在enteries几十万我假设这是行不通的。

however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.

那么,某种形式的树的路来走这...?

So is some sort of 'tree' be the way to go for this...?

在这个任何想法或建议,将完全AP preciated!

Any thoughts or suggestions on this would be totally appreciated!

在此先感谢, 马特

推荐答案

放入中的阿佩尔和雅各布森的在世界上最快的拼字游戏计划(的免费副本哥伦比亚)。为了您的搜索,你会遍历这个图保持了一组指针:上一个字母,你犯了一个确定性的过渡到孩子的信;在通配符,添加所有的孩子设置的。

Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.

效率会大致相同,汤普森的NFA除pretation grep的(他们是相同的算法)。该DAWG结构的非常的空间高效—远更比只是存储的话自己。而且很容易实现。

The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.

最差情况下的成本会升高至通配符的数目的功率字母表(26?)的大小。但是,除非您的查询的开始的用N通配符,一个简单的左到右搜索将很好地工作在实践中。我建议禁止查询开始有太多的通配符,或者创建多个dawgs,如耶的镜像,耶为左移三个字符,等等。

Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.

通配符匹配的任意字符序列,例如, ______ 总是将是昂贵的,因为有组合地许多解决方案。本耶将枚举所有的解决方案非常迅速。

Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.

这篇关于高效的数据结构的字查找使用通配符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆