Good algorithm and data structure for looking up words with missing letters?


Problem description

I need to write an efficient algorithm for looking up words with missing letters in a dictionary, and I want the set of possible words.

For example, if I have th??e, I might get back these, those, theme, there, etc.

I was wondering if anyone can suggest some data structures or algorithms I should use.

Thanks!

EDIT: A trie is too space-inefficient and would make it too slow. Any other ideas or modifications?

UPDATE: There will be up to TWO question marks, and when two question marks do occur, they will occur in sequence.

Currently I am using 3 hash tables: one for exact matches, one for patterns with 1 question mark, and one for patterns with 2 question marks. Given a dictionary, I hash all the possible patterns of each word. For example, if I have the word WORD, I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, and WO?? into the dictionary. Then I use a linked list to chain the collisions together. So say hash(W?RD) = hash(STR?NG) = 17. Then hashtab(17) will point to WORD, and WORD points to STRING because it is a linked list.

The average lookup time for one word is about 2e-6 s. I am looking to do better, preferably on the order of 1e-9 s.

EDIT: I haven't looked at this problem again, but it took 0.5 seconds to insert 3m entries and 4 seconds to look up 3m entries.

Thanks!

Recommended answer

I believe in this case it is best to just use a flat file with one word per line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

This is working Ruby code for this problem:

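def query(str, data)
  # Turn each "?" into "." (any single character); ^ and $ anchor to line boundaries
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    # Find the next matching line at or after idx
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      # Skip past the matched word and its trailing newline
      idx += str.size + 1
    end
  end while idx
end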

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time

The file wordlist.txt contains 45425 words (downloadable here). The program's output for the query ?r?te is:

brute
crate
Crete
grate
irate
prate
write
wrote
0.013689

So it takes just 37 milliseconds both to read the whole file and to find all matches in it. And it scales very well for all kinds of query patterns, even where a trie is very slow:

Query ????????????????e

counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681

Query ?h?a?r?c?l?

theatricals
0.013608

This looks fast enough for me.
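As a rough sketch of the same flat-file idea, the manual index loop can be replaced by String#scan. The helper name query_scan and the inline word list are made up for illustration; the literal parts of the query are passed through Regexp.escape on the assumption that queries may contain characters other than letters and "?".

```ruby
# Hypothetical helper (not from the original answer): returns every line of
# `data` that matches `pattern`, where "?" stands for exactly one character.
def query_scan(pattern, data)
  # Split on "?" (limit -1 keeps trailing empty parts), escape the literal
  # pieces, and join them with "." so each "?" matches any single character.
  body = pattern.split("?", -1).map { |part| Regexp.escape(part) }.join(".")
  # In Ruby, ^ and $ match at line boundaries, so each word is tested whole.
  data.scan(/^#{body}$/)
end

# Small inline stand-in for the contents of wordlist.txt.
words = "brute\ncrate\ncreate\nwrite\nwrote\ntheme\n"
p query_scan("?r?te", words)  # prints ["brute", "crate", "write", "wrote"]
```

Because scan collects all matches into an array, this variant trades the streaming yield of the loop version for brevity.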

If you want to go even faster, you can split the wordlist into strings that each contain words of one length, and search only the string that matches your query length. Replace the last 5 lines of the first program (from start_time = Time.now onward) with this code:

def query_split(str, data)
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data    
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length-1] += w
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words with that length):


  • ?r?te 0.001112 sec
  • ?h?a?r?c?l? 0.000852 sec
  • ????????????????e 0.000169 sec
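The length-bucketing idea can also be sketched with arrays instead of concatenated strings. This is only an illustration under assumed names (build_length_index, query_by_length) with a tiny inline word list; it is not the code that produced the timings above.

```ruby
# Hypothetical sketch: bucket words by length, then test only the bucket
# whose length equals the query length.
def build_length_index(words)
  # Maps each length to the array of words having that length.
  words.group_by(&:length)
end

def query_by_length(pattern, index)
  # "?" becomes "." (any one character); literal parts are escaped.
  body = pattern.split("?", -1).map { |part| Regexp.escape(part) }.join(".")
  # \A and \z anchor the whole word; a missing bucket yields no candidates.
  index.fetch(pattern.length, []).grep(/\A#{body}\z/)
end

index = build_length_index(%w[brute crate create write theme food])
p query_by_length("?r?te", index)  # prints ["brute", "crate", "write"]
```

Keeping each bucket as an array makes it easy to add or remove words later, at the cost of one regex test per candidate rather than a single scan over one large string.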

Since you have changed your requirements, you can easily expand on your idea and use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself, you can rely on the performance of a properly implemented hashtable.

Here I create one big hashtable, where each possible query maps to a list of its results:

def create_big_hash(data)
  # Missing keys get a fresh array; block parameters renamed to avoid shadowing h
  h = Hash.new do |hash, key|
    hash[key] = Array.new
  end
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length-1).times do |i|
      q = String.new(w)      
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data    
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts Time.new - t

The output is:

4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05
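One detail worth keeping in mind with a hash built this way: because the default block stores a new array for every missing key, looking up a query that has no matches with h[query] adds an entry to the table. A minimal sketch of a lookup that avoids this, assuming a hypothetical wrapper named lookup:

```ruby
# Hypothetical wrapper: Hash#fetch with an explicit default bypasses the
# hash's default block, so a miss does not insert a new key.
def lookup(table, query)
  table.fetch(query, [])
end

# Table with the same kind of default block as create_big_hash.
table = Hash.new { |hash, key| hash[key] = [] }
table["?ood"].push("food", "good")

p lookup(table, "?ood")  # prints ["food", "good"]
p lookup(table, "?zzz")  # prints []
p table.size             # prints 1, the miss stored nothing
table["?zzz"]            # plain [] access on a miss stores an empty array
p table.size             # prints 2
```

This matters mostly for long-running processes that see many non-matching queries, where the table would otherwise keep growing.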

The query performance is O(1); it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running it 1000 times, I get an average of 1.958e-6 seconds per query. To make it faster, I would switch to C++ and use Google Sparse Hash, which is extremely memory-efficient and fast.
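A per-query average like this could be measured along the following lines, using Ruby's standard Benchmark module. The one-entry table is a stand-in for the real hash, so the number this prints is not comparable to the figure quoted above.

```ruby
require "benchmark"

# Tiny stand-in table (an assumption for illustration); in practice this
# would be the big hash built from the word list.
table = { "?ood" => %w[food good hood mood wood] }

runs = 1000
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
total = Benchmark.realtime do
  runs.times { table.fetch("?ood", []) }
end

# Averaging over many runs smooths out the timer's limited resolution.
average = total / runs
puts format("%.3e sec per query", average)
```

Timing a batch and dividing is more reliable than timing a single lookup, since one lookup is shorter than what the clock can resolve.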

All of the above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

  • Tries for Approximate String Matching - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
  • Agrep - A Fast Approximate Pattern-Matching Tool - Agrep is based on a new efficient and flexible algorithm for approximate string matching.
  • Google Scholar search for approximate string matching - More than enough to read on this topic.
