Good algorithm and data structure for looking up words with missing letters?

Question

So I need to write an efficient algorithm for looking up words with missing letters in a dictionary, and I want the set of possible words.

For example, if I have th??e, I might get back these, those, theme, there, etc.

I was wondering if anyone can suggest some data structures or algorithms I should use.

Thanks!

A Trie is too space-inefficient and would make it too slow. Any other ideas or modifications?

UPDATE: There will be up to TWO question marks, and when two question marks do occur, they will occur in sequence.

Currently I am using 3 hash tables for when it is an exact match, 1 question mark, and 2 question marks. Given a dictionary I hash all the possible words. For example, if I have the word WORD, I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, WO?? into the dictionary. Then I use a linked list to link the collisions together. So say hash(W?RD) = hash(STR?NG) = 17. hashtab(17) will point to WORD and WORD points to STRING because it is a linked list.
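For illustration, here is a minimal Ruby sketch of the key-generation step described above (the helper name patterns is invented for this example; the asker's actual code may differ):

def patterns(word)
  keys = [word]                    # exact match
  word.length.times do |i|         # all variants with one "?"
    k = word.dup
    k[i] = "?"
    keys << k
  end
  (word.length - 1).times do |i|   # all variants with two adjacent "?"s
    k = word.dup
    k[i, 2] = "??"
    keys << k
  end
  keys
end

patterns("WORD")
# => ["WORD", "?ORD", "W?RD", "WO?D", "WOR?", "??RD", "W??D", "WO??"]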

The average lookup time for one word is about 2e-6 s. I am looking to do better, preferably on the order of 1e-9 s.

I haven't looked at this problem again, but it took 0.5 seconds to insert 3 million entries and 4 seconds to look up 3 million entries.

Thanks!

Answer

I believe in this case it is best to just use a flat file where each word is on its own line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

This is working Ruby code for this problem:

def query(str, data)
  # turn the query into an anchored regex, e.g. "?r?te" -> /^.r.te$/
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    # find the next match at or after idx; "^" and "$" match at line boundaries
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      idx += str.size + 1  # skip past the matched word and its newline
    end
  end while idx
end

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time

The file wordlist.txt contains 45425 words (downloadable here). The program's output for query ?r?te is:

brute
crate
Crete
grate
irate
prate
write
wrote
0.013689

So it takes only about 14 milliseconds to both read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even where a Trie is very slow:

Query ????????????????e:

counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681

Query ?h?a?r?c?l?:

theatricals
0.013608

This looks fast enough for me.

If you want to go even faster, you can split the wordlist into strings that contain words of equal lengths and just search the correct one based on your query length. Replace the last 5 lines with this code:

def query_split(str, data)
  # search only the bucket of words whose length matches the query
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data: bucket the words by length
# (each_line keeps the trailing newline, hence the -1)
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length-1] += w
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words of that length):

  • ?r?te 0.001112 sec
  • ?h?a?r?c?l? 0.000852 sec
  • ????????????????e 0.000169 sec

Since you have changed your requirements, you can easily expand on your idea to use just one big hash table that contains all precalculated results. But instead of working around collisions yourself, you can rely on the performance of a properly implemented hash table.

Here I create one big hashtable, where each possible query maps to a list of its results:

def create_big_hash(data)
  h = Hash.new do |h,k|
    h[k] = Array.new
  end    
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length-1).times do |i|
      q = String.new(w)      
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data    
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data
#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts (Time.new - t)

The output is:

4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05

The query performance is O(1); it is just a lookup in the hash table. The time 2.0e-05 is probably below the timer's precision. When running it 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use the Google Sparse Hash, which is extremely memory-efficient and fast.
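For reference, the 1000-run average can be measured with a simple loop like the following, reusing the h built above (the query string and iteration count are just examples):

t = Time.new
1000.times { h["?ood"] }
puts (Time.new - t) / 1000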

All above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

  • Tries for Approximate String Matching - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
  • Agrep - A Fast Approximate Pattern-Matching Tool - Agrep is based on a new efficient and flexible algorithm for approximate string matching.
  • Google Scholar search for approximate string matching - More than enough to read on this topic.
