如何使这个Python Scrabble单词查找器变得更快? [英] How can this Python Scrabble word finder be made faster?

查看:116
本文介绍了如何使这个Python Scrabble单词查找器变得更快?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我没有真正需要改进的地方,这只是为了好玩.现在,在大约20万个单词的列表中,大约需要花费一秒钟的时间.

I have no real need to improve it, it's just for fun. Right now it's taking about a second on a list of about 200K words.

我已尽我所能对其进行了尽可能多的优化(使用生成器而不是列表推导有很大的不同),并且我已经没有足够的想法了.

I've tried to optimize it as much as I know how (using generators instead of list comprehensions made a big difference), and I've run out of ideas.

你有吗?

#!/usr/bin/env python
# let's cheat at scrabble

def count_letters(word):
  count = {} 
  for letter in word:
    if letter not in count: count[letter] = 0
    count[letter] += 1 
  return count 

def spellable(word, rack):
    word_count = count_letters(word)
    rack_count  = count_letters(rack)
    return all( [word_count[letter] <= rack_count[letter] for letter in word] )  

score = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2, 
         "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3, 
         "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1, 
         "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4, 
         "x": 8, "z": 10}

def score_word(word):
  return sum([score[c] for c in word])

def word_reader(filename):
  # returns an iterator
  return (word.strip() for word in  open(filename)) 

if __name__ == "__main__":
  import sys
  if len(sys.argv) == 2: 
    rack = sys.argv[1].strip()
  else:
    print """Usage: python cheat_at_scrabble.py <yourrack>"""
    exit()

  words = word_reader('/usr/share/dict/words')
  scored =  ((score_word(word), word) for word in words if set(word).issubset(set(rack)) and len(word) > 1 and spellable(word, rack))

  for score, word in sorted(scored):
    print str(score), '\t', word

推荐答案

在不偏离基本代码的情况下,以下是一些相当简单的优化:

Without going too far from your basic code, here are some fairly simple optimizations:

首先,将单词阅读器更改为:

First, change your word reader to be:

def word_reader(filename, L):
  L2 = L+2
  # returns an iterator
  return (word.strip() for word in open(filename) \
          if len(word) < L2 and len(word) > 2)

并命名为

words = word_reader('/usr/share/dict/words', len(rack))

在我建议的所有更改中,这是最大的改进.它消除了太长或太短的单词,直到我们在过程中走得太远.请记住,在我的比较中,word没有换行符.我假设使用'\ n'行分隔符.另外,列表中的最后一个单词可能有问题,因为它的结尾可能没有换行符,但是在我的计算机上,最后一个单词是études,无论如何我们的方法都找不到.当然,您可以预先从原始字典中创建自己的字典,以删除无效的字典:长度不正确或字母大小超过a-z的

This gives the biggest improvement of all of my suggested changes. It eliminates words that are too long or short before we get too far in the process. Remember that word is unstripped of new line characters in my comparisons. I assumed '\n' line separators. Also, there could be a problem with the last word in the list because it probably won't have a new line at the end of it, but on my computer the last word is études which won't be found with our method anyway. Of course, you could just create your own dictionary beforehand from the original that removes those that aren't valid: those that aren't the right length or have letters outsize of a-z.

接下来,Ferran建议为机架组设置一个变量,这是一个好主意.但是,从每个单词中选出一个集合也会使您的工作变得相当缓慢.完全使用这些设置的目的是淘汰很多根本没有任何设置的设置,从而加快速度.但是,我发现在打电话给可拼的单词之前,只检查单词的第一个字母是否在架子上就更快了.

Next, Ferran suggested a variable for the rack set, which is a good idea. However, you are also getting a pretty major slow down from making a set out of every word. The purpose of using the sets at all was to weed out a lot of the ones that don't have any shot at all and thereby give a speed-up. However, I found it was even faster to just check if the first letter of the word was in the rack before calling spellable:

rackset = frozenset(rack)
scored =  [(score_word(word), word) for word in words if word[0] in rackset \
           and spellable(word, rack)]

但是,必须同时将其更改为可拼写.我将其更改为以下内容:

However, this has to be accompanied by a change to spellable. I changed it to the following:

def spellable(word, rack):
    return all( [rack.count(letter) >= word.count(letter) \
                 for letter in set(word)] )

即使没有上一步的更改,它也比您现在拥有的要快.

which, even without the changes in the previous step, is faster than what you currently have.

通过上述三个更改,代码比我的简单测试快了约3倍.

With the three changes above, the code was about 3x faster from my simple tests.

采用更好的算法

由于您实际上是在寻找字谜,因此使用字谜字典是很有意义的.字谜字典将字典中的每个单词组合在一起(如果它们是字谜).例如,"takes"和"skate"彼此相似,因为它们在排序时都等于"aekst".我创建了一个字谜字典作为文本文件,其格式为每行构成一个条目.每个条目都具有字谜的排序版本的排序版本,然后是字谜本身.例如,我使用的条目是

Since what you are really doing is looking for anagrams, it makes sense to use an anagram dictionary. An anagram dictionary takes each word in a dictionary and groups them if they are anagrams. For instance, 'takes' and 'skate' are anagrams of each other because they are both equal to 'aekst' when sorted. I created an anagram dictionary as a text file with the format where on each line constitutes an entry. Each entry has the sorted version of the sorted version of the anagrams and then the anagrams themselves. For the example I'm using the entry would be

aekst skate takes

然后,我可以将机架字母的组合进行组合,并在字谜字典中对每个字母进行二进制搜索,以查看是否存在匹配项.对于7个字母的架子,最多可以有120个唯一的拼字游戏有效字母组合.执行二进制搜索为O(log(N)),因此这将非常快.

Then I can just take combinations of the rack letters and do a binary search for each one in the anagram dictionary to see if there is a match. For a 7-letter rack, there is a maximum of 120 unique scrabble-valid combinations of the letters. Performing a binary search is O(log(N)) so this will be very fast.

我分两部分实现了该算法.第一个制作字谜字典,第二个制作实际的拼字游戏作弊程序.

I implemented the algorithm in two parts. The first makes the anagram dictionary and the second is the actual scrabble cheating program.

Anagram词典创建者代码

f = open('/usr/share/dict/words')
d = {}
lets = set('abcdefghijklmnopqrstuvwxyz\n')
for word in f:
  if len(set(word) - lets) == 0 and len(word) > 2 and len(word) < 9:
    word = word.strip()
    key = ''.join(sorted(word))
    if key in d:
      d[key].append(word)
    else:
      d[key] = [word]
f.close()
anadict = [' '.join([key]+value) for key, value in d.iteritems()]
anadict.sort()
f = open('anadict.txt','w')
f.write('\n'.join(anadict))
f.close()

抓取作弊代码

from bisect import bisect_left
from itertools import combinations
from time import time

def loadvars():
  f = open('anadict.txt','r')
  anadict = f.read().split('\n')
  f.close()
  return anadict

scores = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2, 
         "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3, 
         "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1, 
         "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4, 
         "x": 8, "z": 10}

def score_word(word):
  return sum([scores[c] for c in word])

def findwords(rack, anadict):
  rack = ''.join(sorted(rack))
  foundwords = []
  for i in xrange(2,len(rack)+1):
    for comb in combinations(rack,i):
      ana = ''.join(comb)
      j = bisect_left(anadict, ana)
      if j == len(anadict):
        continue
      words = anadict[j].split()
      if words[0] == ana:
        foundwords.extend(words[1:])
  return foundwords

if __name__ == "__main__":
  import sys
  if len(sys.argv) == 2:
    rack = sys.argv[1].strip()
  else:
    print """Usage: python cheat_at_scrabble.py <yourrack>"""
    exit()
  t = time()
  anadict = loadvars()
  print "Dictionary loading time:",(time()-t)
  t = time()
  foundwords = set(findwords(rack, anadict))
  scored = [(score_word(word), word) for word in foundwords]
  scored.sort()
  for score, word in scored:
    print "%d\t%s" % (score,word)
  print "Time elapsed:", (time()-t)

七巧板字典的创建者在我的计算机上花费了大约半秒钟的时间.创建字典后,运行拼字游戏作弊程序的速度比OP的代码快 15x ,而在我进行上述更改后,运行的拼字作弊程序的速度比OP的代码快5倍.另外,加载字典的启动时间比实际从机架中搜索单词要大得多,因此这是一次进行多次搜索的更好方法.

The anagram dictionary creator takes about a half a second on my machine. When the dictionary is already created, running the scrabble cheating program is about 15x faster than the OP's code and 5x faster than OP's code after my aforementioned changes. Also, the start-up time of loading the dictionary is much larger than actually searching for words from a rack, so this is a much better way for doing multiple searches at once.

这篇关于如何使这个Python Scrabble单词查找器变得更快?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆