Algorithm: What set of tiles of length N can be used to generate the most amount of Scrabble-valid words?


Problem Description

I'm trying to create a function best_tiles which takes in the number of tiles in your hand and returns the set of tiles that allows you to produce the largest number of unique English-valid words, assuming that you can only use each tile once.

For example, with the set of tiles (A, B, C) in your hand you can produce the words CAB, BAC, AB, and BA (all of these are English words), so you can spell 4 unique words with that set. With (B, A, A), you can spell 5 words: ABA, BAA, AA, AB, and BA. The goal is to find the set of letters which allows you to spell the largest number of English-valid words (without replacement).

So if 5 were the maximum number of words that could be spelled with any combination of letters for N = 3, running best_tiles(n=3) would return (B, A, A).
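
For instance, here is a minimal sanity check of the counts above. It is only a sketch: the word list is a tiny hardcoded stand-in for a real dictionary, and can_spell is a hypothetical helper written just for this illustration.

import collections

# Tiny stand-in word list (an assumption for illustration only).
SMALL_WORDLIST = ["ABA", "BAA", "AA", "AB", "BA", "CAB", "BAC"]

def can_spell(word, tiles):
    # Spellable iff every letter of the word is available among the
    # tiles, using each tile at most once.
    available = collections.Counter(tiles)
    return all(available[ch] >= cnt for ch, cnt in collections.Counter(word).items())

print(sum(can_spell(w, "BAA") for w in SMALL_WORDLIST))  # 5: ABA, BAA, AA, AB, BA
print(sum(can_spell(w, "ABC") for w in SMALL_WORDLIST))  # 4: CAB, BAC, AB, BA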

I'm wondering how to implement this efficiently? My current approach doesn't scale well at all with the number of letters.

I read in a wordlist. In this case, I'm using enable.txt here: https://www.wordgamedictionary.com/enable/

path = "enable.txt"
words = []
with open(path, encoding='utf8') as f:
    for values in f:
        # Each entry is a list of characters, e.g. "TEA" -> ['T', 'E', 'A']
        words.append(list(values.strip().upper()))

I create a function word_in_tiles (h/t smack89) which returns whether it is possible to construct a word given a tile set:

import collections

def word_in_tiles(word, tiles):
    # The word is spellable iff, for every letter it uses, the tiles
    # provide at least that many copies (note: >=, not ==).
    tiles_counter = collections.Counter(tiles)
    return all(tiles_counter.get(ch, 0) >= cnt
               for ch, cnt in collections.Counter(word).items())
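
A quick usage check, assuming the function above is in scope (tiles can be passed as a string, since Counter treats a string as a sequence of letters):

print(word_in_tiles("TEA", "TAE"))  # True
print(word_in_tiles("TEE", "TAE"))  # False: only one E is available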

I then create a function get_all_words which produces a list of all the possible words one can spell from a word list and a tile set.

def get_all_words(tile_set, words):
    # words is a word list
    return [i for i in words if word_in_tiles(i, tile_set)]
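
For example, a quick check against the word list loaded above. Each entry of words is a list of characters, so the matches are joined back into strings; the expected output assumes enable.txt is the word list:

matches = get_all_words("TAE", words)
print(sorted("".join(w) for w in matches))
# ['AE', 'AT', 'ATE', 'EAT', 'ET', 'ETA', 'TA', 'TAE', 'TEA']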

The extremely naive approach for identifying which tileset is the "best" for three letters is the following:

I first create a list of every possible combination for a given length. So for length 3, I'd do:

import string
import itertools 

letters = string.ascii_lowercase
all_three_letter_combinations = list(itertools.combinations_with_replacement(letters, 3))

# Create a list of only the words that are three letters or fewer
three_letter = [i for i in words if len(i) <= 3]

sequence_dict = dict()
for i in all_three_letter_combinations:
    string_test = "".join(i).upper()
    sequence_dict[i] = get_all_words(string_test, three_letter)
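
To see why this brute force cannot reach N = 15, note that the number of distinct tile multisets of size N drawn from 26 letters is C(26 + N - 1, N), and the naive enumeration touches every one of them. A back-of-the-envelope check (math.comb requires Python 3.8+):

import math

for n in (3, 7, 15):
    # Multisets of size n over a 26-letter alphabet.
    print(n, math.comb(26 + n - 1, n))
# 3 3276
# 7 3365856
# 15 40225345056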

Then I remove the entries with no spellable words and pick the tile set with the longest result:

res = {k: v for k, v in sequence_dict.items() if len(v) >= 1}

def GetMaxLength(res):
    # Return (word_list, tile_set) for the tile set that yields the most words
    return max((len(v), v, k) for k, v in res.items())[1:]

Running GetMaxLength(res), you get that, for three letters, the tile set that produces the most English-valid words is (T, A, E), which can produce the following words: ['AE', 'AT', 'ATE', 'EAT', 'ET', 'ETA', 'TA', 'TAE', 'TEA']

I'd like to be able to scale this up to as big as N = 15. What is the best procedure for doing this?

Answer

I think this is good enough!

Here is a log of my code running under PyPy:

0:00:00.000232
E
0:00:00.001251
ER
0:00:00.048733
EAT
0:00:00.208744
ESAT
0:00:00.087425
ESATL
0:00:00.132049
ESARTP
0:00:00.380296
ESARTOP
0:00:01.409129
ESIARTLP
0:00:03.433526
ESIARNTLP
0:00:10.391252
ESIARNTOLP
0:00:25.651012
ESIARNTOLDP
0:00:56.642405
ESIARNTOLCDP
0:01:57.257293
ESIARNTOLCDUP
0:03:55.933906
ESIARNTOLCDUPM
0:07:17.146036
ESIARNTOLCDUPMG
0:10:14.844347
ESIARNTOLCDUPMGH
0:13:34.722600
ESIARNTOLCDEUPMGH
0:18:14.215019
ESIARNTOLCDEUPMGSH
0:22:47.129284
ESIARNTOLCDEUPMGSHB
0:27:56.859511
ESIARNTOLCDEUPMGSHBYK
0:46:20.448502
ESIARNTOLCDEUPMGSHBYAK
0:57:15.213635
ESIARNTOLCDEUPMGSHIBYAT
1:09:55.530180
ESIARNTOLCDEUPMGSHIBYATF
1:18:35.209599
ESIARNTOLCDEUPMGSHIBYATRF
1:21:54.095119
ESIARNTOLCDEUPMGSHIBYATRFV
1:20:16.978411
ESIARNTOLCDEUPMGSHIBYAOTRFV
1:14:24.253660
ESIARNTOLCDEUPMGSHIBYAONTRFV
1:00:37.405571

The key improvements were these:

  1. I distinguish not just letters, but how many times a given letter has been seen, so each key is one I can either accept or move past. This is an idea I got while commenting on David Eisenstat's solution (see the sketch after this list).
  2. From him I also learned that pruning branches of the search that cannot lead to an answer controls the growth of the problem fairly well.
  3. The very first solution I look at is simply all of the most common letters. This starts out as a very good solution, so despite the search being depth-first, we prune well.
  4. I am careful to consolidate "exhausted tries" into a single record. This reduces the amount of data we have to throw around.
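
For illustration, here is a minimal sketch, separate from the answer's code, of the (letter, occurrence-index) encoding from point 1. The key_count and add_word logic below builds keys in exactly this way:

def word_to_keys(word):
    # Map each letter to (letter, how many times it has already appeared
    # in this word), so a second 'T' is distinct from the first.
    seen = {}
    keys = []
    for letter in word:
        count = seen.get(letter, 0)
        keys.append((letter, count))
        seen[letter] = count + 1
    return keys

print(word_to_keys("LETTER"))
# [('L', 0), ('E', 0), ('T', 0), ('T', 1), ('E', 1), ('R', 0)]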

Here is the code.

import datetime

path = "enable.txt"
words = []
with open(path) as f:
    for values in f:
        words.append(values.strip().upper())

# Count how often each (letter, occurrence-index) key appears across the
# word list; e.g. the second 'E' in a word contributes to the key ('E', 1).
key_count = {}
for word in words:
    seen = {}
    for letter in word:
        if letter not in seen:
            seen[letter] = 0
        key = (letter, seen[letter])
        if key not in key_count:
            key_count[key] = 1
        else:
            key_count[key] += 1
        seen[letter] += 1


# Sort keys from most common to least common.
KEYS = sorted(key_count.keys(), key=lambda key: -key_count[key])
#print(KEYS)
#print(len(KEYS))
KEY_POS = {}
for i in range(len(KEYS)):
    KEY_POS[KEYS[i]] = i

# Now we will build a trie.  Every node has a list of words, and a dictionary
# from the next letter farther in the trie.
# BUT TRICK: we will map each word to a sequence of numbers, and those numbers
# will be indexes into KEYS.  This allows us to use the fact that a second 'e' is
# unlikely, so we can deal with that efficiently.
class Trie:
    def __init__(self, path):
        self.words = []
        self.dict = {}
        self.min_pos = -1
        self.max_pos = -1
        self.count_words = 0
        self.path = path

    def add_word (self, word):
        trie = self

        poses = []
        seen = {}
        for letter in word:
            if letter not in seen:
                seen[letter] = 0
            key = (letter, seen[letter])
            poses.append(KEY_POS[(key)])
            seen[letter] += 1
        sorted_poses = sorted(poses)
        for i in range(len(sorted_poses)):
            trie.count_words += 1
            pos = sorted_poses[i]
            if pos not in trie.dict:
                trie.dict[pos] = Trie(trie.path + KEYS[pos][0])
                if trie.max_pos < pos:
                    trie.max_pos = pos
            trie = trie.dict[pos]
        trie.count_words += 1
        trie.words.append(word)


base_trie = Trie('')
for word in words:
    base_trie.add_word(word)

def best_solution (size):
    def solve (subset, pos, best, partial):
        # partial is a list of (words_found, upper_bound, trie) records.
        found = sum(x[0] for x in partial)
        upper_bound = sum(x[1] for x in partial)
        # Prune if the subset is full, this branch cannot beat the best
        # found so far, or we have run out of keys to consider.
        if size <= len(subset) or upper_bound < best or len(KEYS) <= pos:
            return (found, subset)
        if best < found:
            best = found
        # Figure out our next calculations.
        partial_include = []
        partial_exclude = []
        finalized_found = 0
        for this_found, this_bound, this_trie in partial:
            if this_trie is None:
                # This is a generic record of already emptied tries
                finalized_found += this_found
            elif pos in this_trie.dict:
                include_trie = this_trie.dict[pos]
                partial_include.append((
                    this_found + len(include_trie.words),
                    include_trie.count_words + this_found,
                    include_trie
                ))
                # We included the tally of found words in the previous partial.
                # So do not double-count by including it again
                partial_include.append((
                    0,
                    this_bound - include_trie.count_words - this_found,
                    this_trie
                ))
                partial_exclude.append((
                    this_found,
                    this_bound - include_trie.count_words,
                    this_trie
                ))
            elif this_found == this_bound:
                finalized_found += this_found
            else:
                partial_include.append((
                    this_found,
                    this_bound,
                    this_trie
                ))

                partial_exclude.append((
                    this_found,
                    this_bound,
                    this_trie
                ))
        if 0 < finalized_found:
            partial_include.append(
                (finalized_found, finalized_found, None)
            )
            partial_exclude.append(
                (finalized_found, finalized_found, None)
            )

        found_include, subset_include = solve(subset + [pos], pos+1, best, partial_include)
        if best < found_include:
            best = found_include
        found_exclude, subset_exclude = solve(subset, pos+1, best, partial_exclude)
        if found_include < found_exclude:
            return (found_exclude, subset_exclude)
        else:
            return (found_include, subset_include)


    count, subset = solve([], 0, 0, [(len(base_trie.words), base_trie.count_words, base_trie)])
    return ''.join([KEYS[x][0] for x in subset])

for i in range(20):
    start = datetime.datetime.now()
    print(best_solution(i))
    print(datetime.datetime.now() - start)
