How can I optimize this Python code to generate all words with word-distance 1?


Problem description

Profiling shows this is the slowest segment of my code for a little word game I wrote:

def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]
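To illustrate with made-up words (my example, not from the original post):

>>> distance("cat", "cot")
1
>>> getchildren("cat", ["cot", "dog", "car"])
['cot', 'car']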

Notes:

  • distance() is called over 5 million times, majority of which is from getchildren, which is supposed to get all words in the wordlist that differ from word by exactly 1 letter.
  • wordlist is pre-filtered to only have words containing the same number of letters as word so it's guaranteed that word1 and word2 have the same number of chars.
  • I'm fairly new to Python (started learning it 3 days ago) so comments on naming conventions or other style things also appreciated.
  • for wordlist, take the 12dict word list using the "2+2lemma.txt" file

Results:

Thanks everyone! With a combination of the different suggestions, the program now runs twice as fast (on top of the optimizations I did on my own before asking, so roughly a 4x speedup over my initial implementation).

I tested with two sets of inputs, which I'll call A and B.

Optimization 1: Changed the loop in distance() from iterating over indices:

for i in range(len(word1)):
    if word1[i] != word2[i]:
        difference += 1
return difference

to iterating over letter pairs using zip(word1, word2):

for x, y in zip(word1, word2):
    if x != y:
        difference += 1
return difference

Got execution time from 11.92 to 9.18 for input A, and 79.30 to 74.59 for input B

Optimization 2: Added a separate method for "differs by exactly one letter" in addition to the distance method (which I still needed elsewhere for the A* heuristic); it can bail out as soon as a second difference is found:

def is_neighbors(word1, word2):
    different = False
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different
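To illustrate the behavior with made-up words (my examples): it returns False for identical words, since no difference is ever found:

>>> is_neighbors("cat", "cot")
True
>>> is_neighbors("cat", "dog")
False
>>> is_neighbors("cat", "cat")
False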

Got execution time from 9.18 to 8.83 for input A, and 74.59 to 70.14 for input B

Optimization 3: The big winner here was using izip instead of zip.
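The change is one line in distance() (this izip version also appears in the full program below); izip, from itertools, pairs up the letters lazily instead of building an intermediate list:

from itertools import izip

def distance(word1, word2):
    difference = 0
    for x, y in izip(word1, word2):
        if x != y:
            difference += 1
    return difference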

Got execution time from 8.83 to 5.02 for input A, and 70.14 to 41.69 for input B

I could probably do better writing it in a lower level language, but I'm happy with this for now. Thanks everyone!

Edit again: More results. Using Mark's method of special-casing a first-letter mismatch (if the first characters differ, the words are neighbors exactly when the remaining suffixes are identical) got it down from 5.02 -> 3.59 and 41.69 -> 29.82.

Building on that and incorporating izip instead of range, I ended up with this:

def is_neighbors(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x, y in izip(word1[1:], word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different

This squeezed out a little bit more, bringing the times down from 3.59 -> 3.38 and 29.82 -> 27.88.

More results!

Trying Sumudu's suggestion that I generate a list of all strings that are one letter off from "word" and then check which ones are in the wordlist, instead of using the is_neighbors function, I ended up with this:

import string

def one_letter_off_strings(word):
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend(word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i])
    return dif_list
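As a sanity check (my example), a word of length n yields 25*n candidate strings:

>>> len(one_letter_off_strings("cat"))
75
>>> one_letter_off_strings("at")[:3]
['bt', 'ct', 'dt']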

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return ( w for w in oneoff if w in wordlist )

This ended up being slower (3.38 -> 3.74 and 27.88 -> 34.40), but it seemed promising. At first I thought the part I'd need to optimize was one_letter_off_strings, but profiling showed otherwise; the slow part was in fact:

( w for w in oneoff if w in wordlist )

I was wondering whether it would make any difference if I switched "oneoff" and "wordlist" and did the comparison the other way around, when it hit me that I was looking for the intersection of the two lists. Each membership test against a plain list scans the whole list, so I replaced that line with a set intersection:

return set(oneoff) & set(wordlist)

Bam! 3.74 -> 0.23 and 34.40 -> 2.25

This is truly amazing. Total speed difference from my original naive implementation: 23.79 -> 0.23 and 180.07 -> 2.25, so roughly 80 to 100 times faster than the original implementation. (Sets are hash-based, so each membership check during the intersection is average O(1) instead of a linear scan.)

If anyone is interested, I made a blog post describing the program and the optimizations made, including one that isn't mentioned here (because it's in a different section of code).

The big debate:

OK, Unknown and I are having a big debate, which you can read in the comments of his answer. He claims it would be faster to use the original method (is_neighbors instead of the sets) if it were ported to C. I spent 2 hours trying to get a C module I wrote to build and link, without much success, after trying to follow this and this example; it looks like the process is a little different on Windows? I don't know, but I gave up on that. Anyway, here's the full code of the program; the text file comes from the 12dict word list, using the "2+2lemma.txt" file. Sorry if the code's a little messy, this was just something I hacked together. Also, I forgot to strip out commas from the wordlist, so that's actually a bug; you can leave it in for the sake of the same comparison, or fix it by adding a comma to the list of chars in cleanentries.

import string
import Queue
from itertools import izip
def unique(seq):  
    seen = {} 
    result = [] 
    for item in seq: 
        if item in seen:
            continue 
        seen[item] = 1 
        result.append(item) 
    return result
def cleanentries(li):
    return unique([w.strip('[]') for w in li if w != "->"])
def distance(word1, word2):
    difference = 0
    for x, y in izip(word1, word2):
        if x != y:
            difference += 1
    return difference
def is_neighbors(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x, y in izip(word1[1:], word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different
def one_letter_off_strings(word):
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend(word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i])
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)
def AStar(start, goal, wordlist):
    closedset = []
    openset = [start]
    pqueue = Queue.PriorityQueue(0)
    g_score = {start:0}         #Distance from start along optimal path.
    h_score = {start:distance(start, goal)}
    f_score = {start:h_score[start]}
    pqueue.put((f_score[start], start))
    parent_dict = {}
    while len(openset) > 0:
        x = pqueue.get(False)[1]
        if x == goal:
            return reconstruct_path(parent_dict,goal)
        openset.remove(x)
        closedset.append(x)
        sortedOpen = [(f_score[w], w, g_score[w], h_score[w]) for w in openset]  # computed but never used
        sortedOpen.sort()
        for y in getchildren(x, wordlist):
            if y in closedset:
                continue
            temp_g_score = g_score[x] + 1
            temp_is_better = False
            appended = False
            if y not in openset:
                openset.append(y)
                appended = True
                h_score[y] = distance(y, goal)
                temp_is_better = True
            elif temp_g_score < g_score[y]:
                temp_is_better = True
            if temp_is_better:
                parent_dict[y] = x
                g_score[y] = temp_g_score
                f_score[y] = g_score[y] + h_score[y]
                if appended :
                    pqueue.put((f_score[y], y))
    return None


def reconstruct_path(parent_dict, node):
    if node in parent_dict:
        p = reconstruct_path(parent_dict, parent_dict[node])
        p.append(node)
        return p
    else:
        return []

wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
    userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
    words = [w.lower() for w in userentry.split()]
    if len(words) == 2 and len(words[0]) == len(words[1]):
        break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer is not None:
    print "Minimum number of steps is %s" % (len(answer))
    reply = raw_input("Would you like the answer(y/n)? ")
    if reply.lower() == "y":
        answer.insert(0, words[0])
        print "\n".join(answer)
    else:
        print "Good luck!"
else:
    print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")

I left the is_neighbors method in even though it's not used. This is the method that is proposed to be ported to C. To use it, just replace getchildren with this:

def getchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w))

As for getting it to work as a C module, I didn't get that far, but this is what I came up with:

#include "Python.h"

static PyObject *
py_is_neighbor(PyObject *self, PyObject *args)
{
    int length;
    const char *word1, *word2;
    /* "ss#" parses two strings plus the length of the second one;
       the words are guaranteed to be the same length. */
    if (!PyArg_ParseTuple(args, "ss#", &word1, &word2, &length))
        return NULL;

    int i;
    int different = 0;
    for (i = 0; i < length; i++)
    {
        if (*(word1 + i) != *(word2 + i))
        {
            if (different)
            {
                /* second difference found: the words are not neighbors */
                return Py_BuildValue("i", 0);
            }
            different = 1;
        }
    }
    return Py_BuildValue("i", different);
}

PyMethodDef methods[] = {
    {"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initisneighbor(void)
{
    Py_InitModule("isneighbor", methods);
}
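For what it's worth, a C extension like this is usually built with a small distutils script rather than linked by hand. Here is a minimal sketch (my addition, untested against the code above; the isneighbor.c filename is an assumption about where the source is saved):

# setup.py (Python 2, distutils)
from distutils.core import setup, Extension

setup(name="isneighbor",
      version="1.0",
      ext_modules=[Extension("isneighbor", ["isneighbor.c"])])

Building in place with "python setup.py build_ext --inplace" should produce a module usable as "import isneighbor".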

I profiled this using:

python -m cProfile"Wordgame.py"

python -m cProfile "Wordgame.py"

The time recorded was the total time of the AStar method call. The fast input set was "verse poets" and the long input set was "poets verse". Timings will obviously vary between machines, so if anyone does end up trying this, please give a comparison of the results for the program as-is, as well as with the C module.

Answer

If your wordlist is very long, might it be more efficient to generate all possible 1-letter differences from 'word', and then check which ones are in the list? I don't know any Python, but there should be a suitable data structure for the wordlist that allows log-time lookups.

I suggest this because if your words are of reasonable length (~10 letters), then you'll only be looking at 250 potential words (25 alternative letters x ~10 positions), which is probably faster if your wordlist is larger than a few hundred words.
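In Python, the natural fit is a set, whose membership tests are average O(1), even better than the log-time asked for above. A minimal sketch of the suggestion (my code, not the answerer's; it is essentially what the question author arrived at, with wordlist assumed to be the pre-filtered list from the question):

import string

def candidates(word):
    # every string that differs from word in exactly one position
    for i in xrange(len(word)):
        for c in string.ascii_lowercase:
            if c != word[i]:
                yield word[:i] + c + word[i+1:]

def getchildren(word, wordset):
    # build wordset once with: wordset = set(wordlist)
    return [w for w in candidates(word) if w in wordset]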

