Help Optimizing Word Search


Problem Description


Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However, a wildcard letter '?' is also an
acceptable character, which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba, where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz.

I'm using a trie structure to load and query my dictionary, which
returns 1 if the word is found, 4 if a partial word is found, and 3
if there is no possible word.

I guess I'm wondering if anyone could think of a faster way to deal
with the wildcards, perhaps removing the inner for-loop or else
something entirely different (iterative, don't use the trie?).

Any other suggestions would be welcome.

findword recursion runs in 9.16300010681 secs.
words.log is simply my list of words:
['hats', 'easts', 'baizes', ..., 'sash']

--- Code Begin ---
import time

import trie  # the poster's own trie module, loaded from dicts.txt below

print('loading trie ...', end=' ')
mytrie = trie.Trie('dicts.txt')
print('done.')

alpha = list('abcdefghijklmnopqrstuvwxyz')

words = {}

def findword(word, letters):
    if word != "":
        # Check if the word so far is valid
        result = mytrie.query(word)

        # Found a word, add it to the database
        if result == 1:
            words[word] = 1

        # No other word starts with my current word, recursion stop point
        if result == 3:
            return

    # Ran out of letters, recursion stop point (checked after the query
    # so that words using every available letter are still recorded)
    if not letters:
        return

    # Run through every letter in our list
    for i, letter in enumerate(letters):
        # Remove the current letter and recurse
        newletters = letters[:]
        newletters.pop(i)

        # If the current letter is '?', must recurse through all 26 letters
        if letter == '?':
            for c in alpha:
                findword(word + c, newletters)
        else:
            findword(word + letter, newletters)

letters = list('abae?s?')
s = time.time()
findword("", letters)
print(time.time() - s)

output = open('words.log', 'w')
print(list(words.keys()), file=output)
output.close()
--- Code End ---
Thanks
(Hopefully google doesn't eat my whitespace)

Answers

Case Nelson wrote:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz







This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway:

You appear to want to find all words that have one or more letters in
common with your query or candidate string.

Aside: Have you been following the long thread started by the poster
who appeared to want to store all possible strings that were _not_
words in a given language but could be generated from its alphabet?

Here's a possibility: use a bit mask approach. You attach a bit mask to
each word; simple data structure -- a list of 2-tuples, or two parallel
lists.

def mask(word):
    m = 0
    for letter in word:
        m |= 1 << (ord(letter) - ord('a'))
    return m

Searching without wildcards:

def nowc_search(candidate, mylistof2tuples):
    candmask = mask(candidate)  # treating candidate as str, not list
    for word, wordmask in mylistof2tuples:  # avoid shadowing mask()
        if wordmask & candmask:
            # one or more letters in common
            yield word

Note: this treats "mississippi" and "misp" the same. If "aa" is in your
dictionary, what queries would retrieve it? Depending on your exact
requirements, this technique may suit you, or you may want to use it as
a fast(?) filter, with the matches it throws up needing further
checking. You may need a "count number of bits that are set in an int"
function.
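A minimal sketch of such a bit-count helper, using Kernighan's trick (on Python 3.10+ you could simply use `int.bit_count()`); `mask` is restated here so the snippet stands alone:

```python
def mask(word):
    # same bitmask as above: bit i is set iff chr(ord('a') + i) occurs in word
    m = 0
    for letter in word:
        m |= 1 << (ord(letter) - ord('a'))
    return m

def popcount(n):
    # count set bits: each n &= n - 1 clears the lowest 1-bit
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count

# "mississippi" and "misp" produce identical masks: 4 distinct letters each
print(popcount(mask("mississippi")))  # → 4
```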

Ref: Fred J. Damerau, "A technique for computer detection and
correction of spelling errors", CACM vol 7 number 3, March 1964.

Searching with wild cards: your example of query == "??" seems to yield
all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :-)

An alternative for you to think about: instead of a bitmask, store the
letter-sorted transformation of the words: cat->act, act->act,
dog->dgo, god->dgo.

Alternative data structure: key = bitmask or sorted-letters, value =
list of all words that have that key.
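A minimal sketch of the sorted-letters variant combined with that key-to-word-list structure, using a hypothetical four-word dictionary:

```python
from collections import defaultdict

def signature(word):
    # canonical key: all anagrams sort to the same string (cat -> act, god -> dgo)
    return ''.join(sorted(word))

by_signature = defaultdict(list)
for w in ['cat', 'act', 'dog', 'god']:
    by_signature[signature(w)].append(w)

# any permutation of a query's letters hits the same bucket
print(by_signature[signature('tac')])  # → ['cat', 'act']
print(by_signature[signature('odg')])  # → ['dog', 'god']
```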

A further suggestion which should always be considered when setting up
a search where the timing is worse than average O(1): have a separate
dictionary for each different wordlength, or some other
impossible-length-avoidance filter; that way, with minimum preliminary
calculation you can avoid considering words that are so long or so
short that they cannot possibly be matches. For example, with
approximate matching based on edit distance, if you are searching for a
10-letter word allowing for 2 errors, you can avoid doing the
complicated comparison on words shorter than 8 or longer than 12.
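A sketch of that per-wordlength filter, assuming a toy word list and an edit-distance budget passed as `max_errors`:

```python
from collections import defaultdict

WORDS = ['a', 'ba', 'cab', 'cabs', 'cable', 'baizes']  # hypothetical list

by_length = defaultdict(list)
for w in WORDS:
    by_length[len(w)].append(w)

def plausible(query_len, max_errors):
    # only word lengths within max_errors of the query length can match,
    # so every other bucket is skipped without any comparison work
    out = []
    for n in range(query_len - max_errors, query_len + max_errors + 1):
        out.extend(by_length.get(n, []))
    return out

# for a 4-letter query allowing 2 errors, only lengths 2..6 survive
print(plausible(4, 2))
```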
HTH,
John


"Case Nelson" <ca*********@gmail.com> writes:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.







For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.
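A rough sketch of the big-string idea, assuming a toy dictionary and a fixed-order query "ca?" (fixed letters stay literal, each '?' becomes a character class; permutations of the letters would each need their own pattern):

```python
import re

# hypothetical small dictionary; the real one would hold ~80k words
dictionary = ['hats', 'easts', 'baizes', 'sash', 'cab']
big = ' '.join(dictionary)

# query 'ca?' -> regexp \bca[a-z]\b; the \b anchors keep matches
# aligned to whole space-separated words inside the big string
pattern = re.compile(r'\bca[a-z]\b')
print(pattern.findall(big))  # → ['cab']
```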





Paul Rubin wrote:
"Case Nelson" <ca*********@gmail.com> writes:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time
significantly.
For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.







Paul, given the OP appears to want something like words that match any
(per)mutation of any substring of his query string -- and that's before
factoring in wildcards -- I'd like to see an example of a regexp that
could handle that.
Cheers,
John

