Efficiently build a graph of words with given Hamming distance


Problem description

I want to build a graph from a list of words with a Hamming distance of (say) 1; or, to put it differently, two words are connected if they differ by only one letter (lol -> lot).

So given

words = ['lol', 'lot', 'bot']

the graph would be

{
  'lol' : [ 'lot' ],
  'lot' : [ 'lol', 'bot' ],
  'bot' : [ 'lot' ]
}

The easy way is to compare every word in the list with every other word and count the differing characters; sadly, this is an O(N^2) algorithm.

Which algorithm/data structure/strategy can I use to achieve better performance?

Also, let's assume only Latin characters, and that all the words have the same length.

Answer

Assuming you store your dictionary in a set(), lookup is O(1) on average (O(n) in the worst case).

You can generate all the valid words at Hamming distance 1 from a word:

>>> import string
>>> words = {'lol', 'lot', 'bot'}
>>> def neighbours(word):
...     # Try every letter of the alphabet at every position; keep the
...     # variants that are real words from the dictionary.
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i == j else c for i, c in enumerate(word))
...             if word1 != word and word1 in words: yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}

If M is the length of a word and L the size of the alphabet (i.e. 26), the worst-case time complexity of finding the neighbouring words of every word with this approach is O(L*M*N): for each of the N words, we try L letters at each of its M positions, with an O(1) set lookup per candidate.

The time complexity of the "easy way" approach is O(N^2).

When is this approach better? When L*M < N, i.e. considering only lowercase letters, when M < N/26. (I considered only the worst case here.)

Note: the average length of an English word is 5.1 letters. Thus, you should consider this approach if your dictionary is bigger than about 132 words.
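A quick back-of-the-envelope check of that figure (a minimal sketch; it just multiplies the two constants quoted above):

L = 26    # alphabet size: lowercase Latin letters
M = 5.1   # average length of an English word, per the note above
# the O(L*M*N) approach wins roughly when L*M < N
print(L * M)  # 132.6 -> dictionaries of ~132+ words favour neighbours()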

It is probably possible to achieve better performance than this. However, this approach was really simple to implement.

The "easy way" algorithm (A1):

from itertools import zip_longest

def hammingdist(w1, w2):
    # count the positions at which the two words differ
    return sum(c1 != c2 for c1, c2 in zip_longest(w1, w2))

def graph1(words):
    # compare every word against every other word: O(N^2)
    return {word: [n for n in words if hammingdist(word, n) == 1] for word in words}
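For example, on the word set from the question (since sets are unordered, the order inside each adjacency list may vary between runs):

>>> graph1({'lol', 'lot', 'bot'})
{'lol': ['lot'], 'lot': ['lol', 'bot'], 'bot': ['lot']}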

This algorithm (A2):

def graph2(words): return {word: list(neighbours(word)) for word in words}

Benchmark code:

import random
import string
from timeit import Timer

for dict_size in range(100, 6000, 100):
    # random dictionary of 3-letter lowercase words
    words = set([''.join(random.choice(string.ascii_lowercase) for x in range(3)) for _ in range(dict_size)])
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))

Output:

100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...

I ran another benchmark with smaller steps of N to see the crossover point more closely:

10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...

You can see that the crossover point is very low (about 100 for dictionaries of words with length=3). For small dictionaries the O(N^2) algorithm performs slightly better, but it is easily beaten by the O(L*M*N) algorithm as N grows.

For dictionaries with longer words, the O(L*M*N) algorithm remains linear in N, it just has a different slope, so the crossover point moves slightly to the right (130 for length=5).
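The answer leaves open whether still better performance is achievable. One well-known technique for this exact problem (not part of the original answer; a sketch under the same assumptions of equal-length lowercase words) is to bucket words by wildcard patterns, which builds the whole graph in roughly O(N*M) expected time:

from collections import defaultdict

def graph3(words):
    # Two equal-length words are at Hamming distance 1 iff they share a
    # pattern formed by replacing exactly one position with a wildcard.
    buckets = defaultdict(list)
    for word in words:
        for j in range(len(word)):
            buckets[word[:j] + '*' + word[j+1:]].append(word)
    graph = {word: [] for word in words}
    for bucket in buckets.values():
        for w1 in bucket:
            for w2 in bucket:
                if w1 != w2:
                    graph[w1].append(w2)
    return graph

Each distance-1 pair shares exactly one pattern, so no edge is emitted twice; the pair enumeration can degrade towards O(N^2) only if many words fall into the same bucket.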

