Efficiently build a graph of words with given Hamming distance


Problem description

I want to build a graph from a list of words with a Hamming distance of (say) 1; or, to put it differently, two words are connected if they differ by only one letter (lol -> lot).

So given

words = ['lol', 'lot', 'bot']

the graph would be

{
  'lol' : [ 'lot' ],
  'lot' : [ 'lol', 'bot' ],
  'bot' : [ 'lot' ]
}

The easy way is to compare every word in the list with every other word and count the differing characters; sadly, this is an O(N^2) algorithm.

Which algorithm/data structure/strategy can I use to achieve better performance?

Also, let's assume only Latin characters, and that all the words have the same length.

Answer

Assuming you store your dictionary in a set(), lookup is O(1) on average (O(n) in the worst case).

You can generate all the valid words at Hamming distance 1 from a word:

>>> import string
>>> words = {'lol', 'lot', 'bot'}
>>> def neighbours(word):
...     # Try every letter of the alphabet at every position; keep the
...     # variants that are real words from the dictionary.
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i == j else c for i, c in enumerate(word))
...             if word1 != word and word1 in words: yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}

If M is the length of a word and L the size of the alphabet (i.e. 26), the worst-case time complexity of finding the neighbouring words of every word with this approach is O(L*M*N): for each of the N words, we try L letters at each of its M positions, with an O(1) set lookup per candidate.

The time complexity of the "easy way" approach is O(N^2).

When is this approach better? When L*M < N, i.e. considering only lowercase letters, when M < N/26. (I considered only the worst case here.)

Note: the average length of an English word is 5.1 letters. Thus, you should consider this approach if your dictionary is bigger than about 132 words.
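A quick back-of-the-envelope check of that figure (a minimal sketch; it just multiplies the two constants quoted above):

L = 26    # alphabet size: lowercase Latin letters
M = 5.1   # average length of an English word, per the note above
# the O(L*M*N) approach wins roughly when L*M < N
print(L * M)  # 132.6 -> dictionaries of ~132+ words favour neighbours()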

It is probably possible to achieve better performance than this. However, this approach was really simple to implement.

The "easy way" algorithm (A1):

from itertools import zip_longest

def hammingdist(w1, w2):
    # count the positions at which the two words differ
    return sum(c1 != c2 for c1, c2 in zip_longest(w1, w2))

def graph1(words):
    # compare every word against every other word: O(N^2)
    return {word: [n for n in words if hammingdist(word, n) == 1] for word in words}
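For example, on the word set from the question (since sets are unordered, the order inside each adjacency list may vary between runs):

>>> graph1({'lol', 'lot', 'bot'})
{'lol': ['lot'], 'lot': ['lol', 'bot'], 'bot': ['lot']}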

This algorithm (A2):

def graph2(words): return {word: list(neighbours(word)) for word in words}

Benchmark code:

import random
import string
from timeit import Timer

for dict_size in range(100, 6000, 100):
    # random dictionary of 3-letter lowercase words
    words = set([''.join(random.choice(string.ascii_lowercase) for x in range(3)) for _ in range(dict_size)])
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))

Output:

100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...

I ran another benchmark with smaller steps of N to see the crossover point more closely:

10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...

You can see that the crossover point is very low (about 100 for dictionaries of words with length=3). For small dictionaries the O(N^2) algorithm performs slightly better, but it is easily beaten by the O(L*M*N) algorithm as N grows.

For dictionaries with longer words, the O(L*M*N) algorithm remains linear in N, it just has a different slope, so the crossover point moves slightly to the right (130 for length=5).
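The answer leaves open whether still better performance is achievable. One well-known technique for this exact problem (not part of the original answer; a sketch under the same assumptions of equal-length lowercase words) is to bucket words by wildcard patterns, which builds the whole graph in roughly O(N*M) expected time:

from collections import defaultdict

def graph3(words):
    # Two equal-length words are at Hamming distance 1 iff they share a
    # pattern formed by replacing exactly one position with a wildcard.
    buckets = defaultdict(list)
    for word in words:
        for j in range(len(word)):
            buckets[word[:j] + '*' + word[j+1:]].append(word)
    graph = {word: [] for word in words}
    for bucket in buckets.values():
        for w1 in bucket:
            for w2 in bucket:
                if w1 != w2:
                    graph[w1].append(w2)
    return graph

Each distance-1 pair shares exactly one pattern, so no edge is emitted twice; the pair enumeration can degrade towards O(N^2) only if many words fall into the same bucket.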

