Looking for a fast way to compute pair-wise distances of many strings

Problem description

I have a list of ~1 million unique 16-character strings (an array called VEC) and I want to calculate the minimum pair-wise Hamming distance for each one in Python (an array called RES). Basically, I'm calculating the full pair-wise distance matrix one row at a time but only storing the minimum value in RES for each row.

VEC= ['AAAAAAAAAAAAAAAA','AAAAAAAAAAAAAAAT','AAAAGAAAAAATAAAA'...]

so that dist(VEC[1],VEC[2])=1, dist(VEC[1],VEC[3])=2 etc... and RES[1]=1. Using tips and tricks from these pages I came up with:

#METHOD#1:
import Levenshtein
import numpy
RES=99*numpy.ones(len(VEC))
i=0
for a in VEC:
    dist=numpy.array([Levenshtein.hamming(a,b) for b in VEC] ) #array of distances
    RES[i]=numpy.amin(dist[dist>0])  #pick min distance greater than zero
    i+=1

A shortened VEC of only 10,000 took about 70 sec, but if I extrapolate that to the full million it will take 8 days. My approach seems wasteful since I'm recalculating the symmetric parts of the distance matrix, so I tried to calculate half of the matrix while updating RES for each row as I went along:

#METHOD #2:
import Levenshtein
import numpy
RES=99*numpy.ones(len(VEC))
for i in range(len(VEC)-1):
    dist=[Levenshtein.hamming(VEC[i],VEC[j]) for j in range(i+1, len(VEC))]
    RES[i]=min(numpy.amin(dist),RES[i])
    #update RES as you go along:
    k=0
    for j in range(i+1,len(VEC)):
        if dist[k]<RES[j]:
            RES[j]=dist[k]
        k+=1

Probably not surprisingly, this 2nd approach takes almost twice as long (117 sec) so it isn't very good. Regardless, can anyone recommend improvements/changes to make this faster?

Answer

If you only need the nearest neighbor for each bit array (ignoring itself), and you could get away with a tiny chance of only getting an approximate nearest neighbor, you might consider implementing the "Bit Sampling" Locality Sensitive Hash for Hamming distance. In a nutshell, create three hash tables. From each 128-bit input, sample 16 bits, 3 times, using those 16-bit samples as keys. The values of your hash tables should be a list of all 128-bit inputs that had that sampled key. Once you place all million of your inputs into the LSH index, simply:

  • Iterate over your million points
  • For each input, perform the 3 samplings described above
  • Find the nearest neighbor among the three resulting lists (with distance > 0) and keep the best one (a minimal sketch follows this list)
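
Here is a minimal sketch of that idea, not code from the original answer: the names build_lsh_index and query_nearest are illustrative, each 16-character string is packed into a 128-bit integer so bit positions can be sampled, and the candidate comparisons reuse Levenshtein.hamming from the question.

import random
from collections import defaultdict
import Levenshtein

NUM_TABLES = 3      # three hash tables, as suggested above
BITS_PER_KEY = 16   # sample 16 of the 128 bits for each table's key

def to_bits(s):
    # pack a 16-character ASCII string into a 128-bit integer
    return int.from_bytes(s.encode('ascii'), 'big')

def build_lsh_index(strings, seed=0):
    rng = random.Random(seed)
    # for each table, fix which 16 bit positions form the key
    samples = [rng.sample(range(128), BITS_PER_KEY) for _ in range(NUM_TABLES)]
    tables = [defaultdict(list) for _ in range(NUM_TABLES)]
    for s in strings:
        bits = to_bits(s)
        for table, positions in zip(tables, samples):
            key = tuple((bits >> p) & 1 for p in positions)
            table[key].append(s)
    return tables, samples

def query_nearest(s, tables, samples):
    # approximate nearest neighbor of s, ignoring s itself; may return
    # (None, None) if no other string shares a sampled key with s
    bits = to_bits(s)
    best, best_dist = None, None
    for table, positions in zip(tables, samples):
        key = tuple((bits >> p) & 1 for p in positions)
        for candidate in table[key]:
            if candidate == s:
                continue
            d = Levenshtein.hamming(s, candidate)
            if best_dist is None or d < best_dist:
                best, best_dist = candidate, d
    return best, best_dist

# usage with VEC/RES as in the question:
# tables, samples = build_lsh_index(VEC)
# RES = [query_nearest(s, tables, samples)[1] for s in VEC]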

Both loading and testing are ludicrously quick. I might recommend the excellent bitarray library for underpinning this.
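
For example (a sketch under my own assumptions, not code from the answer), bitarray can pack each 16-character string into 128 bits and compute a bit-level Hamming distance with an XOR plus a popcount; note this counts differing bits rather than differing characters.

from bitarray import bitarray

def string_to_bitarray(s):
    # pack a 16-character ASCII string into a 128-bit bitarray
    b = bitarray()
    b.frombytes(s.encode('ascii'))
    return b

a = string_to_bitarray('AAAAAAAAAAAAAAAA')
b = string_to_bitarray('AAAAAAAAAAAAAAAT')
print((a ^ b).count())  # bit-level Hamming distance between the two encodings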
