Parallel construction of a distance matrix


Question

I work on hierarchical agglomerative clustering of large numbers of multidimensional vectors, and I noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation of this task is the following (here in Python):

import numpy as np
from scipy.spatial.distance import cosine

def create_dist_matrix(v):
    """v = an array (N, d), where rows are the observations
    and columns the dimensions."""
    N = v.shape[0]
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1):
            # Fill the lower triangle only; cosine distance is symmetric.
            D[i, j] = cosine(v[i, :], v[j, :])
    return D

I was wondering what the best way is to add some parallelism to this routine. An easy way would be to split the outer loop across a number of jobs: for example, with 10 processors, create 10 jobs that each handle a different range of i, then concatenate the results. However, this "horizontal" solution doesn't seem quite right. Are there any other parallel algorithms (or existing libraries) for this task? Any help would be highly appreciated.
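For reference, the "horizontal" split described above could be sketched roughly as follows. This is only an illustration, not code from the original post: the names create_dist_matrix_chunked and cosine_block are hypothetical, the sketch assumes no row of v is all zeros, and it uses threads on the assumption that NumPy's matrix product releases the GIL in the underlying BLAS call. Unlike the naive routine, it returns the full symmetric matrix rather than only the lower triangle.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def cosine_block(v_unit, rows):
    """Cosine distances between the rows in `rows` and every row of v_unit."""
    # For unit-length rows, cosine distance is 1 minus the dot product.
    return 1.0 - v_unit[rows] @ v_unit.T


def create_dist_matrix_chunked(v, n_jobs=4):
    """Hypothetical helper: one contiguous chunk of rows per job, then stack."""
    v = np.asarray(v, dtype=float)
    # Normalise each row once (assumption: no all-zero rows).
    v_unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    # Split the row indices into n_jobs roughly equal ranges of i.
    chunks = np.array_split(np.arange(v.shape[0]), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map preserves chunk order, so the blocks stack back correctly.
        blocks = list(pool.map(lambda rows: cosine_block(v_unit, rows), chunks))
    return np.vstack(blocks)
```

Each job here computes whole rows, so the work per job is equal; splitting the triangular loop of the naive version by ranges of i would instead give later jobs more work than earlier ones.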

Recommended answer

It looks like scikit-learn has a parallel version of pdist called pairwise_distances:

from sklearn.metrics.pairwise import pairwise_distances

D = pairwise_distances(X = v, metric = 'cosine', n_jobs = -1)

where n_jobs = -1 specifies that all CPUs will be used.
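As a quick sanity check, the parallel result can be compared against SciPy's serial pdist on random data. This is a minimal sketch, not part of the original answer; the array shape and the tolerance are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import pairwise_distances

# Illustrative data: 200 observations in 16 dimensions (arbitrary sizes).
rng = np.random.default_rng(0)
v = rng.normal(size=(200, 16))

# Parallel computation of the full (N, N) cosine distance matrix.
D_parallel = pairwise_distances(v, metric="cosine", n_jobs=-1)

# Serial reference: pdist returns the condensed form, squareform expands it.
D_serial = squareform(pdist(v, metric="cosine"))

# The two should match up to floating-point error.
agree = np.allclose(D_parallel, D_serial, atol=1e-8)
```

Note that pairwise_distances returns the full square matrix, whereas pdist returns only the condensed upper triangle.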
