计算 N 个样本和簇质心之间的平方欧几里德距离的最有效方法是什么? [英] What is the most efficient way to compute the square euclidean distance between N samples and clusters centroids?

查看:27
本文介绍了计算 N 个样本和簇质心之间的平方欧几里德距离的最有效方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一种有效的方法(没有 for 循环)来计算一组样本和一组簇质心之间的欧几里德距离.

I am looking for an efficient way (no for loops) to compute the euclidean distance between a set of samples and a set of clusters centroids.

示例:

import numpy as np
X = np.array([[1,2,3],[1, 1, 1],[0, 2, 0]])
y = np.array([[1,2,3], [0, 1, 0]])

预期输出:

array([[ 0., 11.],
       [ 5.,  2.],
       [10.,  1.]])

这是 X 中每个样本到 y 中每个质心之间的平方欧几里德距离.

This is the squared euclidean distance between each sample in X to each centroid in y.

我想出了两个解决方案:

I came up with 2 solutions:

解决方案 1:

def dist_2(X,y):
    X_square_sum = np.sum(np.square(X), axis = 1)
    y_square_sum = np.sum(np.square(y), axis = 1)
    dot_xy = np.dot(X, y.T)
    X_square_sum_tile = np.tile(X_square_sum.reshape(-1, 1), (1, y.shape[0]))
    y_square_sum_tile = np.tile(y_square_sum.reshape(1, -1), (X.shape[0], 1))
    dist = X_square_sum_tile + y_square_sum_tile - (2 * dot_xy)
    return dist

dist = dist_2(X, y)

解决方案 2:

import scipy
dist = scipy.spatial.distance.cdist(X,y)**2

两种解决方案的性能(挂钟时间)

import time
X = np.random.random((100000, 50))
y = np.random.random((100, 50))

start = time.time()
dist = scipy.spatial.distance.cdist(X,y)**2
end = time.time()
print (end - start)

平均经过的挂钟时间 = 0.7 秒

Average elapsed wall-clock time = 0.7 sec

start = time.time()
dist = dist_2(X,y)
end = time.time()
print (end - start)

平均经过的挂钟时间 = 0.3 秒

Average elapsed wall-clock time = 0.3 sec

在大量质心上进行测试

X = np.random.random((100000, 50))
y = np.random.random((1000, 50))

解决方案 1"的平均挂钟时间 = 50 秒(+ 内存问题)

Average elapsed wall-clock time of "solution 1" = 50 sec (+ Memory issue)

解决方案 2"的平均挂钟时间 = 6 秒!!!

Average elapsed wall-clock time of "solution 2" = 6 sec !!!

结论

似乎解决方案 1 比解决方案 2"在平均经过的挂钟时间(在小数据集上)更有效,但在内存方面效率低下.

It seems that "solution 1 is more efficient than "solution 2" with respect to the average elapsed wall-clock time (on small data-sets) but inefficient with respect to memory.

有什么建议吗?

推荐答案

这个问题经常与近邻搜索结合使用.如果是这种情况,请查看 kdtree 方法.无论是在内存消耗还是性能方面,这都比计算欧几里得距离更有效.

This question is frequently asked in combination with nereast-neighbour search. If this is the case have a look at a kdtree approach. This will be far more efficient than calculating euclidian distances, both in memory consumption and performance.

如果不是这种情况,这里有一些可能的方法.前两个来自Divakar 的回答.第三个使用 Numba 进行 jit 编译.与前两个版本的主要区别在于避免使用临时数组.

If this is not the case, here are some possible approaches. The first two are from an answer of Divakar. The third one uses Numba for jit-compilation. The main difference to the first two versions is he avoidance of temporary arrays.

计算欧几里得距离的三种方法

import numpy as np
import numba as nb

# @Paul Panzer
#https://stackoverflow.com/a/42994680/4045774
def outer_sum_dot_app(A,B):
    return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)

# @Divakar
#https://stackoverflow.com/a/42994680/4045774
def outer_einsum_dot_app(A,B):
    return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)

@nb.njit()
def calc_dist(A,B,sqrt=False):
  dist=np.dot(A,B.T)

  TMP_A=np.empty(A.shape[0],dtype=A.dtype)
  for i in range(A.shape[0]):
    sum=0.
    for j in range(A.shape[1]):
      sum+=A[i,j]**2
    TMP_A[i]=sum

  TMP_B=np.empty(B.shape[0],dtype=A.dtype)
  for i in range(B.shape[0]):
    sum=0.
    for j in range(B.shape[1]):
      sum+=B[i,j]**2
    TMP_B[i]=sum

  if sqrt==True:
    for i in range(A.shape[0]):
      for j in range(B.shape[0]):
        dist[i,j]=np.sqrt(-2.*dist[i,j]+TMP_A[i]+TMP_B[j])
  else:
    for i in range(A.shape[0]):
      for j in range(B.shape[0]):
        dist[i,j]=-2.*dist[i,j]+TMP_A[i]+TMP_B[j]
  return dist

时间

A = np.random.randn(10000,3)
B = np.random.randn(10000,3)

#calc_dist:                      360ms first call excluded due to compilation overhead
#outer_einsum_dot_app (Divakar): 1150ms
#outer_sum_dot_app (Paul Panzer):1590ms
#dist_2:                         1840ms

A = np.random.randn(1000,100)
B = np.random.randn(1000,100)

#calc_dist:                      4.3  ms first call excluded due to compilation overhead
#outer_einsum_dot_app (Divakar): 13.12ms
#outer_sum_dot_app (Paul Panzer):13.2 ms
#dist_2:                         21.3ms

这篇关于计算 N 个样本和簇质心之间的平方欧几里德距离的最有效方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆