欧氏距离的高效精确计算 [英] Efficient and precise calculation of the euclidean distance

查看:70
本文介绍了欧氏距离的高效精确计算的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

根据一些在线研究(12, numpyscipy, scikit, 数学),我找到了几种计算Python 中的欧几里得距离的方法:

Following some online research (1, 2, numpy, scipy, scikit, math), I have found several ways for calculating the Euclidean Distance in Python:

# 1
numpy.linalg.norm(a-b)

# 2
distance.euclidean(vector1, vector2)

# 3
sklearn.metrics.pairwise.euclidean_distances  

# 4
sqrt((xa-xb)^2 + (ya-yb)^2 + (za-zb)^2)

# 5
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))

# 6
math.hypot(x, y)

我想知道是否有人可以提供关于上述哪些(或我没有发现的任何其他)被认为在效率精度.如果有人知道讨论该主题的任何资源,那也会很棒.

I was wondering if someone could provide an insight on which of the above (or any other that I have not found) is considered the best in terms of efficiency and precision. If someone is aware of any resource(s) which discusses the subject that would also be great.

我感兴趣的上下文是计算数字元组对之间的欧几里得距离,例如(52, 106, 35, 12)(33, 153, 75, 10) 之间的距离.

The context I am interesting in is in calculating the Euclidean Distance between pairs of number-tuples, e.g. the distance between (52, 106, 35, 12) and (33, 153, 75, 10).

推荐答案

结论第一:

从使用timeit进行效率测试的结果来看,关于效率:

Conclusion first:

From the test result by using timeit for efficiency test, we can conclude that regarding the efficiency:

Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm) > Method2 (scipy.spatial.distance) > Method3 (sklearn.metrics.pairwise.euclidean_distances)

Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm) > Method2 (scipy.spatial.distance) > Method3 (sklearn.metrics.pairwise.euclidean_distances )

虽然我没有真正测试您的 Method4,因为它不适用于一般情况,并且通常等同于 Method5.

While I didn't really test your Method4 as it is not suitable for general cases and it is generally equivalent to Method5.

其余的,令人惊讶的是,Method5 是最快的.而对于使用 numpyMethod1,正如我们预期的那样,它在 C 中进行了大量优化,是第二快的.

For the rest, quite surprisingly, Method5 is the fastest one. While for Method1 which uses numpy, as what we expected, which is heavily optimized in C, is the second fastest.

对于scipy.spatial.distance,如果直接去函数定义,你会看到它实际上是在使用numpy.linalg.norm,除了它会在实际 numpy.linalg.norm 之前对两个输入向量执行验证.这就是为什么它比 numpy.linalg.norm 稍慢.

For scipy.spatial.distance, if you go directly to the function definition, you will see that it is actually using numpy.linalg.norm, except it will perform the validation on the two input vectors before the actual numpy.linalg.norm. That's why it is slightly slower thant numpy.linalg.norm.

最后是 sklearn,根据文档:

与其他计算距离的方法相比,此公式有两个优点.首先,它在处理稀疏数据时计算效率高.其次,如果一个参数发生变化而另一个参数保持不变,则可以预先计算 dot(x, x) 和/或 dot(y, y).然而,这并不是最精确的计算方式,而且这个函数返回的距离矩阵可能不是完全对称的

This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed. However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required

由于在您的问题中您想使用一组固定的数据,因此没有体现这种实现的优势.并且由于性能和精度之间的权衡,它也给出了所有方法中最差的精度.

Since in your question you would like to use a fixed set of data, the advantage of this implementation is not reflected. And due to the trade off between the performance and precision, it also gives the worst precision among all of the methods.

关于精度Method5=Metho1=Method2>Method3

Regarding the precision, Method5=Metho1=Method2>Method3

import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances
import math

# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1-v2)

# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)

# 3
def eudis3(v1, v2):
    return euclidean_distances(v1, v2)

# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist

dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)

import timeit

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)
t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)

print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))

效率测试输出:

t1:  0.654838958307
t2:  1.53977598714
t3:  6.7898791732
t5:  0.422228400305

精密测试脚本&结果:

In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238

In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238

In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])

In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238

这篇关于欧氏距离的高效精确计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆