Fastest pairwise distance metric in Python


Question

I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) that does this with broadcasting, but it's inefficient because it calculates each distance twice, and it doesn't scale well.

Here's an example that gives me what I want with an array of 1000 numbers.

import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])

What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values?

Note: the matrix is symmetric, so I'm guessing it's possible to get at least a 2x speedup by exploiting that. I just don't know how.

Recommended answer

Neither of the other answers quite answered the question: one was in Cython, and the other was slower. Both provided very useful hints, though, and following up on them suggests that scipy.spatial.distance.pdist is the way to go.

Here's some code:

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

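r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]  # column vector of shape (1000, 1); pdist and sklearn expect 2D input

def option1(r):
    # Broadcasting: builds the full symmetric (n, n) matrix, so each distance is computed twice
    return np.abs(r - r[:, None])

def option2(c):
    # Condensed result: only the n*(n-1)/2 distances above the diagonal
    return scipy.spatial.distance.pdist(c, 'cityblock')

def option3(c):
    # Full (n, n) matrix computed by scikit-learn
    return sklearn.metrics.pairwise.manhattan_distances(c)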
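One thing to be aware of is that pdist does not return the same shape as the broadcasting version: it returns a condensed 1D vector holding only the distances above the diagonal, which is where the saving from symmetry comes from. The following is a minimal sketch, assuming the same kind of random integer data as above (the seed and variable names are illustrative), showing how the two forms relate and how scipy.spatial.distance.squareform recovers the full matrix if it is really needed.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative data: 1000 random integers, similar to the example above
rng = np.random.default_rng(0)
r = rng.integers(1, 1000, size=1000)

# Full symmetric matrix via broadcasting, shape (1000, 1000)
full = np.abs(r - r[:, None])

# Condensed distances from pdist, length n * (n - 1) / 2
condensed = pdist(r[:, None], 'cityblock')

# The condensed vector lists the upper triangle row by row (diagonal excluded)
iu = np.triu_indices(len(r), k=1)
same_upper = np.array_equal(full[iu], condensed)

# squareform expands the condensed vector back into the full symmetric matrix
square = squareform(condensed)
same_full = np.array_equal(square, full)

print(condensed.shape, same_upper, same_full)
```

For 1D data the cityblock and Euclidean distances coincide, since both reduce to the absolute difference, so either metric name should give the same values here.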

Timing with IPython:

In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop

In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop

In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
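The timings above come from IPython's timeit magic. As a rough sketch of how a similar comparison could be run from a plain script, the standard-library timeit module can be used instead. The helper best_ms and the function names below are hypothetical, chosen for illustration, and the numbers will differ by machine and library version.

```python
import timeit

import numpy as np
from scipy.spatial.distance import pdist

# Illustrative data: 1000 random integers, as in the example above
rng = np.random.default_rng(0)
r = rng.integers(1, 1000, size=1000)
c = r[:, None]  # 2D column vector, as pdist requires

def broadcast(r):
    # Full (n, n) matrix via broadcasting
    return np.abs(r - r[:, None])

def condensed(c):
    # Condensed upper-triangle distances via pdist
    return pdist(c, 'cityblock')

def best_ms(func, arg, number=20, repeat=3):
    # Best average time per call, in milliseconds, over `repeat` runs
    times = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(times) / number * 1e3

timings = {
    func.__name__: best_ms(func, arg)
    for func, arg in [(broadcast, r), (condensed, c)]
}
for name, ms in timings.items():
    print(f"{name}: {ms:.2f} ms per loop")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that IPython reports.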

I didn't try the Cython implementation (I can't use it for this project). Comparing my results with the other answer that did, scipy.spatial.distance.pdist looks to be roughly a third slower than the Cython implementation, using the np.abs solution as a common benchmark to account for the different machines.
