How to do n-D distance and nearest neighbor calculations on numpy arrays
Question

This question is intended to be a canonical duplicate target.
Given two arrays X and Y of shapes (i, n) and (j, n), representing lists of n-dimensional coordinates,
def test_data(n, i, j, r = 100):
    X = np.random.rand(i, n) * r - r / 2
    Y = np.random.rand(j, n) * r - r / 2
    return X, Y

X, Y = test_data(3, 1000, 1000)
what are the fastest ways to find:

- The distance D with shape (i, j) between every point in X and every point in Y
- The indices k_i and distances k_d of the k nearest neighbors against all points in X, for every point in Y
- The indices r_i, r_j and distances r_d of every point in X within distance r of every point j in Y
given the following sets of restrictions:

- Only using numpy
- Using any python package

Including the special case:

- Y is X

In all cases distance primarily means Euclidean distance, but feel free to highlight methods that allow other distance calculations.
Answer
1. All Distances

- Only using numpy

The simplest way is direct broadcasting:
D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
However this takes up a lot of memory, creating an (i, j, n)-shaped intermediate matrix, and is very slow.

However, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose as such: (X - Y)**2 = X**2 - 2*X*Y + Y**2
D = np.sqrt(                                  # (X - Y) ** 2
    np.einsum('ij, ij ->i', X, X)[:, None] +  # = X ** 2
    np.einsum('ij, ij ->i', Y, Y) -           # + Y ** 2
    2 * X.dot(Y.T))                           # - 2 * X * Y
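As a quick sanity check (a small sketch with random data, not part of the original answer), the decomposition can be verified against the direct broadcasting formula; tiny negative round-off is clipped before the square root:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3)) * 100 - 50
Y = rng.random((7, 3)) * 100 - 50

# Direct (memory-hungry) broadcasting version
D_direct = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1))

# einsum decomposition; clip tiny negative round-off before the sqrt
D_sq = (np.einsum('ij, ij ->i', X, X)[:, None]
        + np.einsum('ij, ij ->i', Y, Y)
        - 2 * X.dot(Y.T))
D_einsum = np.sqrt(np.maximum(D_sq, 0))

print(np.allclose(D_direct, D_einsum))
```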
Y is X

Similar to the above:
XX = np.einsum('ij, ij ->i', X, X)
D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))
Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are zero, you'll need to explicitly set it:
np.einsum('ii->i', D)[:] = 0
- Any package

scipy.spatial.distance.cdist is the most intuitive builtin function for this, and far faster than bare numpy:
from scipy.spatial.distance import cdist
D = cdist(X, Y)
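A minimal sketch (small random data, assuming scipy is installed) confirming that cdist produces the same matrix as the plain numpy broadcasting approach:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.random((4, 3))
Y = rng.random((6, 3))

# cdist computes the same (i, j) Euclidean distance matrix as the
# broadcasting approach, but in compiled C
D_numpy = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1))
D_scipy = cdist(X, Y)

print(D_scipy.shape)
print(np.allclose(D_numpy, D_scipy))
```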
cdist can also deal with many, many distance measures as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.
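For example (a small sketch, not from the original answer): a builtin metric selected by name versus an equivalent user-defined callable, which gives the same result but without the C-level optimization:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
X = rng.random((4, 2))
Y = rng.random((5, 2))

# Built-in Manhattan distance, optimized in C
D_city = cdist(X, Y, metric='cityblock')

# Same metric as a user-defined callable (flexible, not optimized)
D_custom = cdist(X, Y, metric=lambda a, b: np.abs(a - b).sum())

print(np.allclose(D_city, D_custom))
```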
Y is X

For self-referring distances, scipy.spatial.distance.pdist works similar to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by only having each term once. You can convert this to a square matrix using squareform:
from scipy.spatial.distance import pdist, squareform
D_cond = pdist(X)
D = squareform(D_cond)
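A quick sketch of the round trip: the condensed form holds one entry per pair, and squareform recovers the full symmetric matrix that cdist(X, X) would give:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, cdist

rng = np.random.default_rng(3)
X = rng.random((6, 3))

D_cond = pdist(X)       # condensed form: 6 * 5 / 2 = 15 entries
D = squareform(D_cond)  # full symmetric matrix with a zero diagonal

print(D_cond.shape)
print(np.allclose(D, cdist(X, X)))
```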
2. K Nearest Neighbors (KNN)

- Only using numpy

We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have:
if k == 1:
    k_i = D.argmin(axis = 0)[None]  # keep 2-D so take_along_axis works
else:
    k_i = D.argpartition(k, axis = 0)[:k]
k_d = np.take_along_axis(D, k_i, axis = 0)
However we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don't want to do that until the end.
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(axis = 0)[None]  # keep 2-D so take_along_axis works
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))
Now, np.argpartition performs an indirect partition and doesn't necessarily give us the elements in sorted order; it only makes sure that the first k elements are the smallest ones. So, for a sorted output, we need to use argsort on the output from the previous step:
sorted_idx = k_d.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)
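To see that the partition-then-sort route agrees with a full (slower) sort, a small sketch on random data:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(4)
X = rng.random((50, 3))
Y = rng.random((20, 3))
k = 4

D = cdist(X, Y)  # shape (50, 20): rows index X, columns index Y

# Fast: partition down to k candidates, then sort only those
k_i = D.argpartition(k, axis=0)[:k]
k_d = np.take_along_axis(D, k_i, axis=0)
order = k_d.argsort(axis=0)
k_i_sorted = np.take_along_axis(k_i, order, axis=0)

# Reference: sort every column in full
ref = D.argsort(axis=0)[:k]

print(np.array_equal(k_i_sorted, ref))
```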
If you only need k_i, you never need the square root at all:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(axis = 0)[None]  # keep 2-D so take_along_axis works
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
sorted_idx = k_d_sq.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
X is Y

In the above code, replace:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
with:
XX = np.einsum('ij, ij ->i', X, X)
D_sq = XX[:, None] + XX - 2 * X.dot(X.T)
- Any package

A KD-Tree is a much faster method to find neighbors and constrained distances. Be aware that while a KDTree is usually much faster than the brute-force solutions above in 3-D (as long as you have more than about 8 points), if you have n dimensions a KDTree only scales well if you have more than 2**n points. For discussion of, and more advanced methods for, high dimensions, see here.
The most recommended method for implementing a KDTree is to use scipy's scipy.spatial.KDTree or scipy.spatial.cKDTree:
from scipy.spatial import KDTree
X_tree = KDTree(X)
k_d, k_i = X_tree.query(Y, k = k)
Unfortunately scipy's KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree increases the performance a lot, but is not as common an include as scipy, and currently can only deal with Euclidean distance (while the KDTree in scipy can handle Minkowski p-norms of any order).
X is Y

Use instead:
k_d, k_i = X_tree.query(X, k = k)
- Arbitrary metrics

A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using sklearn we can still have reasonable KNN queries for user-defined metrics. If available, builtin metrics will be much faster.
import sklearn.neighbors

def d(a, b):
    return max(np.abs(a - b))

tree = sklearn.neighbors.BallTree(X, metric=d)
k_d, k_i = tree.query(Y)
This answer will be wrong if d() is not a metric. The only reason a BallTree is faster than brute force is that the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.
3. All points within distance r

- Only using numpy

The simplest method is just to use boolean indexing:
mask = D_sq < r**2
r_i, r_j = np.where(mask)
r_d = np.sqrt(D_sq[mask])
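An end-to-end sketch of the radius search on small random data, reusing the squared-distance decomposition from section 1 (the clipping of tiny negative round-off is an added safeguard):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((30, 3)) * 10
Y = rng.random((40, 3)) * 10
r = 3.0

# Squared distances; the sqrt is deferred until after the filter
D_sq = (np.einsum('ij, ij ->i', X, X)[:, None]
        + np.einsum('ij, ij ->i', Y, Y)
        - 2 * X.dot(Y.T))

mask = D_sq < r**2
r_i, r_j = np.where(mask)                 # indices into X and Y
r_d = np.sqrt(np.maximum(D_sq[mask], 0))  # clip tiny negative round-off

print((r_d < r).all())
```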
- Any package

Similar to the above, you can use scipy.spatial.KDTree.query_ball_point:
r_ij = X_tree.query_ball_point(Y, r = r)
or scipy.spatial.KDTree.query_ball_tree:
Y_tree = KDTree(Y)
r_ij = X_tree.query_ball_tree(Y_tree, r = r)
Unfortunately r_ij ends up being a list of index arrays that are a bit difficult to untangle for later use.
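One way to untangle it (a sketch, not from the original answer) is to flatten the list-of-lists into parallel index/distance arrays with np.repeat and np.concatenate:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(7)
X = rng.random((30, 2))
Y = rng.random((20, 2))
r = 0.25

X_tree = KDTree(X)
r_ij = X_tree.query_ball_point(Y, r)  # one list of X-indices per Y point

# Flatten the list-of-lists into parallel index/distance arrays
lengths = [len(idx) for idx in r_ij]
r_j = np.repeat(np.arange(len(Y)), lengths)
r_i = np.concatenate([np.asarray(idx, dtype=np.intp) for idx in r_ij])
r_d = np.sqrt(((X[r_i] - Y[r_j]) ** 2).sum(axis=1))

print((r_d <= r + 1e-12).all())
```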
Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix:
from scipy.spatial import cKDTree
X_cTree = cKDTree(X)
Y_cTree = cKDTree(Y)
D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r, output_type = 'coo_matrix')
r_i = D_coo.row
r_j = D_coo.col
r_d = D_coo.data
This is an extraordinarily flexible format for the distance matrix, as it stays an actual matrix and (if converted to csr) can also be used for many vectorized operations.
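For instance (a small sketch), converting to csr gives each X point's neighbor count within r straight from the row pointer:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(8)
X = rng.random((30, 2))
Y = rng.random((25, 2))
r = 0.3

D_coo = cKDTree(X).sparse_distance_matrix(cKDTree(Y), r,
                                          output_type='coo_matrix')

# As CSR, ordinary sparse-matrix machinery applies, e.g. counting
# how many Y-neighbors each X point has within r
D_csr = D_coo.tocsr()
counts = np.diff(D_csr.indptr)

print(D_csr.shape)
print(counts.sum() == D_coo.nnz)
```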