How to do n-D distance and nearest neighbor calculations on numpy arrays


Problem description

This question is intended to be a canonical duplicate target.

Given two arrays X and Y of shapes (i, n) and (j, n), representing lists of n-dimensional coordinates,

import numpy as np

def test_data(n, i, j, r = 100):
    X = np.random.rand(i, n) * r - r / 2
    Y = np.random.rand(j, n) * r - r / 2
    return X, Y

X, Y = test_data(3, 1000, 1000)

what are the fastest ways to find:

  1. The distance D with shape (i, j) between every point in X and every point in Y
  2. The indices k_i and distances k_d of the k nearest neighbors against all points in X for every point in Y
  3. The indices r_i, r_j and distances r_d of every point in X within distance r of every point j in Y

Given the following sets of restrictions:

  • Using only numpy
  • Using any python package

Including the special case:

  • Y is X

In all cases distance primarily means Euclidean distance, but feel free to highlight methods that allow other distance calculations.

Solution

1. All distances

  • Using only numpy
  • The simplest way is:

    D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
    

    However, this takes up a lot of memory, creating an (i, j, n)-shaped intermediate matrix, and is very slow.

    However, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose as such: (X - Y)**2 = X**2 - 2*X*Y + Y**2

    D = np.sqrt(                                  #  (X - Y) ** 2
        np.einsum('ij, ij ->i', X, X)[:, None] +  # = X ** 2
        np.einsum('ij, ij ->i', Y, Y) -           # + Y ** 2
        2 * X.dot(Y.T))                           # - 2 * X * Y
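
As a sanity check, the decomposition agrees with the naive broadcast formula up to floating-point error (small array sizes assumed here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3))
Y = rng.random((4, 3))

# Naive version: builds a full (i, j, n) intermediate array
D_naive = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1))

# Decomposed version: only (i,), (j,) and (i, j) arrays are created
D_fast = np.sqrt(
    np.einsum('ij, ij ->i', X, X)[:, None] +
    np.einsum('ij, ij ->i', Y, Y) -
    2 * X.dot(Y.T))

print(np.allclose(D_naive, D_fast))  # True
```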
    

    • Y is X
    • Similar to the above:

      XX = np.einsum('ij, ij ->i', X, X)
      D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))
      

      Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are zero, you'll need to explicitly set it:

      np.einsum('ii->i', D)[:] = 0 
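
This works because np.einsum('ii->i', D) returns a writable view of the diagonal. A minimal sketch (the np.abs is an extra guard, not in the snippet above, against tiny negative values reaching the square root):

```python
import numpy as np

X = np.random.rand(10, 3)
XX = np.einsum('ij, ij ->i', X, X)
# np.abs guards against slightly negative values from floating-point error
D = np.sqrt(np.abs(XX[:, None] + XX - 2 * X.dot(X.T)))
np.einsum('ii->i', D)[:] = 0  # assign through the writable diagonal view
print(np.diag(D).any())  # False: the diagonal is exactly zero
```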
      

      • Using any package
      • scipy.spatial.distance.cdist is the most intuitive builtin function for this, and far faster than bare numpy:

        from scipy.spatial.distance import cdist
        D = cdist(X, Y)
        

        cdist can also deal with many, many distance measures as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.

        • Y is X

        For self-referring distances, scipy.spatial.distance.pdist works similarly to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by only storing each term once. You can convert this to a square matrix using squareform:

        from scipy.spatial.distance import pdist, squareform
        D_cond = pdist(X)
        D = squareform(D_cond)
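
For example, with i points the condensed array has i*(i-1)/2 entries, and squareform restores the full symmetric matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(6, 3)
D_cond = pdist(X)       # 6 * 5 / 2 = 15 pairwise distances
D = squareform(D_cond)  # full (6, 6) symmetric matrix, zero diagonal
print(D_cond.shape, D.shape)  # (15,) (6, 6)
```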
        

        2. K Nearest Neighbors (KNN)

        • Using only numpy
        • We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have:

          if k == 1:
              k_i = D.argmin(0)
          else:
              k_i = D.argpartition(k, axis = 0)[:k]
          k_d = np.take_along_axis(D, k_i, axis = 0)
          

          However we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don't want to do that until the end.

          D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
                 np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
          if k == 1:
              k_i = D_sq.argmin(0)
          else:
              k_i = D_sq.argpartition(k, axis = 0)[:k]
          k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))
          

          Now, np.argpartition performs an indirect partition; it doesn't necessarily give us the elements in sorted order, only making sure that the first k elements are the smallest ones. So, for a sorted output, we need to use argsort on the output from the previous step:

          sorted_idx = k_d.argsort(axis = 0)
          k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
          k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)
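
As a cross-check, the partition-then-sort route returns the same k nearest indices as a full (but slower) argsort of the whole distance matrix (random data assumed, so ties are not an issue):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 3))
Y = rng.random((40, 3))
k = 4

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)

# Partition down to the k smallest, then sort only those k rows
k_i = D_sq.argpartition(k, axis=0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis=0))
sorted_idx = k_d.argsort(axis=0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis=0)

# A full sort of every column gives the same answer, more slowly
print(np.array_equal(k_i_sorted, D_sq.argsort(axis=0)[:k]))  # True
```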
          

          If you only need k_i, you never need the square root at all:

          D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
                 np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
          if k == 1:
              k_i = D_sq.argmin(0)
          else:
              k_i = D_sq.argpartition(k, axis = 0)[:k]
          k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
          sorted_idx = k_d_sq.argsort(axis = 0)
          k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
          

          • X is Y
          • In the above code, replace:

            D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
                   np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
            

            with:

            XX = np.einsum('ij, ij ->i', X, X)
            D_sq = XX[:, None] + XX - 2 * X.dot(X.T)
            

            • Using any package
            • A KD-Tree is a much faster method to find neighbors and constrained distances. Be aware that while a KDTree is usually much faster than the brute force solutions above in 3d (as long as you have more than 8 points), if you have n dimensions it only scales well if you have more than 2**n points. For discussion of high dimensions and more advanced methods, see here.

              The most recommended method for implementing KDTree is to use scipy's scipy.spatial.KDTree or scipy.spatial.cKDTree:

              from scipy.spatial import KDTree
              X_tree = KDTree(X)
              k_d, k_i = X_tree.query(Y, k = k)
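
A quick brute-force cross-check of the tree query (small sizes assumed for illustration):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(2)
X = rng.random((100, 3))
Y = rng.random((10, 3))
k = 3

X_tree = KDTree(X)
k_d, k_i = X_tree.query(Y, k=k)  # both have shape (10, 3)

# Brute force: full distance matrix, k smallest per row of Y
D = np.sqrt(((Y[:, None, :] - X[None, :, :])**2).sum(-1))
print(np.array_equal(k_i, D.argsort(axis=1)[:, :k]))  # True
print(np.allclose(k_d, np.sort(D, axis=1)[:, :k]))    # True
```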
              

              Unfortunately scipy's KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree increases the performance a lot, but is not as common an include as scipy, and can currently only deal with Euclidean distance (while the KDTree in scipy can handle Minkowski p-norms of any order).

              • X is Y

              Use instead:

              k_d, k_i = X_tree.query(X, k = k)
              

              • Arbitrary metrics
              • A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using scipy we can still have reasonable KNN queries for user-defined metrics. If available, builtin metrics will be much faster.

                import numpy as np
                from sklearn.neighbors import BallTree

                def d(a, b):
                    return max(np.abs(a - b))  # Chebyshev distance, a valid metric

                tree = BallTree(X, metric=d)
                k_d, k_i = tree.query(Y, k = k)
                

                This answer will be wrong if d() is not a metric. The only reason a BallTree is faster than brute force is because the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.

                3. All points within distance r

                • Using only numpy

                The simplest method is just to use boolean indexing (reusing D_sq from above):

                mask = D_sq < r**2
                r_i, r_j = np.where(mask)
                r_d = np.sqrt(D_sq[mask])
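
Putting it together with the squared distances from section 2 (the same D_sq formula, restated here so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((20, 3))
Y = rng.random((30, 3))
r = 0.5

D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)

mask = D_sq < r**2         # compare squared values: no sqrt on the full matrix
r_i, r_j = np.where(mask)  # X-indices and Y-indices of close pairs
r_d = np.sqrt(D_sq[mask])  # sqrt only on the surviving entries
print((r_d < r).all())     # True
```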
                

                • Using any package
                • Similar to the above, you can use scipy.spatial.KDTree.query_ball_point

                  r_ij = X_tree.query_ball_point(Y, r = r)
                  

                  or scipy.spatial.KDTree.query_ball_tree:

                  Y_tree = KDTree(Y)
                  r_ij = X_tree.query_ball_tree(Y_tree, r = r)
                  

                  Unfortunately r_ij ends up being a list of index arrays that are a bit difficult to untangle for later use.
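
If you do need flat parallel index arrays from that list-of-lists output, one way to untangle it (a sketch, assuming the variable names above) is:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(5)
X = rng.random((20, 2))
Y = rng.random((25, 2))

X_tree = KDTree(X)
r_ij = X_tree.query_ball_tree(KDTree(Y), r=0.4)  # list of 20 index lists

# Repeat each X-index once per neighbor, then concatenate the Y-indices
r_i = np.repeat(np.arange(len(r_ij)), [len(js) for js in r_ij])
r_j = np.concatenate([np.asarray(js, dtype=int) for js in r_ij])
print(r_i.shape == r_j.shape)  # True
```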

                  Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix:

                  from scipy.spatial import cKDTree
                  X_cTree = cKDTree(X)
                  Y_cTree = cKDTree(Y)
                  D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r, output_type = 'coo_matrix')
                  r_i = D_coo.row
                  r_j = D_coo.col
                  r_d = D_coo.data
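
For instance, converting the coo output to csr gives vectorized per-row operations, such as counting how many Y-neighbors each point of X has within r (a sketch with assumed sizes):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
X = rng.random((50, 3))
Y = rng.random((60, 3))
r = 0.3

D_coo = cKDTree(X).sparse_distance_matrix(cKDTree(Y), r,
                                          output_type='coo_matrix')
# csr's indptr gives the number of stored entries per row of X
counts = np.diff(D_coo.tocsr().indptr)
print(counts.shape, counts.sum() == D_coo.nnz)  # (50,) True
```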
                  

                  This is an extraordinarily flexible format for the distance matrix: it stays an actual matrix and, if converted to csr, can also be used for many vectorized operations.

