How to do n-D distance and nearest neighbor calculations on numpy arrays


Question

This question is intended as a canonical duplicate target.

Given two arrays X and Y of shapes (i, n) and (j, n), representing lists of n-dimensional coordinates,

    import numpy as np

    def test_data(n, i, j, r=100):
        X = np.random.rand(i, n) * r - r / 2
        Y = np.random.rand(j, n) * r - r / 2
        return X, Y

    X, Y = test_data(3, 1000, 1000)

What are the fastest ways to find:

  1. The distance D with shape (i, j) between every point in X and every point in Y
  2. The indices k_i and distances k_d of the k nearest neighbors against all points in X for every point in Y
  3. The indices r_i, r_j and distances r_d of every point in X within distance r of every point j in Y

Given the following sets of restrictions:

  • Only using numpy
  • Using any python package

Including the special case:

  • Y is X

In all cases distance primarily means Euclidean distance, but feel free to highlight methods that allow other distance calculations.

Answer

1. All distances

  • Only using numpy

The naive method is:

    D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
    

However, this takes up a lot of memory, creating an (i, j, n)-shaped intermediate array, and is very slow.

However, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose as such: (X - Y)**2 = X**2 - 2*X*Y + Y**2

    D = np.sqrt(                                 #  (X - Y) ** 2
        np.einsum('ij, ij ->i', X, X)[:, None]   # =  X ** 2
        + np.einsum('ij, ij ->i', Y, Y)          #  + Y ** 2
        - 2 * X.dot(Y.T))                        #  - 2 * X * Y
    

  • Y is X

Similar to above:

    XX = np.einsum('ij, ij ->i', X, X)
    D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))
      

Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are zero, you'll need to explicitly set them:

    np.einsum('ii->i', D)[:] = 0
      

  • Any package

scipy.spatial.distance.cdist is the most intuitive builtin function for this, and far faster than bare numpy:

    from scipy.spatial.distance import cdist
    D = cdist(X, Y)
          

cdist can also deal with many, many distance measures as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.
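As a small illustrative sketch (the metric names below are from scipy's built-in set; the tiny arrays are just for demonstration), the metric keyword selects among the built-in measures or accepts an arbitrary callable:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.zeros((2, 3))
Y = np.ones((2, 3))

D_euclid = cdist(X, Y)                    # default: Euclidean, each entry sqrt(3)
D_city = cdist(X, Y, metric='cityblock')  # Manhattan / L1 distance, each entry 3
D_cheb = cdist(X, Y, metric='chebyshev')  # L-infinity distance, each entry 1

# A user-defined metric is any callable taking two 1-D coordinate arrays
D_custom = cdist(X, Y, metric=lambda a, b: np.abs(a - b).max())
```

The callable form is flexible but runs in a Python loop, which is why the built-in string metrics are much faster.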

  • Y is X

For self-referring distances, scipy.spatial.distance.pdist works similar to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by only having each term once. You can convert this to a square matrix using squareform:

    from scipy.spatial.distance import pdist, squareform
    D_cond = pdist(X)
    D = squareform(D_cond)
          

2. K Nearest Neighbors (KNN)

  • Only using numpy

We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have:

    if k == 1:
        k_i = D.argmin(0)
    else:
        k_i = D.argpartition(k, axis=0)[:k]
    k_d = np.take_along_axis(D, k_i, axis=0)
              

However, we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don't want to do that until the end.

    D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
           np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
    if k == 1:
        k_i = D_sq.argmin(0)
    else:
        k_i = D_sq.argpartition(k, axis=0)[:k]
    k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis=0))
              

Now, np.argpartition performs an indirect partition and doesn't necessarily give us the elements in sorted order; it only makes sure that the first k elements are the smallest ones. So, for a sorted output, we need to use argsort on the output from the previous step:

    sorted_idx = k_d.argsort(axis=0)
    k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis=0)
    k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis=0)
              

If you only need k_i, you never need the square root at all:

    D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
           np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
    if k == 1:
        k_i = D_sq.argmin(0)
    else:
        k_i = D_sq.argpartition(k, axis=0)[:k]
    k_d_sq = np.take_along_axis(D_sq, k_i, axis=0)
    sorted_idx = k_d_sq.argsort(axis=0)
    k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis=0)
              

  • Y is X

In the above code, replace:

    D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +\
           np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
                

with:

    XX = np.einsum('ij, ij ->i', X, X)
    D_sq = XX[:, None] + XX - 2 * X.dot(X.T)
                

  • Any package

A KD-Tree is a much faster method to find neighbors and constrained distances. Be aware that while a KDTree is usually much faster than the brute-force solutions above for 3d (as long as you have more than about 8 points), if you have n dimensions, a KDTree only scales well if you have more than 2**n points. For discussion of and more advanced methods for high dimensions, see here.

The most recommended method for implementing a KDTree is to use scipy's scipy.spatial.KDTree or scipy.spatial.cKDTree:

    from scipy.spatial import KDTree
    X_tree = KDTree(X)
    k_d, k_i = X_tree.query(Y, k=k)
                    

Unfortunately scipy's KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree increases the performance a lot, but is not as common an include as scipy and can currently only deal with Euclidean distance (while the KDTree in scipy can handle Minkowski p-norms of any order).

  • Y is X

Use instead:

    k_d, k_i = X_tree.query(X, k=k)
                    

  • Arbitrary metrics

A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using scikit-learn we can still have reasonable KNN queries for user-defined metrics. If available, builtin metrics will be much faster.

    import sklearn.neighbors

    def d(a, b):
        return max(np.abs(a - b))

    tree = sklearn.neighbors.BallTree(X, metric=d)
    k_d, k_i = tree.query(Y)
                        

This answer will be wrong if d() is not a metric. The only reason a BallTree is faster than brute force is that the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.

3. All points within distance r

  • Only using numpy

The simplest method is just to use boolean indexing, with D_sq as computed above:

    mask = D_sq < r**2
    r_i, r_j = np.where(mask)
    r_d = np.sqrt(D_sq[mask])
                        

  • Any package

Similar to above, you can use scipy.spatial.KDTree.query_ball_point:

    r_ij = X_tree.query_ball_point(Y, r=r)
                            

or scipy.spatial.KDTree.query_ball_tree:

    Y_tree = KDTree(Y)
    r_ij = X_tree.query_ball_tree(Y_tree, r=r)
                            

Unfortunately r_ij ends up being a list of index arrays that are a bit difficult to untangle for later use.

Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix:

    from scipy.spatial import cKDTree
    X_cTree = cKDTree(X)
    Y_cTree = cKDTree(Y)
    D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r, output_type='coo_matrix')
    r_i = D_coo.row
    r_j = D_coo.col
    r_d = D_coo.data
                            

This is an extraordinarily flexible format for the distance matrix, as it stays an actual matrix and (if converted to csr) can also be used for many vectorized operations.
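For instance (a small sketch assuming the coo output from above; the array sizes and reduction chosen are illustrative), converting to csr enables row-wise reductions such as counting each X point's neighbors in Y:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X = rng.random((200, 3))
Y = rng.random((150, 3))
r = 0.2

D_coo = cKDTree(X).sparse_distance_matrix(cKDTree(Y), r, output_type='coo_matrix')
D_csr = D_coo.tocsr()

# Vectorized per-row operations without ever forming the dense (200, 150) matrix
n_neighbors = np.diff(D_csr.indptr)  # number of Y neighbors per point of X
mean_d = np.asarray(D_csr.sum(axis=1)).ravel() / np.maximum(n_neighbors, 1)
```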

