Find all shortest Euclidean distances between two groups of point coordinates

Question

I have a Pandas DataFrame in which columns X1, Y1 hold the point coordinates of the first group and columns X2, Y2 hold the point coordinates of the second group. The two groups are independent of each other; they just happen to be in the same DataFrame. Example:

X1,Y1,X2,Y2
41246.438,0.49,38791.673,0.49
41304.5,0.491,38921.557,0.491
41392.062,0.492,39037.135,0.492
41515.5,0.493,39199.972,0.493
41636.062,0.494,39346.561,0.494
41795.188,0.495,39477.63,0.495
42027.75,0.496,39576.275,0.496
42252.25,0.497,39732.102,0.497
42486.812,0.498,39833.753,0.498
42739.062,0.499,39949.13,0.499
43012.125,0.5,40135.42,0.5
43472.75,0.5,40292.017,0.5
43909.562,0.501,40479.452,0.501
44312.625,0.502,40725.329,0.502
44799.938,0.503,40950.05,0.503
45294.938,0.504,41214.136,0.504
45729.625,0.505,41514.213,0.505
45942.438,0.506,41943.208,0.506
46067.688,0.507,42296.643,0.507
46215,0.508,42653.477,0.508
46336.75,0.509,43138.834,0.509
46476.562,0.51,43557.815,0.51
46584.25,0.511,43966.564,0.511
46654.75,0.512,44166.996,0.512
46707.75,0.513,44310.557,0.513
46774.188,0.514,44410.069,0.514
46832.062,0.515,44518.045,0.515
46905.062,0.516,44608.646,0.516
46976.562,0.517,44678.073,0.517
47077.938,0.518,44727.393,0.518
47215.688,0.519,44786.498,0.519
47290.625,0.52,44845.867,0.52
47351.5,0.521,44915.072,0.521

For each point in columns X1, Y1, I need to find the point in columns X2, Y2 for which the Euclidean distance between the two points is the shortest.

As the outcome, I need to place the point found in columns X2, Y2 in the same row as the corresponding point in X1, Y1. I also need to add the computed shortest Euclidean distance to that row in another column, D. This process is then repeated for each point in columns X1, Y1.

One way to do this is to iterate over the rows of columns X1, Y1 and, for each row, find the shortest Euclidean distance among the points in columns X2, Y2. There may be better ways to do it without writing for loops.
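For reference, the row-by-row approach described above could be sketched roughly as follows. This is only a baseline for comparison, not a recommended solution; the function name `nearest_by_loop` and the small `demo` frame are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd


def nearest_by_loop(df: pd.DataFrame) -> pd.DataFrame:
    """For each (X1, Y1) point, find the closest (X2, Y2) point with an explicit loop."""
    p1 = df[["X1", "Y1"]].to_numpy(dtype=float)
    p2 = df[["X2", "Y2"]].to_numpy(dtype=float)
    rows = []
    for x1, y1 in p1:
        # Euclidean distance from this point to every point of the second group
        d = np.hypot(p2[:, 0] - x1, p2[:, 1] - y1)
        j = int(d.argmin())  # position of the closest second-group point
        rows.append((x1, y1, p2[j, 0], p2[j, 1], d[j]))
    return pd.DataFrame(rows, columns=["X1", "Y1", "X2", "Y2", "D"])


# Tiny made-up frame (hypothetical data, not the table from the question)
demo = pd.DataFrame({"X1": [0.0, 10.0], "Y1": [0.0, 0.0],
                     "X2": [9.0, 1.0], "Y2": [0.0, 0.0]})
loop_result = nearest_by_loop(demo)
print(loop_result)
```

The loop still does n × n distance evaluations in total, so it mainly serves as a correctness check for faster approaches.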

Recommended answer

Solution


Use Faiss.

pip install faiss-cpu

Instead of IndexFlatL2, which performs an exact search, you can use the slightly faster IndexIVFFlat, which returns approximate results.

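import faiss
import numpy as np
import pandas as pd

def get_closest(df: pd.DataFrame) -> pd.DataFrame:
    d = 2  # dimensionality of the points

    # Faiss expects C-contiguous float32 arrays
    xb = np.ascontiguousarray(df[["X2", "Y2"]].to_numpy(dtype=np.float32))  # points to search in
    xq = np.ascontiguousarray(df[["X1", "Y1"]].to_numpy(dtype=np.float32))  # query points

    index = faiss.IndexFlatL2(d)  # build the index (exact L2 search)
    index.add(xb)                 # add vectors to the index

    D, I = index.search(xq, 1)    # nearest neighbour of every query point

    res_df = df[["X1", "Y1"]].copy()  # explicit copy, independent of df
    nearest = df[["X2", "Y2"]].to_numpy()[I[:, 0]]  # positional lookup, no index alignment
    res_df["X2"] = nearest[:, 0]
    res_df["Y2"] = nearest[:, 1]
    res_df["distance"] = np.sqrt(D[:, 0])  # IndexFlatL2 returns squared L2 distances
    return res_df

get_closest(df)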

Performance


For 1e4 (x, y) pairs in both sets, running time:

371 ms ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 vectors:

33.9 s ± 3.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This should be similar to generating the full distance matrix with SciPy or NumPy, but it is much more efficient in terms of memory usage and does not require a further search over that matrix.

  1. In the function above, res_df is derived from a column selection of df. Taking an explicit .copy() of that selection keeps res_df independent of df and avoids pandas' SettingWithCopyWarning, at the cost of some extra memory.
  2. If you need more than one neighbour for every point, that is easy to achieve with Faiss with minimal modifications.

Alternatives


Using KDTree

import pandas as pd
from scipy.spatial import KDTree

def get_closest(df: pd.DataFrame) -> pd.DataFrame:
    # Build the tree on the points being searched (X2, Y2) and query it with (X1, Y1)
    tree = KDTree(df[["X2", "Y2"]].to_numpy())
    dist, ind = tree.query(df[["X1", "Y1"]].to_numpy(), k=1)  # k = desired number of neighbours
    res_df = df[["X1", "Y1"]].copy()
    nearest = df[["X2", "Y2"]].to_numpy()[ind]  # positional lookup of the matched points
    res_df["X2"] = nearest[:, 0]
    res_df["Y2"] = nearest[:, 1]
    res_df["distance"] = dist
    return res_df

get_closest(df)
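
The `k` argument of `tree.query` controls how many neighbours are returned per query point. As a minimal sketch of asking for more than one neighbour, assuming small made-up arrays (`targets` and `queries` are hypothetical names standing in for the (X2, Y2) and (X1, Y1) columns):

```python
import numpy as np
from scipy.spatial import KDTree

# Made-up points, not the data from the question
targets = np.array([[0.0, 0.0], [3.0, 0.0], [10.0, 0.0]])  # role of (X2, Y2)
queries = np.array([[1.0, 0.0], [9.0, 0.0]])               # role of (X1, Y1)

tree = KDTree(targets)
# With k=2 both outputs have shape (n_queries, 2), ordered from nearest to farthest
knn_dist, knn_idx = tree.query(queries, k=2)
print(knn_idx)
print(knn_dist)
```

With `k` greater than 1 the returned arrays are two-dimensional, so code that writes the result into DataFrame columns has to pick a column of `knn_idx` (or create one output column per neighbour).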

For 1e4 (x, y) pairs in both sets, running time:

1.43 s ± 55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 (x, y) pairs in both sets, running time:

17 s ± 767 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using cdist, proposed by @Dimon

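import numpy as np
from scipy.spatial.distance import cdist

# Position of the nearest (X2, Y2) point for every (X1, Y1) point
nearest = np.argmin(cdist(df[['X1','Y1']], df[['X2','Y2']], metric='euclidean'), axis=1)
# Overwrite the second group so each row holds the point nearest to that row's (X1, Y1)
df[['X2','Y2']] = df[['X2','Y2']].to_numpy()[nearest]
df['D'] = np.linalg.norm(df[['X1','Y1']].values - df[['X2','Y2']].values, axis=1)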

For 1e4 (x, y) pairs in both sets, running time:

543 ms ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 (x, y) pairs in both sets, running time:

MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
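
One possible way around this error, which is not part of the original answer, is to compute the distance matrix in row blocks so that only a slice of it exists at any time. The sketch below assumes plain NumPy arrays as input; the function name `nearest_chunked` and the `chunk` parameter are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nearest_chunked(p1: np.ndarray, p2: np.ndarray, chunk: int = 1000):
    """Return (index into p2, distance) of the nearest p2 point for every p1 point."""
    idx = np.empty(len(p1), dtype=np.intp)
    dist = np.empty(len(p1), dtype=float)
    for start in range(0, len(p1), chunk):
        # Only a (<= chunk, len(p2)) block of the distance matrix is held in memory
        block = cdist(p1[start:start + chunk], p2)
        j = block.argmin(axis=1)
        idx[start:start + chunk] = j
        dist[start:start + chunk] = block[np.arange(len(j)), j]
    return idx, dist


# Random made-up points, used only to exercise the function
rng = np.random.default_rng(0)
a = rng.random((250, 2))
b = rng.random((300, 2))
chunk_idx, chunk_dist = nearest_chunked(a, b, chunk=64)
```

With `chunk=1000` and 1e5 points in the second group, each block holds 1000 × 100000 float64 values, roughly 0.75 GiB, instead of the full 74.5 GiB matrix.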

Using NumPy, as proposed by @Valdi_Bo

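import numpy as np
import pandas as pd
# diffs[i, j] = (X2, Y2)[j] - (X1, Y1)[i], shape (n, n, 2)
diffs = df.iloc[:, 2:].values[np.newaxis, :, :]\
    - df.iloc[:, :2].values[:, np.newaxis, :]
diffs2 = (diffs ** 2).sum(axis=2)  # squared distances, shape (n, n)
# axis=1 reduces over the second group, giving one result per (X1, Y1) point
result = pd.Series(np.sqrt(diffs2.min(axis=1)), name='minDist')
diffs2.argmin(axis=1)  # row position of the nearest (X2, Y2) point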

For 1e4 (x, y) pairs in both sets, running time:

1.6 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 (x, y) pairs in both sets, running time:

MemoryError: Unable to allocate 149. GiB for an array with shape (100000, 100000, 2) and data type float64
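
The sizes in the two MemoryError messages follow directly from the array shapes. As a quick check, assuming 8 bytes per float64 element and 1 GiB = 2**30 bytes (the variable names are only illustrative):

```python
n = 100_000
bytes_per_float64 = 8
gib = 2 ** 30

cdist_matrix_gib = n * n * bytes_per_float64 / gib         # (n, n) distance matrix
broadcast_diffs_gib = n * n * 2 * bytes_per_float64 / gib  # (n, n, 2) difference array
print(f"{cdist_matrix_gib:.1f} GiB, {broadcast_diffs_gib:.0f} GiB")  # 74.5 GiB, 149 GiB
```

The broadcasting approach needs twice the memory of cdist because it materialises the per-coordinate differences before summing them.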
