Find all shortest Euclidean distances between two groups of point coordinates
Problem description
I have a Pandas DataFrame in which columns X1, Y1 hold the point coordinates of the first group and columns X2, Y2 hold the point coordinates of the second group. The two groups are independent of each other; they just happen to be in the same DataFrame. Example:
X1,Y1,X2,Y2
41246.438,0.49,38791.673,0.49
41304.5,0.491,38921.557,0.491
41392.062,0.492,39037.135,0.492
41515.5,0.493,39199.972,0.493
41636.062,0.494,39346.561,0.494
41795.188,0.495,39477.63,0.495
42027.75,0.496,39576.275,0.496
42252.25,0.497,39732.102,0.497
42486.812,0.498,39833.753,0.498
42739.062,0.499,39949.13,0.499
43012.125,0.5,40135.42,0.5
43472.75,0.5,40292.017,0.5
43909.562,0.501,40479.452,0.501
44312.625,0.502,40725.329,0.502
44799.938,0.503,40950.05,0.503
45294.938,0.504,41214.136,0.504
45729.625,0.505,41514.213,0.505
45942.438,0.506,41943.208,0.506
46067.688,0.507,42296.643,0.507
46215,0.508,42653.477,0.508
46336.75,0.509,43138.834,0.509
46476.562,0.51,43557.815,0.51
46584.25,0.511,43966.564,0.511
46654.75,0.512,44166.996,0.512
46707.75,0.513,44310.557,0.513
46774.188,0.514,44410.069,0.514
46832.062,0.515,44518.045,0.515
46905.062,0.516,44608.646,0.516
46976.562,0.517,44678.073,0.517
47077.938,0.518,44727.393,0.518
47215.688,0.519,44786.498,0.519
47290.625,0.52,44845.867,0.52
47351.5,0.521,44915.072,0.521
For each point in columns X1, Y1 I need to find the point in columns X2, Y2 for which the Euclidean distance between the two points is the shortest.
As a result, I need to place the point found in columns X2, Y2 in the same row as the corresponding point in X1, Y1, and add the computed shortest Euclidean distance to that row in another column D. This process is then repeated for each point in columns X1, Y1.
One way to do this is to iterate over the rows of columns X1, Y1 and, for each row, find the shortest Euclidean distance to the points in columns X2, Y2. There may be better ways to do it without writing for loops.
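For reference, the row-by-row approach mentioned in the question could look roughly like the sketch below. The function name nearest_by_loop and the tiny sample frame are illustrative assumptions, not part of the original post; the loop is meant only as a baseline against which vectorized solutions can be checked.

```python
import numpy as np
import pandas as pd


def nearest_by_loop(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline: for each (X1, Y1) find the closest (X2, Y2) with an explicit loop."""
    p1 = df[["X1", "Y1"]].to_numpy(dtype=float)
    p2 = df[["X2", "Y2"]].to_numpy(dtype=float)
    rows = []
    for x1, y1 in p1:
        # Distances from this one point to every point of the second group.
        d = np.hypot(p2[:, 0] - x1, p2[:, 1] - y1)
        j = int(d.argmin())
        rows.append((x1, y1, p2[j, 0], p2[j, 1], d[j]))
    return pd.DataFrame(rows, columns=["X1", "Y1", "X2", "Y2", "D"])


# Hypothetical sample data, chosen so the nearest neighbours are easy to verify.
sample = pd.DataFrame({
    "X1": [0.0, 10.0, 5.0],
    "Y1": [0.0, 0.0, 5.0],
    "X2": [9.0, 4.0, 1.0],
    "Y2": [0.0, 5.0, 0.0],
})
result = nearest_by_loop(sample)
```

The work grows with the product of the two group sizes, which is why the answers below rely on index structures or vectorized routines instead.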
Recommended answer
Solution
Use Faiss.
pip install faiss-cpu
Instead of IndexFlatL2 you can use the slightly faster IndexIVFFlat, which returns approximate rather than exact results.
import faiss
import numpy as np
import pandas as pd

def get_closest(df: pd.DataFrame) -> pd.DataFrame:
    d = 2  # dimensionality
    # Faiss expects C-contiguous float32 arrays
    xb = np.ascontiguousarray(df[["X2", "Y2"]].to_numpy(dtype=np.float32))
    xq = np.ascontiguousarray(df[["X1", "Y1"]].to_numpy(dtype=np.float32))
    index = faiss.IndexFlatL2(d)  # build the index
    index.add(xb)  # add vectors to the index
    D, I = index.search(xq, 1)  # actual search
    res_df = df[["X1", "Y1"]].copy()  # explicit copy avoids SettingWithCopyWarning
    nearest = df[["X2", "Y2"]].to_numpy()[I[:, 0]]  # original float64 coordinates
    res_df["X2"], res_df["Y2"] = nearest[:, 0], nearest[:, 1]
    res_df["distance"] = np.sqrt(D[:, 0])  # IndexFlatL2 returns squared L2 distances
    return res_df

get_closest(df)
Performance
For 1e4 (x, y) pairs in both sets, running time:
371 ms ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 vectors:
33.9 s ± 3.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
This should be similar to generating the full distance matrix with SciPy or NumPy, but it is much more efficient in terms of memory usage and does not require a further search over that matrix.
- In the function above, res_df starts as a selection of columns from df. Modifying such a selection without an explicit .copy() is not recommended, because pandas may warn (SettingWithCopyWarning) and the behaviour can be unpredictable. Omitting the copy saves memory; take a copy if you want to avoid surprises.
- If you need more than one neighbour for every point, that is easy to achieve with Faiss with minimal modifications.
Alternatives
Using KDTree
import pandas as pd
from scipy.spatial import KDTree

def get_closest(df: pd.DataFrame) -> pd.DataFrame:
    # Build the tree on the points being searched (X2, Y2) and query it with (X1, Y1)
    tree = KDTree(df[["X2", "Y2"]].values)
    dist, ind = tree.query(df[["X1", "Y1"]].values, k=1)  # k: desired number of neighbors
    res_df = df[["X1", "Y1"]].copy()  # explicit copy avoids SettingWithCopyWarning
    nearest = df[["X2", "Y2"]].to_numpy()[ind]
    res_df["X2"], res_df["Y2"] = nearest[:, 0], nearest[:, 1]
    res_df["distance"] = dist
    return res_df

get_closest(df)
For 1e4 (x, y) pairs in both sets, running time:
1.43 s ± 55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 (x, y) pairs in both sets, running time:
17 s ± 767 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
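The k argument of KDTree.query also covers the case where more than one neighbour per point is wanted. The following is a minimal sketch on made-up data (the arrays queries and candidates are hypothetical and stand in for the X1, Y1 and X2, Y2 columns):

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical sample data: three query points and four candidate points.
queries = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]])
candidates = np.array([[1.0, 0.0], [9.0, 0.0], [4.0, 5.0], [0.0, 3.0]])

tree = KDTree(candidates)             # index the points being searched
dist, ind = tree.query(queries, k=2)  # two nearest candidates per query

# With k > 1 both arrays have shape (n_queries, k), ordered by distance,
# so column 0 is the nearest neighbour and column 1 the second nearest.
nearest, second = ind[:, 0], ind[:, 1]
```

Each row of ind indexes into candidates, so candidates[ind[:, 0]] gives the coordinates of the nearest neighbours.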
Using cdist, proposed by @Dimon
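import numpy as np
from scipy.spatial.distance import cdist

# Row position of the nearest (X2, Y2) point for each (X1, Y1) point
nearest = np.argmin(cdist(df[['X1','Y1']], df[['X2','Y2']], metric='euclidean'), axis=1)
# Overwrite X2, Y2 in place with the matched points, then compute the distance
df[['X2','Y2']] = df[['X2','Y2']].to_numpy()[nearest]
df['D'] = np.linalg.norm(df[['X1','Y1']].values - df[['X2','Y2']].values, axis=1)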
For 1e4 (x, y) pairs in both sets, running time:
543 ms ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 (x, y) pairs in both sets, running time:
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
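The MemoryError comes from materializing the full pairwise distance matrix at once. One possible workaround, which is not part of the original answer, is to process the query points in chunks so that only a small block of distances exists at any time. In the sketch below the function name nearest_chunked and the chunk_size parameter are assumptions made for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist


def nearest_chunked(p1: np.ndarray, p2: np.ndarray, chunk_size: int = 1024):
    """For each row of p1, return (index into p2, distance) of the nearest p2 row.

    Only a (chunk_size, len(p2)) block of distances is held in memory at a time.
    """
    idx = np.empty(len(p1), dtype=np.intp)
    dist = np.empty(len(p1), dtype=float)
    for start in range(0, len(p1), chunk_size):
        block = cdist(p1[start:start + chunk_size], p2)  # Euclidean by default
        j = block.argmin(axis=1)
        idx[start:start + len(j)] = j
        dist[start:start + len(j)] = block[np.arange(len(j)), j]
    return idx, dist


# Hypothetical random data used only to exercise the function.
rng = np.random.default_rng(0)
a = rng.random((500, 2))
b = rng.random((300, 2))
idx, dist = nearest_chunked(a, b, chunk_size=64)
```

This trades a Python-level loop over chunks for bounded memory; the total amount of arithmetic is unchanged, so it is not expected to be faster than the tree-based approaches.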
Using NumPy, as proposed by @Valdi_Bo
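import numpy as np
import pandas as pd

# diffs[i, j] = (X2[j], Y2[j]) - (X1[i], Y1[i]); shape (n, n, 2)
diffs = df.iloc[:, 2:].values[np.newaxis, :, :]\
    - df.iloc[:, :2].values[:, np.newaxis, :]
diffs2 = (diffs ** 2).sum(axis=2)  # squared distances, shape (n, n)
# axis=1 reduces over the second group, giving one result per (X1, Y1) point
result = pd.Series(np.sqrt(diffs2.min(axis=1)), name='minDist')
nearest_idx = diffs2.argmin(axis=1)  # row positions of the nearest (X2, Y2)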
For 1e4 (x, y) pairs in both sets, running time:
1.6 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 (x, y) pairs in both sets, running time:
MemoryError: Unable to allocate 149. GiB for an array with shape (100000, 100000, 2) and data type float64
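The sizes reported in the two MemoryError messages follow directly from the array shapes: a float64 element takes 8 bytes, and the broadcasting approach holds an extra trailing axis of length 2. A small sketch of that arithmetic (the helper name dense_gib is hypothetical):

```python
def dense_gib(*shape: int, itemsize: int = 8) -> float:
    """Size in GiB of a dense array with the given shape (float64 by default)."""
    n_bytes = itemsize
    for dim in shape:
        n_bytes *= dim
    return n_bytes / 2**30


pairwise = dense_gib(100_000, 100_000)      # full distance matrix from cdist
broadcast = dense_gib(100_000, 100_000, 2)  # NumPy difference tensor
```

Both values grow quadratically with the number of points, so any approach that builds the full matrix stops being practical well before 1e5 points on typical hardware.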