Get the nearest point for each route in a pandas dataframe

Question

I have a dataframe:

  routeId  latitude_value  longitude_value
  r1       28.210216        22.813209
  r2       28.216103        22.496735
  r3       28.161786        22.842318
  r4       28.093110        22.807081
  r5       28.220370        22.503500
  r6       28.220370        22.503500
  r7       28.220370        22.503500

From this, I want to generate a dataframe df2, something like this:

routeId    nearest
  r1         r3         (for example)
  r2       ...    similarly for all the routes.

The logic I want to implement is:

For every route, I should find the Euclidean distance to all other routes, iterating over routeId.

There is a function for calculating the Euclidean distance:

dist = math.hypot(x2 - x1, y2 - y1)

But I am confused about how to build a function that takes a dataframe, or how to use .apply() for this.

def  get_nearest_route():
    .....
    return df2

Recommended answer

We can use scipy.spatial.distance.cdist (or nested for loops) to build the matrix of pairwise distances, then replace each row's minimum with the matching route name to find the closest route, i.e.

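import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between all (latitude, longitude) pairs
coords = df[['latitude_value', 'longitude_value']]
mat = cdist(coords, coords, metric='euclidean')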

# If you don't want SciPy, you can use plain Python instead:
# import math
# mat = []
# for i,j in zip(df['latitude_value'],df['longitude_value']):
#     k = []
#     for l,m in zip(df['latitude_value'],df['longitude_value']):
#         k.append(math.hypot(i - l, j - m))
#     mat.append(k)
# mat = np.array(mat)

new_df = pd.DataFrame(mat, index=df['routeId'], columns=df['routeId']) 

new_df

routeId        r1        r2        r3        r4        r5        r6        r7
routeId                                                                      
r1       0.000000  0.316529  0.056505  0.117266  0.309875  0.309875  0.309875
r2       0.316529  0.000000  0.349826  0.333829  0.007998  0.007998  0.007998
r3       0.056505  0.349826  0.000000  0.077188  0.343845  0.343845  0.343845
r4       0.117266  0.333829  0.077188  0.000000  0.329176  0.329176  0.329176
r5       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r6       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r7       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000    
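
The same distance matrix can also be built with NumPy broadcasting alone. The following is a minimal sketch, assuming the column names and sample values from the question; the names coords, diff, mat_np and dist_df are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "routeId": ["r1", "r2", "r3", "r4", "r5", "r6", "r7"],
    "latitude_value": [28.210216, 28.216103, 28.161786, 28.093110,
                       28.220370, 28.220370, 28.220370],
    "longitude_value": [22.813209, 22.496735, 22.842318, 22.807081,
                        22.503500, 22.503500, 22.503500],
})

coords = df[["latitude_value", "longitude_value"]].to_numpy()

# coords[:, None, :] has shape (n, 1, 2) and coords[None, :, :] has shape
# (1, n, 2), so the subtraction broadcasts to every pair of points: (n, n, 2).
diff = coords[:, None, :] - coords[None, :, :]

# Euclidean norm over the last axis gives the (n, n) distance matrix
mat_np = np.sqrt((diff ** 2).sum(axis=-1))

dist_df = pd.DataFrame(mat_np, index=df["routeId"], columns=df["routeId"])
print(dist_df.round(6))
```

This allocates an (n, n, 2) intermediate array, so it is best suited to small or medium inputs.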

# new_df[new_df != 0].min() gives the smallest non-zero distance for each route
# (the matrix is symmetric, so column minima equal row minima).
# Entries equal to their row's minimum are replaced with the column name,
# and every other entry with False.
closest = np.where(new_df.eq(new_df[new_df != 0].min(), axis=0), new_df.columns, False)

# Drop the False entries and collect the remaining column names as a list.
df['close'] = [i[i.astype(bool)].tolist() for i in closest]


   routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500          [r2]
5      r6       28.220370        22.503500          [r2]
6      r7       28.220370        22.503500          [r2] 

If you don't want to ignore zero distances (so that routes with identical coordinates count as each other's nearest), then:

# Work on a copy of the distance values so that new_df itself is not modified
arr = new_df.to_numpy(copy=True)
# A point must not be matched with itself, so set the diagonal to NaN
np.fill_diagonal(arr, np.nan)

# Replace each row's non-NaN minimum with the column name and everything else with False
new_close = np.where(arr == np.nanmin(arr, axis=1)[:, None], new_df.columns, False)

# Collect the column names, ignoring the False entries
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]

   routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500      [r6, r7]
5      r6       28.220370        22.503500      [r5, r7]
6      r7       28.220370        22.503500      [r5, r6]
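
To tie this back to the get_nearest_route() stub in the question, here is one possible sketch that returns a df2 with a single nearest route per row. It assumes the column names from the question, uses np.hypot (the vectorized counterpart of math.hypot) in place of cdist, and breaks ties by taking the first match. The function body is an illustration, not the original answerer's code.

```python
import numpy as np
import pandas as pd

def get_nearest_route(df):
    """Return a dataframe mapping each routeId to one nearest other route."""
    lat = df["latitude_value"].to_numpy()
    lon = df["longitude_value"].to_numpy()

    # Pairwise Euclidean distances via broadcasting, shape (n, n)
    dist = np.hypot(lat[:, None] - lat[None, :], lon[:, None] - lon[None, :])

    # A route must not be reported as its own nearest neighbour
    np.fill_diagonal(dist, np.inf)

    # argmin returns the first position of the minimum in each row,
    # so ties are resolved in favour of the earliest route
    nearest_pos = dist.argmin(axis=1)

    route_ids = df["routeId"].to_numpy()
    return pd.DataFrame({"routeId": route_ids, "nearest": route_ids[nearest_pos]})

# Sample data copied from the question
df = pd.DataFrame({
    "routeId": ["r1", "r2", "r3", "r4", "r5", "r6", "r7"],
    "latitude_value": [28.210216, 28.216103, 28.161786, 28.093110,
                       28.220370, 28.220370, 28.220370],
    "longitude_value": [22.813209, 22.496735, 22.842318, 22.807081,
                        22.503500, 22.503500, 22.503500],
})

df2 = get_nearest_route(df)
print(df2)
```

Unlike the list-valued close column above, this keeps only one neighbour per route: r2 maps to r5 even though r6 and r7 are equally close.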
