Python code to filter closest distance pairs


Question

This is my code. Please note that this is just a toy dataset; my real dataset contains about 1,000 entries in each table.

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

/some calc code here/

##df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()]##


df_dist_long.to_csv('dist.csv',float_format='%.2f')

When I add df_dist_long.loc[df_dist_long.sort_values('Dist(km)').groupby('neigh_B')['city_A'].min()], I get this error:

  File "C:\Python\Python38\lib\site-packages\pandas\core\groupby\groupby.py", line 656, in wrapper
    raise ValueError
ValueError

Without it, the output looks like this:

    city_A  neigh_B Dist(km)
0   City1   Neigh1  6.45
1   City2   Neigh1  6.42
2   City3   Neigh1  7.93
3   City4   Neigh1  5.56
4   City1   Neigh2  8.25
5   City2   Neigh2  6.67
6   City3   Neigh2  8.55
7   City4   Neigh2  8.92
8   City1   Neigh3  7.01   ..... and so on

What I want is another table that keeps only the city closest to each neighbourhood. For example, for 'Neigh1', City4 is the closest (smallest distance). So I want a table like the one below:

city_A  neigh_B Dist(km)
0   City4   Neigh1  5.56
1   City3   Neigh2  4.32
2   City1   Neigh3  7.93
3   City2   Neigh4  3.21
4   City4   Neigh5  4.56
5   City5   Neigh6  6.67
6   City3   Neigh7  6.16
 ..... and so on

It doesn't matter if a city name is repeated; I just want the closest pairs saved to another CSV. How can this be implemented? Experts, please help!

Recommended answer

If you only want the closest city for each neighbourhood, you don't need to calculate the full distance matrix.
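On the filter from the question: groupby('neigh_B')['city_A'].min() returns the alphabetically smallest city name in each group rather than row labels, so it is not something .loc can use to pick rows. If the long table already exists, idxmin on the distance column is one way to keep the closest row per neighbourhood. A minimal sketch, where the toy rows are copied from the question's sample output and the names closest_idx, closest and closest.csv are placeholders of mine:

```python
import pandas as pd

# Toy long-format table; rows copied from the sample output in the question.
df_dist_long = pd.DataFrame({
    'city_A':   ['City1', 'City2', 'City3', 'City4',
                 'City1', 'City2', 'City3', 'City4'],
    'neigh_B':  ['Neigh1'] * 4 + ['Neigh2'] * 4,
    'Dist(km)': [6.45, 6.42, 7.93, 5.56, 8.25, 6.67, 8.55, 8.92],
})

# idxmin returns, per neighbourhood, the row label of the smallest distance.
closest_idx = df_dist_long.groupby('neigh_B')['Dist(km)'].idxmin()

# Select those rows and renumber them from 0.
closest = df_dist_long.loc[closest_idx].reset_index(drop=True)
print(closest)

# To save the result to a separate CSV (file name is a placeholder):
# closest.to_csv('closest.csv', float_format='%.2f', index=False)
```

This only makes sense when the long table is small enough to build in the first place; the tree-based lookup below avoids building it.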

Here is a working code example of a tree-based lookup, though I get different output than yours. Maybe a lat/long mistake.

I used your data:

import pandas as pd
import numpy as np
import sklearn.neighbors

locations_stores = pd.DataFrame({
    'city_A' :     ['City1', 'City2', 'City3', 'City4', ],
    'latitude_A':  [ 56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [ 4.899779, 4.871195, 4.893847, 4.912281]
})
locations_neigh = pd.DataFrame({
    'neigh_B':      ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4','Neigh5'],
    'latitude_B' : [ 53.314, 53.318, 53.381, 53.338,53.7364],
    'longitude_B': [ 4.955,4.975,4.855,4.873,4.425]
})

Create a BallTree that we can query:

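from sklearn.neighbors import BallTree

# Coordinates as (n, 2) arrays of [latitude, longitude], in degrees as given.
stores_gps = locations_stores[['latitude_A', 'longitude_A']].values
neigh_gps = locations_neigh[['latitude_B', 'longitude_B']].values

# Note: the haversine metric expects radians, so passing degrees here
# distorts the distances reported further down.
tree = BallTree(stores_gps, leaf_size=15, metric='haversine')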

For each neighbourhood we want the closest (k=1) city/store:

distance, index = tree.query(neigh_gps, k=1)

# The haversine metric returns angles on the unit sphere; scale by the Earth's radius to get km.
earth_radius = 6371

distance_in_km = distance * earth_radius

We can build a DataFrame from the result with:

pd.DataFrame({
    'Neighborhood' : locations_neigh.neigh_B,
    'Closest_city' : locations_stores.city_A[ np.array(index)[:,0] ].values,
    'Distance_to_city' : distance_in_km[:,0]
})

This gives me:

  Neighborhood Closest_city  Distance_to_city
0       Neigh1        City2      19112.334106
1       Neigh2        City2      19014.154744
2       Neigh3        City2      18851.168702
3       Neigh4        City2      19129.555188
4       Neigh5        City4      15498.181486

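Since our outputs differ, there is a mistake somewhere to correct. The most likely one is in the tree above: the haversine metric expects coordinates in radians, and the tree was built and queried with degrees, which would explain distances of 15,000 to 19,000 km for points that are only about three degrees of latitude (roughly 330 km) apart. Your distances of 5 to 9 km do not match the posted coordinates either, so swapped lat/long on your side is another possibility; I am only guessing here. Either way, this is the approach you want, especially for the size of your data.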
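A sketch of the same lookup with the coordinates converted to radians first, which is what scikit-learn's haversine metric expects. The data is the toy data from the question; the names stores_rad, neigh_rad and closest, and the result column names, are my own choices:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

locations_stores = pd.DataFrame({
    'city_A':      ['City1', 'City2', 'City3', 'City4'],
    'latitude_A':  [56.361176, 56.34061, 56.374749, 56.356624],
    'longitude_A': [4.899779, 4.871195, 4.893847, 4.912281],
})
locations_neigh = pd.DataFrame({
    'neigh_B':     ['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'],
    'latitude_B':  [53.314, 53.318, 53.381, 53.338, 53.7364],
    'longitude_B': [4.955, 4.975, 4.855, 4.873, 4.425],
})

# The haversine metric expects [latitude, longitude] in radians.
stores_rad = np.radians(locations_stores[['latitude_A', 'longitude_A']].to_numpy())
neigh_rad = np.radians(locations_neigh[['latitude_B', 'longitude_B']].to_numpy())

tree = BallTree(stores_rad, leaf_size=15, metric='haversine')

# k=1: one nearest store per neighbourhood; distances are angles in radians.
distance, index = tree.query(neigh_rad, k=1)

earth_radius_km = 6371
closest = pd.DataFrame({
    'neigh_B':  locations_neigh['neigh_B'],
    'city_A':   locations_stores['city_A'].to_numpy()[index[:, 0]],
    'Dist(km)': distance[:, 0] * earth_radius_km,
})
print(closest)
```

With these coordinates the distances should land at roughly 300 km rather than thousands, since the stores sit about three degrees of latitude north of the neighbourhoods.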

For the full matrix, use:

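# On recent scikit-learn releases DistanceMetric is imported from sklearn.metrics instead.
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

earth_radius = 6371

# pairwise expects radians; the result has one row per store and one column per neighbourhood.
haversine_distances = dist.pairwise(np.radians(stores_gps), np.radians(neigh_gps) )
haversine_distances *= earth_radius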

This gives the full matrix. With about 1,000 entries per table that is only 10^6 distances (about 8 MB as float64), but the size grows with the product of the two table sizes, so for much larger inputs it becomes slow and can run into memory limits.

You could use numpy's np.argmin(haversine_distances, axis=0) to get results similar to the BallTree. Because the rows are stores and the columns are neighbourhoods, axis=0 gives, for each neighbourhood, the index of the closest store, which can be used just like index in the BallTree example.
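To make the axis choice concrete, here is a small NumPy-only sketch of the full-matrix route. The helper haversine_matrix is a hand-written stand-in of mine for DistanceMetric.pairwise (so the snippet does not need scikit-learn), and the array names are placeholders:

```python
import numpy as np

def haversine_matrix(a_deg, b_deg, radius_km=6371.0):
    """Great-circle distances in km between two (n, 2) arrays of
    [latitude, longitude] in degrees; result shape is (len(a), len(b))."""
    a = np.radians(np.asarray(a_deg, dtype=float))[:, None, :]  # (n_a, 1, 2)
    b = np.radians(np.asarray(b_deg, dtype=float))[None, :, :]  # (1, n_b, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point overshoot before the sqrt/arcsin.
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))

city_names = np.array(['City1', 'City2', 'City3', 'City4'])
neigh_names = np.array(['Neigh1', 'Neigh2', 'Neigh3', 'Neigh4', 'Neigh5'])
stores = np.array([[56.361176, 4.899779], [56.34061, 4.871195],
                   [56.374749, 4.893847], [56.356624, 4.912281]])
neighs = np.array([[53.314, 4.955], [53.318, 4.975], [53.381, 4.855],
                   [53.338, 4.873], [53.7364, 4.425]])

# Rows are stores, columns are neighbourhoods.
dist_km = haversine_matrix(stores, neighs)

# axis=0 scans down each column, i.e. over all stores for one neighbourhood.
closest_store = np.argmin(dist_km, axis=0)
closest_dist = dist_km[closest_store, np.arange(dist_km.shape[1])]

for name, i, d in zip(neigh_names, closest_store, closest_dist):
    print(name, city_names[i], round(float(d), 2))
```

Using axis=1 on the same matrix would instead answer the opposite question: the closest neighbourhood for each store.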
