Pandas Dataframe: join items in range based on their geo coordinates (longitude and latitude)

Question

I have a dataframe that contains places with their latitude and longitude, for example cities.

import pandas as pd

df = pd.DataFrame([{'city':"Berlin", 'lat':52.5243700, 'lng':13.4105300},
                   {'city':"Potsdam", 'lat':52.3988600, 'lng':13.0656600},
                   {'city':"Hamburg", 'lat':53.5753200, 'lng':10.0153400}])

Now I'm trying to get all cities within a radius around another city, say all cities within 500 km of Berlin, within 500 km of Hamburg, and so on. I would do this by duplicating the original dataframe and joining the two copies with a distance function.

The intermediate result would look like this:

Berlin --> Potsdam
Berlin --> Hamburg
Potsdam --> Berlin
Potsdam --> Hamburg
Hamburg --> Potsdam
Hamburg --> Berlin

The final result after grouping (reducing) should look like this. Remark: it would be nice if the list of values included all columns of the city.

Berlin --> [Potsdam, Hamburg]
Potsdam --> [Berlin, Hamburg]
Hamburg --> [Berlin, Potsdam]

Or just the count of cities within 500 km of each city.

Berlin --> 2
Potsdam --> 2
Hamburg --> 2

Since I'm quite new to Python, I would appreciate any starting point. I'm familiar with the haversine distance, but I'm not sure whether there are useful distance/spatial methods in Scipy or Pandas.

I would be glad if you could give me a starting point. So far I have tried following this post.

Update: The original idea behind this question comes from the Two Sigma Connect Rental Listing Kaggle competition. The idea is to get the listings within 100 m of another listing. This a) indicates density and therefore a popular area, and b) if the addresses are compared, lets you find out whether there is a crossing and therefore a noisy area. So you do not need the full item-to-item relation, since you need to compare not only the distance but also the address and other metadata. PS: I'm not uploading a solution to Kaggle. I just want to learn.

Recommended answer

You can use:

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):

    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
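As a quick sanity check of the function above, the sketch below repeats it in a self-contained form and evaluates it on the Berlin and Potsdam coordinates from the question. The variable names `berlin` and `potsdam` are introduced here for illustration only; note the argument order is longitude first, then latitude.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # 6371 km: mean Earth radius

# (lng, lat) tuples, illustrative names
berlin = (13.41053, 52.52437)
potsdam = (13.06566, 52.39886)

d = haversine(*berlin, *potsdam)
print(round(d, 1))  # roughly 27 km
```

The distance is symmetric and is zero for identical points, which is why the self-pairs are dropped before counting neighbours.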

First, do a cross join with merge, then remove the rows with the same value in city_x and city_y by boolean indexing:

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566
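On pandas 1.2 or newer, the temporary key column should not be necessary, because `merge` accepts `how='cross'`. The following is a minimal sketch under that version assumption; the name `pairs` is chosen here for illustration so that the original `df` is not overwritten.

```python
import pandas as pd

df = pd.DataFrame([{'city': "Berlin", 'lat': 52.5243700, 'lng': 13.4105300},
                   {'city': "Potsdam", 'lat': 52.3988600, 'lng': 13.0656600},
                   {'city': "Hamburg", 'lat': 53.5753200, 'lng': 10.0153400}])

# Cartesian product of df with itself; overlapping column names
# get the default suffixes _x and _y (requires pandas >= 1.2)
pairs = pd.merge(df, df, how='cross')

# drop the pairs where a city is matched with itself
pairs = pairs[pairs.city_x != pairs.city_y]
print(pairs[['city_x', 'city_y']])
```

Keeping the result in a separate variable also makes it easier to rerun later steps against the unmodified input.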

Then apply the haversine function:

df['dist'] = df.apply(lambda row: haversine(row['lng_x'], 
                                            row['lat_x'], 
                                            row['lng_y'], 
                                            row['lat_y']), axis=1)

Filter by distance:

df = df[df.dist < 500]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.215704
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.223782
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.215704
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.464120
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.223782
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.464120

Finally, create a list or get the size with groupby:

df1 = df.groupby('city_x')['city_y'].apply(list)
print (df1)
city_x
Berlin     [Potsdam, Hamburg]
Hamburg     [Berlin, Potsdam]
Potsdam     [Berlin, Hamburg]
Name: city_y, dtype: object

df2 = df.groupby('city_x')['city_y'].size()
print (df2)
city_x
Berlin     2
Hamburg    2
Potsdam    2
dtype: int64
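The question also asks for the neighbour list to carry all columns of each city. One possible way, sketched here as an assumption rather than as part of the answer, is to collect the `_y` columns of each group as a list of records. The names `pairs`, `neighbor_cols` and `neighbors` are illustrative, and the block rebuilds the filtered pairs so it runs on its own.

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized great-circle distance in km (inputs in decimal degrees)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame([{'city': "Berlin", 'lat': 52.5243700, 'lng': 13.4105300},
                   {'city': "Potsdam", 'lat': 52.3988600, 'lng': 13.0656600},
                   {'city': "Hamburg", 'lat': 53.5753200, 'lng': 10.0153400}])

# cross join via a temporary key, then drop self-pairs
df['tmp'] = 1
pairs = pd.merge(df, df, on='tmp').drop(columns='tmp')
pairs = pairs[pairs.city_x != pairs.city_y].copy()

pairs['dist'] = haversine_np(pairs['lng_x'], pairs['lat_x'],
                             pairs['lng_y'], pairs['lat_y'])
pairs = pairs[pairs.dist < 500]

# one list of neighbour records (all *_y columns plus the distance) per city
neighbor_cols = ['city_y', 'lat_y', 'lng_y', 'dist']
neighbors = {city: group[neighbor_cols].to_dict('records')
             for city, group in pairs.groupby('city_x')}
print(neighbors['Berlin'])
```

A plain dictionary is used here instead of `groupby(...).apply(...)` to keep the result type obvious; each value is a list of dicts, one per neighbouring city.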

It is also possible to use a numpy haversine solution:

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # Earth radius in km; the scalar version above uses 6371, so results differ slightly
    return km

# start again from the original three-row df
df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
#print (df)

df['dist'] = haversine_np(df['lng_x'],df['lat_x'],df['lng_y'],df['lat_y'])
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.198616
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.063541
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.198616
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.311890
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.063541
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.311890
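If only the counts are needed, the merge can be skipped by computing the full distance matrix with numpy broadcasting. This is a sketch, not part of the answer above: the column name `n_within_500km` is made up for illustration, the 6371 km radius is an assumption matching the scalar function, and the matrix still needs memory proportional to the square of the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'city': "Berlin", 'lat': 52.5243700, 'lng': 13.4105300},
                   {'city': "Potsdam", 'lat': 52.3988600, 'lng': 13.0656600},
                   {'city': "Hamburg", 'lat': 53.5753200, 'lng': 10.0153400}])

lat = np.radians(df['lat'].to_numpy())
lng = np.radians(df['lng'].to_numpy())

# pairwise differences: entry [i, j] compares row i with row j
dlat = lat[:, None] - lat[None, :]
dlng = lng[:, None] - lng[None, :]

a = (np.sin(dlat / 2.0) ** 2
     + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2.0) ** 2)
dist_km = 6371 * 2 * np.arcsin(np.sqrt(a))

# boolean neighbour matrix, ignoring each city's zero distance to itself
within = dist_km < 500
np.fill_diagonal(within, False)

df['n_within_500km'] = within.sum(axis=1)
print(df[['city', 'n_within_500km']])
```

For much larger inputs, such as the rental listings mentioned in the question's update, a spatial index is probably a better fit than any all-pairs approach.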
