How to find the closest match based on 2 keys from one dataframe to another?


Problem description

I have 2 dataframes I'm working with. One has a bunch of locations and their coordinates (longitude, latitude). The other is a weather data set with data from weather stations all over the world and their respective coordinates. I am trying to link the nearest weather station to each location in my data set. The weather station names and my location names do not match.

I am left trying to link them by closest match in coordinates and have no idea where to begin.

I was thinking of something along the lines of

np.abs((location['latitude']-weather['latitude'])+(location['longitude']-weather['longitude']))

Examples of each

location...

Location   Latitude   Longitude Component
     A  39.463744  -76.119411    Active
     B  39.029252  -76.964251    Active
     C  33.626946  -85.969576    Active
     D  49.286337   10.567013    Active
     E  37.071777  -76.360785    Active

weather...

     Station Code             Station Name  Latitude  Longitude
     US1FLSL0019    PORT ST. LUCIE 4.0 NE   27.3237   -80.3111
     US1TXTV0133            LAKEWAY 2.8 W   30.3597   -98.0252
     USC00178998                  WALTHAM   44.6917   -68.3475
     USC00178998                  WALTHAM   44.6917   -68.3475
     USC00178998                  WALTHAM   44.6917   -68.3475

The output would be a new column on the location dataframe containing the name of the closest station.

However, I am not sure how to loop through both dataframes to accomplish this. Any help would be greatly appreciated.

Thanks, Scott

Solution

Let's say you have a distance function dist that you want to minimize:

import numpy as np

def dist(lat1, long1, lat2, long2):
    return np.abs((lat1-lat2)+(long1-long2))
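One caveat with the formula above: the latitude and longitude differences are summed before the absolute value is taken, so they can cancel out. For example, (0, 0) and (1, -1) get a distance of 0. If you want real geographic distance, one common alternative is the haversine (great-circle) formula. The sketch below is not part of the original answer; the function name haversine_km and the Earth radius constant are assumptions chosen for illustration.

```python
import numpy as np

# Mean Earth radius in kilometres (an assumed constant, not from the answer).
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between two points given in degrees."""
    # Convert all four coordinates from degrees to radians.
    lat1, long1, lat2, long2 = map(np.radians, (lat1, long1, lat2, long2))
    dlat = lat2 - lat1
    dlong = long2 - long1
    # Haversine formula.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlong / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Because it takes the same four arguments as dist, it can be dropped into the rest of the answer wherever dist is called.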

For a given location, you can find the nearest station as follows:

lat = 39.463744
long = -76.119411
weather.apply(
    lambda row: dist(lat, long, row['Latitude'], row['Longitude']), 
    axis=1)

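This calculates the distance from the given location to every weather station. idxmin then returns the index label of the smallest distance, which you can use to look up the closest station's name. (The code assumes the name column is called StationName; the table in the question shows it as Station Name, so adjust to match your data.)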

distances = weather.apply(
    lambda row: dist(lat, long, row['Latitude'], row['Longitude']), 
    axis=1)
weather.loc[distances.idxmin(), 'StationName']
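To try this step on its own, here is a minimal self-contained sketch. It rebuilds the weather dataframe from the three distinct sample rows in the question (the column names StationCode and StationName are assumptions following the answer's naming) and looks up the nearest station for location A.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; column names are assumed, see the note above.
weather = pd.DataFrame({
    'StationCode': ['US1FLSL0019', 'US1TXTV0133', 'USC00178998'],
    'StationName': ['PORT ST. LUCIE 4.0 NE', 'LAKEWAY 2.8 W', 'WALTHAM'],
    'Latitude': [27.3237, 30.3597, 44.6917],
    'Longitude': [-80.3111, -98.0252, -68.3475],
})

def dist(lat1, long1, lat2, long2):
    # The metric used in the answer.
    return np.abs((lat1 - lat2) + (long1 - long2))

lat, long = 39.463744, -76.119411  # location A from the question

# One distance per station, indexed like the weather dataframe.
distances = weather.apply(
    lambda row: dist(lat, long, row['Latitude'], row['Longitude']),
    axis=1)

nearest_label = distances.idxmin()  # index label, not a position
nearest_name = weather.loc[nearest_label, 'StationName']
print(nearest_name)  # WALTHAM
```

Since idxmin returns a label rather than a position, .loc is the right accessor here; .iloc would only agree with it when the index is the default 0..n-1 range.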

Let's put all this in a function:

def find_station(lat, long):
    distances = weather.apply(
        lambda row: dist(lat, long, row['Latitude'], row['Longitude']), 
        axis=1)
    return weather.loc[distances.idxmin(), 'StationName']

You can now get all the nearest stations by applying it to the locations dataframe:

locations.apply(
    lambda row: find_station(row['Latitude'], row['Longitude']), 
    axis=1)

Output:

0         WALTHAM
1         WALTHAM
2    PORTST.LUCIE
3         WALTHAM
4    PORTST.LUCIE
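The question asked for the result as a new column, and with many locations and stations the nested apply calls can get slow. As a rough sketch of one alternative (not the answer's method), NumPy broadcasting can compute every location-to-station distance at once with the same metric. The column name NearestStation and the sample dataframes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question (column names are assumed).
locations = pd.DataFrame({
    'Location': ['A', 'B', 'C', 'D', 'E'],
    'Latitude': [39.463744, 39.029252, 33.626946, 49.286337, 37.071777],
    'Longitude': [-76.119411, -76.964251, -85.969576, 10.567013, -76.360785],
})
weather = pd.DataFrame({
    'StationName': ['PORT ST. LUCIE 4.0 NE', 'LAKEWAY 2.8 W', 'WALTHAM'],
    'Latitude': [27.3237, 30.3597, 44.6917],
    'Longitude': [-80.3111, -98.0252, -68.3475],
})

# Shape (n_locations, 1) minus shape (1, n_stations) broadcasts to
# an (n_locations, n_stations) matrix of differences.
dlat = locations['Latitude'].to_numpy()[:, None] - weather['Latitude'].to_numpy()[None, :]
dlong = locations['Longitude'].to_numpy()[:, None] - weather['Longitude'].to_numpy()[None, :]
distances = np.abs(dlat + dlong)  # same metric as dist in the answer

# Position of the closest station for each location (row-wise minimum).
nearest_pos = distances.argmin(axis=1)

# Hypothetical column name for the result.
locations['NearestStation'] = weather['StationName'].to_numpy()[nearest_pos]
print(locations[['Location', 'NearestStation']])
```

On the sample data this picks the same stations as the output above. The distance matrix holds one value per location-station pair, so for very large inputs it may be worth processing the locations in chunks.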
