Pandas Return Separate DataFrame Values Based on Function
Question
I have two DataFrames: df1 holds the locations of places and df2 holds the locations of stations. I am trying to find a more efficient way to apply a distance function that finds which stations are within a certain range of each place and returns the station's name. If the distance function is a latitude difference of +/- 1, this is my expected outcome:
# df1
   Lat  Long
0   30    31
1   37    48
2   54    62
3   67    63

# df2
   Station_Lat  Station_Long  Station
0           30            32      ABC
1           43            48      DEF
2           84            87      GHI
3           67            62      JKL

# ....Some Code that compares df1 and df2....

# result
   Lat  Long  Station_Lat  Station_Long  Station
    30    31           30            32      ABC
    67    63           67            62      JKL
I have a solution that uses a cartesian product (cross join) to combine both into a single DataFrame and then applies the function to it. This solution works, but my real dataset has millions of rows, which makes the cartesian product very slow.
import pandas as pd

df1 = pd.DataFrame({'Lat': [30, 37, 54, 67],
                    'Long': [31, 48, 62, 63]})
df2 = pd.DataFrame({'Station_Lat': [30, 43, 84, 67],
                    'Station_Long': [32, 48, 87, 62],
                    'Station': ['ABC', 'DEF', 'GHI', 'JKL']})

# creating a 'key' for a cartesian product
df1['key'] = 1
df2['key'] = 1

# creating the cartesian join
df3 = pd.merge(df1, df2, on='key')

# some distance function that returns True or False
# assuming the distance function I want is +/- 1 of two values
def some_distance_func(x, y):
    return x - y >= -1 and x - y <= 1

# applying the function to a column using vectorized approach
# https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c
df3['t_or_f'] = list(map(some_distance_func, df3['Lat'], df3['Station_Lat']))

# result
print(df3.loc[df3['t_or_f']][['Lat', 'Long', 'Station_Lat', 'Station_Long', 'Station']].reset_index(drop=True))
I have also tried a looping approach with iterrows(), but that is slower than the cross join method. Is there a more pythonic/efficient way to achieve what I am looking for?
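As a side note on the cross join described in the question, the sketch below shows the same +/- 1 latitude filter written as a column-wise comparison instead of a Python-level map(). It assumes pandas 1.2 or later, where merge accepts how='cross' and no helper 'key' column is needed; the names pairs and mask are illustrative only. It still materializes every place/station pair, so it does not remove the memory cost of the cartesian product.

```python
import pandas as pd

df1 = pd.DataFrame({'Lat': [30, 37, 54, 67],
                    'Long': [31, 48, 62, 63]})
df2 = pd.DataFrame({'Station_Lat': [30, 43, 84, 67],
                    'Station_Long': [32, 48, 87, 62],
                    'Station': ['ABC', 'DEF', 'GHI', 'JKL']})

# how='cross' (assumes pandas >= 1.2) builds the cartesian product directly
pairs = df1.merge(df2, how='cross')

# Compare whole columns at once: keep pairs whose latitudes differ by at most 1
mask = (pairs['Lat'] - pairs['Station_Lat']).abs() <= 1
result = pairs.loc[mask].reset_index(drop=True)
print(result)
```

On the sample data this keeps the rows for stations ABC and JKL, matching the expected outcome in the question.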
Recommended answer
You can use the pd.cut function to assign each latitude to an interval and then merge the two DataFrames on that interval to obtain the result:
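# build (Lat - 1, Lat + 1) edge pairs for every place, then flatten them into one list of bin edges
# note: pd.cut needs monotonically increasing edges, so this assumes df1['Lat'] is sorted and the ranges do not overlap
bins = [(i - 1, i + 1) for i in df1['Lat']]
bins = [item for subbins in bins for item in subbins]
# label each latitude with the interval it falls in (intervals are right-closed by default)
df1['Interval'] = pd.cut(df1['Lat'], bins=bins)
df2['Interval'] = pd.cut(df2['Station_Lat'], bins=bins)
# merge on the shared column(s), here 'Interval'
pd.merge(df1, df2)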
This solution is slightly faster than yours: 10.2 ms ± 201 µs per loop vs 12.2 ms ± 1.34 ms per loop.
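For much larger inputs, one alternative that avoids building every place/station pair is a range join on a sorted array. This is not the method used in the answer above; it is a minimal sketch that swaps in np.searchsorted, and the helper name range_join and its tol parameter are hypothetical. It assumes the match rule is an inclusive latitude difference of at most tol, as in the question's distance function.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Lat': [30, 37, 54, 67],
                    'Long': [31, 48, 62, 63]})
df2 = pd.DataFrame({'Station_Lat': [30, 43, 84, 67],
                    'Station_Long': [32, 48, 87, 62],
                    'Station': ['ABC', 'DEF', 'GHI', 'JKL']})

def range_join(places, stations, tol=1):
    """Hypothetical helper: pair each place with stations within +/- tol latitude."""
    st = stations.sort_values('Station_Lat').reset_index(drop=True)
    st_lat = st['Station_Lat'].to_numpy()
    lat = places['Lat'].to_numpy()

    # For each place, matching stations occupy the slice [lo, hi) of the sorted latitudes
    lo = np.searchsorted(st_lat, lat - tol, side='left')
    hi = np.searchsorted(st_lat, lat + tol, side='right')
    counts = hi - lo

    # Expand the slices into explicit (place position, station position) pairs
    place_pos = np.repeat(np.arange(len(places)), counts)
    first_pair = np.repeat(np.cumsum(counts) - counts, counts)
    station_pos = np.repeat(lo, counts) + (np.arange(counts.sum()) - first_pair)

    left = places.iloc[place_pos].reset_index(drop=True)
    right = st.iloc[station_pos].reset_index(drop=True)
    return pd.concat([left, right], axis=1)

result = range_join(df1, df2)
print(result)
```

The work here is one sort of the stations plus two binary searches per place, and only the matching pairs are ever materialized. Whether it is faster in practice depends on the data, so it is worth timing against the approaches above on a realistic sample.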