识别满足数据帧中三个条件的两行的组 [英] Identifying groups of two rows that satisfy three conditions in a dataframe
问题描述
我在下面有df,想确定满足以下所有条件的任意两个订单:
I have the df below and want to identify any two orders that satisfy all the following condtions:
- 皮卡之间的距离小于X英里
- 下车距离减去Y英里
- 订单创建时间之间的差异少于Z分钟
将使用Haversine导入Haversine来计算每一行的取件差异以及每一行或订单的下车差异.
Would use haversine import haversine to calculate the difference in pickups for each row and difference in dropoffs for each row or order.
我当前拥有的df如下所示:
The df I currently have looks like the following:
DAY Order pickup_lat pickup_long dropoff_lat dropoff_long created_time
1/3/19 234e 32.69 -117.1 32.63 -117.08 3/1/19 19:00
1/3/19 235d 40.73 -73.98 40.73 -73.99 3/1/19 23:21
1/3/19 253w 40.76 -73.99 40.76 -73.99 3/1/19 15:26
2/3/19 231y 36.08 -94.2 36.07 -94.21 3/2/19 0:14
3/3/19 305g 36.01 -78.92 36.01 -78.95 3/2/19 0:09
3/3/19 328s 36.76 -119.83 36.74 -119.79 3/2/19 4:33
3/3/19 286n 35.76 -78.78 35.78 -78.74 3/2/19 0:43
我希望我的输出df为满足上述条件的任意2个订单或行.我不确定的是如何计算数据帧中的每一行,以返回满足这些条件的任何两行.
I want my output df to be any 2 orders or rows that satisfy the above conditions. What I am not sure of is how to calculate that for each row in the dataframe to return any two rows that satisfy those condtions.
我希望我能正确解释所需的输出.感谢您的光临!
I hope I am explaining my desired output correctly. Thanks for looking!
推荐答案
我不知道这是否是最佳解决方案,但我没有提出其他建议.我所做的:
I don't know if it is an optimal solution, but I didn't come up with something different. What I have done:
- 创建具有所有可能订单组合的数据框,
- 计算了所有需要的度量,对于所有组合,我将这些度量列添加到了数据框,
- 找到满足上述条件的行的索引.
代码:
#create dataframe with all combination
from itertools import combinations
index_comb = list(combinations(trips.index, 2))#trip, your dataframe
col_names = trips.columns
orders1= pd.DataFrame([trips.loc[c[0],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2= pd.DataFrame([trips.loc[c[1],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1,orders2],axis=1)
from haversine import haversine
def distance(row):
loc_0 = (row[0],row[1]) # (lat, lon)
loc_1 = (row[2],row[3])
return haversine(loc_0,loc_1,unit='mi')
#pickup diff
pickup_cols = ["pickup_long","pickup_lat","pickup_long_1","pickup_lat_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)
#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)
#creation time diff
combined["time_diff_min"] = abs(pd.to_datetime(combined["created_time"])-pd.to_datetime(combined["created_time_1"])).astype('timedelta64[m]')
#Thresholds
Z = 600
Y = 400
X = 400
#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X = combined["pickup_dist_mi"]<X
dropoff_dist_Y = combined["dropoff_dist_mi"]<Y
contitions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[contitions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]
数据输出:
Order Order_1 time_diff_min dropoff_dist_mi pickup_dist_mi
(0, 5) 234e 328s 573.0 322.988195 231.300179
(1, 2) 235d 253w 475.0 2.072803 0.896893
(4, 6) 305g 286n 34.0 19.766096 10.233550
希望我能很好地理解您,这会有所帮助.
Hope I understand you well and that will help.
这篇关于识别满足数据帧中三个条件的两行的组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!