识别满足数据帧中三个条件的两行的组 [英] Identifying groups of two rows that satisfy three conditions in a dataframe

查看:89
本文介绍了识别满足数据帧中三个条件的两行的组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在下面有df,想确定满足以下所有条件的任意两个订单:

I have the df below and want to identify any two orders that satisfy all the following condtions:

  1. 皮卡之间的距离小于X英里
  2. 下车距离减去Y英里
  3. 订单创建时间之间的差异少于Z分钟

将使用Haversine导入Haversine来计算每一行的取件差异以及每一行或订单的下车差异.

Would use haversine import haversine to calculate the difference in pickups for each row and difference in dropoffs for each row or order.

我当前拥有的df如下所示:

The df I currently have looks like the following:

  DAY   Order  pickup_lat  pickup_long     dropoff_lat dropoff_long  created_time
 1/3/19  234e    32.69        -117.1          32.63      -117.08   3/1/19 19:00
 1/3/19  235d    40.73        -73.98          40.73       -73.99   3/1/19 23:21
 1/3/19  253w    40.76        -73.99          40.76       -73.99   3/1/19 15:26
 2/3/19  231y    36.08        -94.2           36.07       -94.21   3/2/19 0:14
 3/3/19  305g    36.01        -78.92          36.01       -78.95   3/2/19 0:09
 3/3/19  328s    36.76        -119.83         36.74       -119.79  3/2/19 4:33
 3/3/19  286n    35.76        -78.78          35.78       -78.74   3/2/19 0:43

我希望我的输出df为满足上述条件的任意2个订单或行.我不确定的是如何计算数据帧中的每一行,以返回满足这些条件的任何两行.

I want my output df to be any 2 orders or rows that satisfy the above conditions. What I am not sure of is how to calculate that for each row in the dataframe to return any two rows that satisfy those condtions.

我希望我能正确解释所需的输出.感谢您的光临!

I hope I am explaining my desired output correctly. Thanks for looking!

推荐答案

我不知道这是否是最佳解决方案,但我没有提出其他建议.我所做的:

I don't know if it is an optimal solution, but I didn't come up with something different. What I have done:

  • 创建具有所有可能订单组合的数据框,
  • 计算了所有需要的度量,对于所有组合,我将这些度量列添加到了数据框,
  • 找到满足上述条件的行的索引.

代码:

#create dataframe with all combination 
from itertools import combinations

index_comb = list(combinations(trips.index, 2))#trip, your dataframe
col_names = trips.columns
orders1= pd.DataFrame([trips.loc[c[0],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2= pd.DataFrame([trips.loc[c[1],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1,orders2],axis=1)

from haversine import haversine

def distance(row):
    loc_0 = (row[0],row[1]) # (lat, lon)
    loc_1 = (row[2],row[3])
    return haversine(loc_0,loc_1,unit='mi')

#pickup diff
pickup_cols = ["pickup_long","pickup_lat","pickup_long_1","pickup_lat_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)

#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)

#creation time diff
combined["time_diff_min"] = abs(pd.to_datetime(combined["created_time"])-pd.to_datetime(combined["created_time_1"])).astype('timedelta64[m]')

#Thresholds
Z = 600
Y = 400
X = 400

#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X =  combined["pickup_dist_mi"]<X
dropoff_dist_Y =  combined["dropoff_dist_mi"]<Y
contitions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[contitions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]

数据输出:

        Order Order_1  time_diff_min  dropoff_dist_mi  pickup_dist_mi
(0, 5)  234e    328s          573.0       322.988195      231.300179
(1, 2)  235d    253w          475.0         2.072803        0.896893
(4, 6)  305g    286n           34.0        19.766096       10.233550

希望我能很好地理解您,这会有所帮助.

Hope I understand you well and that will help.

这篇关于识别满足数据帧中三个条件的两行的组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆