Combining rows with overlapping time periods in a pandas dataframe


Question

I am researching prescription habits and have large dataframes of sold products.

I am trying to transform purchases of medications into courses of the drugs by calculating how long the product would have lasted and adding a 5-day fudge factor (for compliance, starting delays, etc.) to calculate an end date for each purchase.

I then want to combine prescriptions with overlapping date windows but I'm struggling to find an efficient way to do this. I was hoping a groupby would be possible but I can't figure out how to do this.

I know how to iterate over the dataframe to create a new dataframe with the relevant information, but it is a slow operation and I am hoping I can find a more elegant solution.

ID      start       end         ingredient  days    dose
1000    2018-10-03  2018-10-18  Metron...   10.0    125.00 
1000    2018-10-13  2018-10-25  Metron...   7.0     125.00 
1001    2018-03-08  2018-03-20  Cefalexin   7.0     150.00
1001    2018-09-17  2018-10-05  Cefalexin   13.0    150.00
1002    2018-05-18  2018-05-30  Amoxiclav   7.0     75.00
1002    2018-05-25  2018-06-06  Amoxiclav   7.0     100.00 
1003    2018-07-01  2018-07-16  Amoxiclav   10.0    50.00
1003    2018-07-15  2018-07-30  Amoxiclav   10.0    50.00 
1003    2018-07-25  2018-08-09  Amoxiclav   10.0    50.00 

My expected outcome is as follows:

ID      start       end         ingredient  days    dose
1000    2018-10-03  2018-10-25  Metron...   17.0    125.00
1001    2018-03-08  2018-03-20  Cefalexin   7.0     150.00
1001    2018-09-17  2018-10-05  Cefalexin   13.0    150.00
1002    2018-05-18  2018-05-30  Amoxiclav   7.0     75.00
1002    2018-05-25  2018-06-06  Amoxiclav   7.0     100.00 
1003    2018-07-01  2018-08-05  Amoxiclav   30.0    50.00

1000's second purchase was exactly 10 days in so the end date is the same as their second end date.

1001 did not overlap so remains as they are.

1002 overlaps on start and end dates but had a change in their dose so should not be combined.

1003 had 30 days worth in total. The start date of their final purchase is later than the end date of the first. Their end date should be 35 days after they first made a purchase. This is a negotiable criterion and an end date matching the final purchase's end date would be acceptable.

Am I barking up the wrong tree here? Must this be done iteratively?

Recommended answer

I think the biggest problem here is identifying when the time intervals overlap; the rest is just grouping and addition.

First, be sure (if not already done) to convert your dates to datetime and your days to timedelta. This makes it possible to compare dates and durations and to do arithmetic on them.

import pandas as pd

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['days'] = pd.to_timedelta(df['days'], unit='D')
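
As a small illustration of why the conversion helps, the sketch below rebuilds the end dates of the two ID 1000 purchases from the question. It assumes the end date is simply start + days + the 5-day fudge factor, which matches the sample table; the two-row frame is only an excerpt of the question's data.

```python
import pandas as pd

# The two purchases of ID 1000 from the question; dates start out as strings.
df = pd.DataFrame({
    "start": ["2018-10-03", "2018-10-13"],
    "days": [10.0, 7.0],
})
df["start"] = pd.to_datetime(df["start"])
df["days"] = pd.to_timedelta(df["days"], unit="D")

# With datetime and timedelta dtypes the end date is plain column arithmetic:
# start + supply duration + the 5-day fudge factor.
df["end"] = df["start"] + df["days"] + pd.Timedelta(days=5)
print(df)
```

With string or float columns the same addition would either fail or give meaningless results, which is why the conversion comes first.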

This code produces your expected result:

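def join_times(x):
    x = x.copy()  # work on a copy of the subsample
    # One event per interval boundary: +1 at each start, -1 at each end.
    startdf = pd.DataFrame({'time': x['start'], 'what': 1})
    enddf = pd.DataFrame({'time': x['end'], 'what': -1})
    mergdf = pd.concat([startdf, enddf]).sort_values('time')
    # Number of intervals open after each event.
    mergdf['running'] = mergdf['what'].cumsum()
    # A start event that brings the count to 1 opens a new merged window.
    mergdf['newwin'] = mergdf['running'].eq(1) & mergdf['what'].eq(1)
    mergdf['group'] = mergdf['newwin'].cumsum()
    # Copy the window label of each start event back to its row (aligned on the index).
    x['group'] = mergdf['group'].loc[mergdf['what'].eq(1)]
    # 'first' assumes the rows are already sorted by start within the subsample.
    res = x.groupby('group').agg({'days': 'sum', 'start': 'first'})
    # End date = start + total days supplied + 5-day fudge factor.
    res['end'] = res['start'] + res['days'] + pd.to_timedelta(5, unit='D')
    return res

ddf = df.groupby(['ID', 'ingredient', 'dose']).apply(join_times).reset_index().drop('group', axis=1)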

This needs to be explained. As you see, I use groupby to identify the subsamples. Then the job is done by the custom join_times function.

The join_times function puts the start and end times together in a single column ('time') of one dataframe, sorted in chronological order.
A second column, 'what', marks start times with +1 and end times with -1. These are used to keep track of how many intervals are open at any moment (column 'running', computed with cumsum()).
Then a boolean column 'newwin' is built to identify the beginning of each new non-overlapping time interval, and a column 'group' is built to mark with the same integer the rows belonging to the same overlapping time interval.
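
To make the bookkeeping concrete, here is a minimal sketch that builds the same event table for two subsamples from the question: ID 1003 (three overlapping purchases) and ID 1001 (two separate ones). The helper name `label_events` is made up for this illustration; the column names follow the answer's code.

```python
import pandas as pd

def label_events(x):
    """Return the sorted start/end events with the bookkeeping columns."""
    events = pd.concat([
        pd.DataFrame({"time": x["start"], "what": 1}),   # +1 opens an interval
        pd.DataFrame({"time": x["end"], "what": -1}),    # -1 closes an interval
    ]).sort_values("time")
    events["running"] = events["what"].cumsum()          # intervals currently open
    events["newwin"] = events["running"].eq(1) & events["what"].eq(1)
    events["group"] = events["newwin"].cumsum()          # label of the merged window
    return events

id_1003 = pd.DataFrame({
    "start": pd.to_datetime(["2018-07-01", "2018-07-15", "2018-07-25"]),
    "end": pd.to_datetime(["2018-07-16", "2018-07-30", "2018-08-09"]),
})
id_1001 = pd.DataFrame({
    "start": pd.to_datetime(["2018-03-08", "2018-09-17"]),
    "end": pd.to_datetime(["2018-03-20", "2018-10-05"]),
})

overlapping = label_events(id_1003)
separate = label_events(id_1001)
print(overlapping)
print(separate)
```

For ID 1003 the running count never drops to 0 between the first start and the last end, so every event gets group 1. For ID 1001 the count returns to 0 after the first purchase, so the second start opens group 2.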

Then a 'group' column is added to the original subsample by copying the values from the previously built 'group' column. At this point we can identify, for each subsample, which rows overlap.
So we can use groupby again and sum the 'days' column, keeping the first date from the 'start' column.
The 'end' column is calculated by adding the duration 'days' plus 5 days to 'start'.
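
The aggregation step on its own can be sketched as follows, using the three ID 1003 purchases. The 'group' labels are hard-coded here as an assumption, standing in for the labels the interval-counting step assigns to these rows.

```python
import pandas as pd

# ID 1003 subsample; 'group' is hard-coded for illustration (all three
# purchases overlap, so they share one label).
x = pd.DataFrame({
    "start": pd.to_datetime(["2018-07-01", "2018-07-15", "2018-07-25"]),
    "days": pd.to_timedelta([10.0, 10.0, 10.0], unit="D"),
    "group": [1, 1, 1],
})

# Sum the supply per merged window and keep the earliest start
# ('first' relies on the rows being sorted by start).
res = x.groupby("group").agg({"days": "sum", "start": "first"})

# End date = start + total days supplied + 5-day fudge factor.
res["end"] = res["start"] + res["days"] + pd.Timedelta(days=5)
print(res)
```

The single resulting row covers 30 days starting on 2018-07-01 and ends on 2018-08-05, matching the expected outcome in the question.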

The above code, using your data sample, gives:

     ID ingredient   dose    days      start        end
0  1000  Metron...  125.0 17 days 2018-10-03 2018-10-25
1  1001  Cefalexin  150.0  7 days 2018-03-08 2018-03-20
2  1001  Cefalexin  150.0 13 days 2018-09-17 2018-10-05
3  1002  Amoxiclav   75.0  7 days 2018-05-18 2018-05-30
4  1002  Amoxiclav  100.0  7 days 2018-05-25 2018-06-06
5  1003  Amoxiclav   50.0 30 days 2018-07-01 2018-08-05

This is your expected result. The column order is different because the groupby keys are moved into the index and then restored as leading columns by reset_index().
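
For comparison, here is a sketch of a different technique that avoids a custom function in groupby().apply(). It is not the method used in this answer: it compares each start date with the running maximum of the earlier end dates in the same ID/ingredient/dose group. It assumes that touching intervals (start equal to an earlier end) should also be merged, and it uses the 'end' column from the question, fudge factor included, to decide overlap. The names `keys`, `prev_end`, `window` and `out` are illustrative only.

```python
import pandas as pd

# Sample data from the question ("Metron..." is kept as it appears there).
df = pd.DataFrame({
    "ID": [1000, 1000, 1001, 1001, 1002, 1002, 1003, 1003, 1003],
    "start": ["2018-10-03", "2018-10-13", "2018-03-08", "2018-09-17",
              "2018-05-18", "2018-05-25", "2018-07-01", "2018-07-15", "2018-07-25"],
    "end": ["2018-10-18", "2018-10-25", "2018-03-20", "2018-10-05",
            "2018-05-30", "2018-06-06", "2018-07-16", "2018-07-30", "2018-08-09"],
    "ingredient": ["Metron...", "Metron...", "Cefalexin", "Cefalexin",
                   "Amoxiclav", "Amoxiclav", "Amoxiclav", "Amoxiclav", "Amoxiclav"],
    "days": [10.0, 7.0, 7.0, 13.0, 7.0, 7.0, 10.0, 10.0, 10.0],
    "dose": [125.0, 125.0, 150.0, 150.0, 75.0, 100.0, 50.0, 50.0, 50.0],
})
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["days"] = pd.to_timedelta(df["days"], unit="D")

keys = ["ID", "ingredient", "dose"]
df = df.sort_values(keys + ["start"]).reset_index(drop=True)

# Latest end date among the earlier rows of the same key group
# (NaT for the first row of each group).
prev_end = df.groupby(keys)["end"].transform(lambda s: s.cummax().shift())

# A row opens a new window unless it starts on or before that latest end.
# Comparisons with NaT are False, so the first row of a group always opens one.
new_window = ~(df["start"] <= prev_end)
df["window"] = new_window.cumsum()

out = (df.groupby(keys + ["window"], as_index=False)
         .agg(start=("start", "min"), days=("days", "sum")))
out["end"] = out["start"] + out["days"] + pd.Timedelta(days=5)

# Restore the column order used in the question.
out = out[["ID", "start", "end", "ingredient", "days", "dose"]]
print(out)
```

On the sample data this gives the same six rows as the expected outcome. Whether it is faster than the apply-based version on large frames would need to be measured on the real data.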
