Create New Dataframe Using Multiple Conditions Across Different Timeline and Location

Problem description

I have the following dataframe with a tricky problem:

Disease  State       Month      Value
Covid    Texas     2020-03        2     
Covid    Texas     2020-04        3     
Covid    Texas     2020-05        4      
Covid    Texas     2020-08        3 
Cancer   Florida   2020-04        4     
Covid    Florida   2020-03        6      
Covid    Florida   2020-04        4      
Flu      Florida   2020-03        5         

I have to build a list of values for 3 consecutive months and create a new dataframe. However, there are some conditions:

  1. The list is created for each disease, each month (from start to end: Feb 2020 to Apr 2021), and each state.

  2. If a month is absent from the dataset, a row for that month is created with a value of 0.

Desired output:

Disease State    Month      ValueList
Covid   Texas    2020-02    [0, 2, 3] (no dataset for Feb 20 but next two months are) 
Covid   Texas    2020-03    [2, 3, 4] (has values for 3 consecutive months)
Covid   Texas    2020-04    [3, 4, 0] (doesn’t have value for 6th month)   
Covid   Texas    2020-05    [4, 0, 0] (has value for present month)
Covid   Texas    2020-06    [0, 0, 3] (has value for 8th month)
Covid   Texas    2020-07    [0, 3, 0] (has value for 8th month)
Covid   Texas    2020-08    [3, 0, 0] (has value for present month)
Covid   Texas    2020-09    [0, 0, 0] (no dataset for next 3 months)  
Covid   Texas    2020-10    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-11    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-12    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-01    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-02    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-03    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-04    [0, 0, 0] (no dataset for next 3 months)

I am trying to fill in the dates using this:

df3= (df2.set_index('MonthEnd')
   .groupby(['Disease', 'State']).apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D'))
   .reset_index())

However, it doesn't return the same time frame for each group. It only returns the values between the minimum and maximum dates in that group.
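This behaviour follows from how `asfreq` works: it builds its index from the first to the last timestamp already present in the object it is called on, so each group only spans its own months. A minimal sketch of the difference, assuming a made-up two-row series (the names `s`, `by_asfreq`, `by_reindex`, and `full_range` are illustrative and not from the question):

```python
import pandas as pd

# Hypothetical single group with values in 2020-03 and 2020-05 only.
s = pd.Series(
    [2, 4],
    index=pd.to_datetime(["2020-03-01", "2020-05-01"]),
)

# asfreq only fills between the group's own first and last timestamps.
by_asfreq = s.asfreq("MS")

# reindex against an explicit range gives every group the same span.
full_range = pd.date_range(start="2020-02", end="2021-04", freq="MS")
by_reindex = s.reindex(full_range, fill_value=0)

print(len(by_asfreq))   # 3 rows: March, April, May 2020
print(len(by_reindex))  # 15 rows: February 2020 through April 2021
```

Reindexing against a fixed `date_range` is what lets every disease and state combination cover the same months, regardless of which months it actually has data for.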

I'm not sure how I should start. Any help would be appreciated. Thanks!

Recommended answer

Let's start with the simple logic. You want to create a date range from Feb 2020 to Apr 2021 for each group.

Take each group and add this date range using reindex. Once the date range is in place, fill in the missing data, then apply a rolling window to get 3 consecutive values (the two previous months and the current one) and convert each window into a list.

Assign this list of lists to the ValueList column, then add each modified group to the final dataframe.
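To make the padding step concrete, here is a small sketch of what iterating over a `rolling(3)` object yields. Iterating over a rolling object needs pandas 1.1 or later; the series `values` and the helper `pad_left` are illustrative names, not part of the answer.

```python
import pandas as pd

values = pd.Series([0, 2, 3, 4])

def pad_left(window, size=3):
    # Early windows hold fewer than `size` rows, so pad them with leading zeros.
    return [0] * (size - len(window)) + window

# Each iteration yields a Series with the current row and up to two preceding rows.
raw = [w.to_list() for w in values.rolling(3)]
padded = [pad_left(w) for w in raw]

print(raw)     # [[0], [0, 2], [0, 2, 3], [2, 3, 4]]
print(padded)  # [[0, 0, 0], [0, 0, 2], [0, 2, 3], [2, 3, 4]]
```

The first two windows are shorter than 3 because there are no earlier rows to include, which is why the zero padding is needed.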

Solution:

import pandas as pd

df.Month = pd.to_datetime(df.Month, format="%Y-%m")
df.set_index('Month', inplace=True)

def add_elem(li):
    # Left-pad with zeros when the rolling window has fewer than 3 rows.
    return [0] * (3 - len(li)) + li


start = '2020-02'
end = '2021-04'

groups = []
for _, grp in df.groupby(['Disease', 'State']):
    # Give every group the same monthly index, Feb 2020 through Apr 2021.
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    groups.append(grp)

# DataFrame.append was removed in pandas 2.0, so concatenate the groups instead.
data = pd.concat(groups)


Using apply:

def fill_date(grp):
    # Reindex one group onto the full monthly range and build its rolling lists.
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    return grp

data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)

data:

            Disease  State    Value  ValueList
2020-02-01  Cancer   Florida      0  [0, 0, 0]
2020-03-01  Cancer   Florida      0  [0, 0, 0]
2020-04-01  Cancer   Florida      4  [0, 0, 4]
2020-05-01  Cancer   Florida      0  [0, 4, 0]
2020-06-01  Cancer   Florida      0  [4, 0, 0]
2020-07-01  Cancer   Florida      0  [0, 0, 0]
2020-08-01  Cancer   Florida      0  [0, 0, 0]
2020-09-01  Cancer   Florida      0  [0, 0, 0]
2020-10-01  Cancer   Florida      0  [0, 0, 0]
2020-11-01  Cancer   Florida      0  [0, 0, 0]
2020-12-01  Cancer   Florida      0  [0, 0, 0]
2021-01-01  Cancer   Florida      0  [0, 0, 0]
2021-02-01  Cancer   Florida      0  [0, 0, 0]
2021-03-01  Cancer   Florida      0  [0, 0, 0]
2021-04-01  Cancer   Florida      0  [0, 0, 0]
2020-02-01  Covid    Florida      0  [0, 0, 0]
2020-03-01  Covid    Florida      6  [0, 0, 6]
2020-04-01  Covid    Florida      4  [0, 6, 4]
2020-05-01  Covid    Florida      0  [6, 4, 0]
2020-06-01  Covid    Florida      0  [4, 0, 0]
2020-07-01  Covid    Florida      0  [0, 0, 0]
2020-08-01  Covid    Florida      0  [0, 0, 0]
2020-09-01  Covid    Florida      0  [0, 0, 0]
2020-10-01  Covid    Florida      0  [0, 0, 0]
2020-11-01  Covid    Florida      0  [0, 0, 0]
2020-12-01  Covid    Florida      0  [0, 0, 0]
2021-01-01  Covid    Florida      0  [0, 0, 0]
2021-02-01  Covid    Florida      0  [0, 0, 0]
2021-03-01  Covid    Florida      0  [0, 0, 0]
2021-04-01  Covid    Florida      0  [0, 0, 0]
2020-02-01  Covid    Texas        0  [0, 0, 0]
2020-03-01  Covid    Texas        2  [0, 0, 2]
2020-04-01  Covid    Texas        3  [0, 2, 3]
2020-05-01  Covid    Texas        4  [2, 3, 4]
2020-06-01  Covid    Texas        0  [3, 4, 0]
2020-07-01  Covid    Texas        0  [4, 0, 0]
2020-08-01  Covid    Texas        3  [0, 0, 3]
2020-09-01  Covid    Texas        0  [0, 3, 0]
2020-10-01  Covid    Texas        0  [3, 0, 0]
2020-11-01  Covid    Texas        0  [0, 0, 0]
2020-12-01  Covid    Texas        0  [0, 0, 0]
2021-01-01  Covid    Texas        0  [0, 0, 0]
2021-02-01  Covid    Texas        0  [0, 0, 0]
2021-03-01  Covid    Texas        0  [0, 0, 0]
2021-04-01  Covid    Texas        0  [0, 0, 0]
2020-02-01  Flu      Florida      0  [0, 0, 0]
2020-03-01  Flu      Florida      5  [0, 0, 5]
2020-04-01  Flu      Florida      0  [0, 5, 0]
2020-05-01  Flu      Florida      0  [5, 0, 0]
2020-06-01  Flu      Florida      0  [0, 0, 0]
2020-07-01  Flu      Florida      0  [0, 0, 0]
2020-08-01  Flu      Florida      0  [0, 0, 0]
2020-09-01  Flu      Florida      0  [0, 0, 0]
2020-10-01  Flu      Florida      0  [0, 0, 0]
2020-11-01  Flu      Florida      0  [0, 0, 0]
2020-12-01  Flu      Florida      0  [0, 0, 0]
2021-01-01  Flu      Florida      0  [0, 0, 0]
2021-02-01  Flu      Florida      0  [0, 0, 0]
2021-03-01  Flu      Florida      0  [0, 0, 0]
2021-04-01  Flu      Florida      0  [0, 0, 0]
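
Note that these windows look backward (the two previous months plus the current month), whereas the desired output in the question lists the current month followed by the next two. One possible way to get forward-looking lists is to reindex each group as before and then combine `shift` calls with negative periods. This is only a sketch under that reading of the question, not part of the original answer; the helper `forward_lists` and the names `months`, `frames`, `result`, and `texas` are assumptions introduced here.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Disease": ["Covid"] * 4 + ["Cancer", "Covid", "Covid", "Flu"],
    "State": ["Texas"] * 4 + ["Florida"] * 4,
    "Month": ["2020-03", "2020-04", "2020-05", "2020-08",
              "2020-04", "2020-03", "2020-04", "2020-03"],
    "Value": [2, 3, 4, 3, 4, 6, 4, 5],
})
df["Month"] = pd.to_datetime(df["Month"], format="%Y-%m")

months = pd.date_range(start="2020-02", end="2021-04", freq="MS")

def forward_lists(values, size=3):
    # Column k holds the value k months ahead; months past the end count as 0.
    cols = [values.shift(-k, fill_value=0) for k in range(size)]
    return pd.concat(cols, axis=1).to_numpy().tolist()

frames = []
for (disease, state), grp in df.groupby(["Disease", "State"]):
    # Same monthly index for every group, with missing months set to 0.
    values = grp.set_index("Month")["Value"].reindex(months, fill_value=0)
    frames.append(pd.DataFrame({
        "Disease": disease,
        "State": state,
        "Month": months.strftime("%Y-%m"),
        "ValueList": forward_lists(values),
    }))

result = pd.concat(frames, ignore_index=True)
texas = result[(result["Disease"] == "Covid") & (result["State"] == "Texas")]
print(texas.head(8).to_string(index=False))
```

For the Covid/Texas group this gives `[0, 2, 3]` for 2020-02 and `[2, 3, 4]` for 2020-03, in line with the first rows of the desired output in the question.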
