Filling in date gaps in a MultiIndex pandas DataFrame


Question

I would like to modify a pandas MultiIndex DataFrame so that each index group includes every date in a specified range. Each group should have its missing dates between 2013-06-11 and 2013-12-31 filled in with the value 0 (or NaN).

Group A  Group B  Date         Value
loc_a    group_a  2013-06-11      22
                  2013-07-02      35
                  2013-07-09      14
                  2013-07-30       9
                  2013-08-06       4
                  2013-09-03      40
                  2013-10-01      18
         group_b  2013-07-09       4
                  2013-08-06       2
                  2013-09-03       5
         group_c  2013-07-09       1
                  2013-09-03       2
loc_b    group_a  2013-10-01       3

I've seen a few discussions of reindexing, but they deal with simple (non-grouped) time-series data.

Is there an easy way to do this?

Following are some attempts I've made at accomplishing this. For example, once I've unstacked by ['A', 'B'], I can then reindex.

import datetime as dt
import pandas as pd

df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                'Date': ["2013-06-11",
                        "2013-07-02",
                        "2013-07-09",
                        "2013-07-30",
                        "2013-08-06",
                        "2013-09-03",
                        "2013-10-01",
                        "2013-07-09",
                        "2013-08-06",
                        "2013-09-03",
                        "2013-07-09",
                        "2013-09-03",
                        "2013-10-01"],
                 'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

df.Date = df['Date'].apply(lambda x: pd.to_datetime(x).date())
df = df.set_index(['A', 'B', 'Date'])

dt_start = dt.datetime(2013,6,1)
all_dates = [(dt_start + dt.timedelta(days=x)).date() for x in range(0,60)]

df2 = df.unstack(['A', 'B'])
df3 = df2.reindex(index=all_dates).fillna(0)
df4 = df3.stack(['A', 'B'])

## df4 is about where I want to get, now I'm trying to get it back in the form of df...

df5 = df4.reset_index()
df6 = df5.rename(columns={'level_0' : 'Date'})
df7 = df6.groupby(['A', 'B', 'Date'])['Value'].sum()

The last few lines make me a little sad. I was hoping that at df6 I could simply set_index back to ['A', 'B', 'Date'], but that did not group the values the way they are grouped in the initial df DataFrame.

Any thoughts on how I can reindex the unstacked DataFrame, restack it, and end up with a DataFrame in the same format as the original?

Recommended answer

Your question wasn't clear about exactly which dates you were missing; I'm assuming that you want to fill in NaN for any date on which another group has an observation. My solution will have to be amended if this assumption is faulty.
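If the assumption above does not hold and every group should instead cover the fixed range from the question (2013-06-11 to 2013-12-31), one possible amendment is to reindex against an explicit index of every observed (A, B) pair crossed with every date. This is only a sketch: the shortened sample data and the names all_dates, groups, full_index and filled are illustrative, not part of the original answer.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data (illustrative values).
df = pd.DataFrame({
    "A": ["loc_a", "loc_a", "loc_a", "loc_b"],
    "B": ["group_a", "group_a", "group_b", "group_a"],
    "Date": pd.to_datetime(["2013-06-11", "2013-07-02", "2013-07-09", "2013-10-01"]),
    "Value": [22, 35, 4, 3],
}).set_index(["A", "B", "Date"])

# Every calendar day in the requested range.
all_dates = pd.date_range("2013-06-11", "2013-12-31", freq="D", name="Date")

# Only the (A, B) pairs that actually occur in the data.
groups = df.index.droplevel("Date").unique()

# Cross each observed pair with every date to build the target index.
full_index = pd.MultiIndex.from_tuples(
    [(a, b, d) for (a, b) in groups for d in all_dates],
    names=["A", "B", "Date"],
)

# Rows missing from df are created and filled with 0; existing rows are kept.
filled = df.reindex(full_index, fill_value=0)
print(filled.head())
```

Dropping fill_value=0 would leave NaN in the new rows instead. Because the target index is built group by group, the result keeps the (A, B, Date) layout of the original frame without any unstack/stack round trip.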

Side note: it would be nice to include a line that creates the DataFrame.

In [55]: df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
   ....:                    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
   ....:                    'Date': ["2013-06-11",
   ....:                            "2013-07-02",
   ....:                            "2013-07-09",
   ....:                            "2013-07-30",
   ....:                            "2013-08-06",
   ....:                            "2013-09-03",
   ....:                            "2013-10-01",
   ....:                            "2013-07-09",
   ....:                            "2013-08-06",
   ....:                            "2013-09-03",
   ....:                            "2013-07-09",
   ....:                            "2013-09-03",
   ....:                            "2013-10-01"],
   ....:                     'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

In [56]: df.Date = pd.to_datetime(df.Date)

In [57]: df = df.set_index(['A', 'B', 'Date'])

In [58]: print(df)
                          Value
A     B       Date             
loc_a group_a 2013-06-11     22
              2013-07-02     35
              2013-07-09     14
              2013-07-30      9
              2013-08-06      4
              2013-09-03     40
              2013-10-01     18
      group_b 2013-07-09      4
              2013-08-06      2
              2013-09-03      5
      group_c 2013-07-09      1
              2013-09-03      2
loc_b group_a 2013-10-01      3

To get the unobserved values filled, we'll use the unstack and stack methods. Unstacking will create the NaNs we're interested in, and then we'll stack them back up to work with.

In [71]: df.unstack(['A', 'B'])
Out[71]: 
              Value                           
A             loc_a                      loc_b
B           group_a  group_b  group_c  group_a
Date                                          
2013-06-11       22      NaN      NaN      NaN
2013-07-02       35      NaN      NaN      NaN
2013-07-09       14        4        1      NaN
2013-07-30        9      NaN      NaN      NaN
2013-08-06        4        2      NaN      NaN
2013-09-03       40        5        2      NaN
2013-10-01       18      NaN      NaN        3


In [59]: df.unstack(['A', 'B']).fillna(0).stack(['A', 'B'])
Out[59]: 
                          Value
Date       A     B             
2013-06-11 loc_a group_a     22
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-07-02 loc_a group_a     35
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-07-09 loc_a group_a     14
                 group_b      4
                 group_c      1
           loc_b group_a      0
2013-07-30 loc_a group_a      9
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-08-06 loc_a group_a      4
                 group_b      2
                 group_c      0
           loc_b group_a      0
2013-09-03 loc_a group_a     40
                 group_b      5
                 group_c      2
           loc_b group_a      0
2013-10-01 loc_a group_a     18
                 group_b      0
                 group_c      0
           loc_b group_a      3

Reorder the index levels as necessary.
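As a rough sketch of that reordering step, reorder_levels can move the levels back to (A, B, Date), and sort_index then makes each group's rows contiguous again, which is the grouping the question was missing after a plain set_index. The small stacked frame below is built by hand for illustration and only mimics the (Date, A, B) layout shown above; the names stacked and restored are assumptions.

```python
import pandas as pd

# Hand-built frame indexed (Date, A, B), the level order that stack() leaves behind.
stacked = pd.DataFrame({
    "Date": pd.to_datetime(["2013-07-09", "2013-07-09", "2013-10-01", "2013-10-01"]),
    "A": ["loc_a", "loc_b", "loc_a", "loc_b"],
    "B": ["group_a", "group_a", "group_a", "group_a"],
    "Value": [14, 0, 18, 3],
}).set_index(["Date", "A", "B"])

# Put the levels back in (A, B, Date) order, then sort so that all rows
# of one (A, B) group sit next to each other in date order.
restored = stacked.reorder_levels(["A", "B", "Date"]).sort_index()
print(restored)
```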

I had to slip that fillna(0) into the middle so that the NaNs weren't dropped. stack does have a dropna argument, and I would have thought that setting it to False would keep the all-NaN rows around. A bug, maybe?
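One way to sidestep the question of how stack treats NaN is to avoid creating NaNs in the first place: unstack accepts a fill_value argument, so the holes can be filled as the frame is widened. The following is a minimal sketch on shortened, illustrative data; the names wide and long are assumptions, and how stack handles missing values may differ between pandas versions.

```python
import pandas as pd

# Shortened sample in the same shape as the answer's data (illustrative values).
df = pd.DataFrame({
    "A": ["loc_a", "loc_a", "loc_b"],
    "B": ["group_a", "group_b", "group_a"],
    "Date": pd.to_datetime(["2013-07-09", "2013-08-06", "2013-10-01"]),
    "Value": [14, 2, 3],
}).set_index(["A", "B", "Date"])

# Fill the gaps while unstacking, so no NaN ever reaches stack().
wide = df.unstack(["A", "B"], fill_value=0)

# Stack back to long form; every (Date, A, B) combination of observed
# dates and observed groups is now present, with 0 where nothing was recorded.
long = wide.stack(["A", "B"])
print(long)
```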
