Pandas 填充组内缺失的日期和值 [英] Pandas filling missing dates and values within group

查看:36
本文介绍了Pandas 填充组内缺失的日期和值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个如下所示的数据框

I've a data frame that looks like the following

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})

我希望能够做的是在日期列中找到最小和最大日期并扩展该列以包含所有日期,同时为 填写 0val 列.所以期望的输出是

What I would like to be able to do is find the minimum and maximum date within the date column and expand that column to have all the dates there while simultaneously filling in 0 for the val column. So the desired output is

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

我已经尝试了此处此处 但它们不是我所追求的.非常感谢任何指点.

I've tried the solution mentioned here and here but they aren't what I'm after. Any pointers much appreciated.

推荐答案

Initial Dataframe:

Initial Dataframe:

            dt  user    val
0   2016-01-01     a      1
1   2016-01-02     a     33
2   2016-01-05     b      2
3   2016-01-06     b      1

首先,将日期转换为日期时间:

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

然后,生成日期和唯一用户:

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)

这将允许您创建一个多索引:

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

您可以使用它来重新索引您的 DataFrame:

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out: 
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

然后可以按用户排序:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out: 
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1

这篇关于Pandas 填充组内缺失的日期和值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆