Pandas filling missing dates and values within group
Problem description
I have a data frame that looks like the following:
import pandas as pd
x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})
What I would like to do is find the minimum and maximum date within the date column and expand that column to contain every date in between, while filling in 0 for the val column. So the desired output is:
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-03 a 0
3 2016-01-04 a 0
4 2016-01-05 a 0
5 2016-01-06 a 0
6 2016-01-01 b 0
7 2016-01-02 b 0
8 2016-01-03 b 0
9 2016-01-04 b 0
10 2016-01-05 b 2
11 2016-01-06 b 1
I've tried the solutions mentioned here and here, but they aren't what I'm after. Any pointers would be much appreciated.
Recommended answer
Initial data frame:
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-05 b 2
3 2016-01-06 b 1
First, convert the dates to datetime:
x['dt'] = pd.to_datetime(x['dt'])
Then generate the full daily range of dates and the unique users:
dates = x.set_index('dt').resample('D').asfreq().index
>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', name='dt', freq='D')
users = x['user'].unique()
>> array(['a', 'b'], dtype=object)
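As an alternative sketch (not part of the original answer), the same daily range can be built with pd.date_range from the overall minimum and maximum of the dt column. This avoids putting dt in the index, which may matter if several users share a date, since resample(...).asfreq() is, as far as I know, not guaranteed to work on an index with duplicate labels.

```python
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'val': [1, 33, 2, 1],
})
x['dt'] = pd.to_datetime(x['dt'])

# Daily range spanning the earliest to the latest date across all users.
dates = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D', name='dt')

# Unique users, in order of first appearance.
users = x['user'].unique()

print(dates)
print(users)
```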
With the dates and users in hand, you can create a MultiIndex:
idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])
>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['dt', 'user'])
You can use that to reindex your DataFrame:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out:
dt user val
0 2016-01-01 a 1
1 2016-01-01 b 0
2 2016-01-02 a 33
3 2016-01-02 b 0
4 2016-01-03 a 0
5 2016-01-03 b 0
6 2016-01-04 a 0
7 2016-01-04 b 0
8 2016-01-05 a 0
9 2016-01-05 b 2
10 2016-01-06 a 0
11 2016-01-06 b 1
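To illustrate why fill_value=0 is passed to reindex, the sketch below compares the result with and without it. It rebuilds the same data and index so that it runs on its own; the names unfilled and filled are only for illustration. Without fill_value the new rows hold NaN, which forces val to a float dtype; with fill_value=0 the column should stay integer.

```python
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06']),
    'val': [1, 33, 2, 1],
})
dates = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D')
users = x['user'].unique()
idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

indexed = x.set_index(['dt', 'user'])

# No fill_value: (dt, user) pairs absent from the data become NaN.
unfilled = indexed.reindex(idx).reset_index()

# fill_value=0: the absent pairs get 0 instead.
filled = indexed.reindex(idx, fill_value=0).reset_index()

print(unfilled['val'].dtype, int(unfilled['val'].isna().sum()))
print(filled['val'].dtype, int(filled['val'].sum()))
```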
The reindexed DataFrame can then be sorted by user:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by=['user', 'dt'])
Out:
dt user val
0 2016-01-01 a 1
2 2016-01-02 a 33
4 2016-01-03 a 0
6 2016-01-04 a 0
8 2016-01-05 a 0
10 2016-01-06 a 0
1 2016-01-01 b 0
3 2016-01-02 b 0
5 2016-01-03 b 0
7 2016-01-04 b 0
9 2016-01-05 b 2
11 2016-01-06 b 1
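The steps above can be collected into one function. This is a minimal sketch, not the answer's own code: fill_missing_dates and its parameter names are hypothetical, and it assumes each (user, date) pair appears at most once, because reindex does not accept duplicate index labels. Putting the users first in the product yields rows already grouped by user and ordered by date, so no separate sort is needed.

```python
import pandas as pd


def fill_missing_dates(df, date_col='dt', group_col='user', fill_value=0):
    """Hypothetical helper: one row per (group, day) over the overall date span."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])

    # Every day between the earliest and latest date in the whole frame.
    dates = pd.date_range(out[date_col].min(), out[date_col].max(), freq='D')
    groups = out[group_col].unique()

    # Groups vary slowest, so the result is ordered by group, then by date.
    idx = pd.MultiIndex.from_product((groups, dates), names=[group_col, date_col])

    filled = (
        out.set_index([group_col, date_col])
           .reindex(idx, fill_value=fill_value)
           .reset_index()
    )
    # Restore the column order of the input frame.
    return filled[list(df.columns)]


x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'val': [1, 33, 2, 1],
})
result = fill_missing_dates(x)
print(result)
```

The returned frame has a fresh 0 to 11 index, matching the row numbering in the desired output of the question.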