Applying a function over a groupby object to append a row to each group
Problem description
I have a fairly large dataset but for reproducibility, let's say I have the following multi-indexed data frame:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')
a
Out[68]:
                     0                   1
first second
bar   one     0.705488 2017-07-02 00:00:00
      one     0.715645 2017-07-02 00:05:00
      two     0.194648 2017-07-02 00:10:00
baz   one     0.129729 2017-07-02 00:15:00
      two     0.449889 2017-07-02 00:20:00
foo   one     0.031531 2017-07-02 00:25:00
      two     0.320757 2017-07-02 00:30:00
      two     0.876243 2017-07-02 00:35:00
qux   one     0.443682 2017-07-02 00:40:00
      two     0.802774 2017-07-02 00:45:00
I want to append the current timestamp as a new row for each group identified by a first-second index combination (e.g. bar-one, bar-two, etc.).
What I tried:
A function to append the timestamp to each group:
def myfunction(g, now):
    # Assigning to a row label that does not exist yet adds a new row
    g.loc[g.shape[0], 1] = now  # current timestamp
    return g
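To make the effect of that assignment concrete, here is a minimal sketch on a single hypothetical group (the frame `g`, the fixed `now` value, and the helper name `append_timestamp` are illustrative assumptions, not part of the original post). Assigning through `.loc` to a row label that is not in the index enlarges the frame, and every column that is not set becomes NaN. The sketch uses `g.index.max() + 1` as the new label instead of `g.shape[0]`, because `g.shape[0]` could coincide with an existing label (for example a one-row group whose only label is 1), in which case the existing row would be overwritten rather than a new one added.

```python
import pandas as pd

# Hypothetical single group, shaped like one chunk of a.reset_index()
g = pd.DataFrame({
    'first': ['bar', 'bar'],
    'second': ['one', 'one'],
    0: [0.705488, 0.715645],
    1: pd.to_datetime(['2017-07-02 00:00:00', '2017-07-02 00:05:00']),
})
now = pd.Timestamp('2017-07-02 02:05:06')  # fixed stand-in for the current time


def append_timestamp(g, now):
    g = g.copy()                   # do not mutate the caller's frame
    new_label = g.index.max() + 1  # cannot collide with an existing integer label
    g.loc[new_label, 1] = now      # setting with enlargement adds the row
    return g


out = append_timestamp(g, now)
print(out)
```

Only column 1 is filled in the new row; the other columns are left as NaN, which matches the NaN rows in the output shown in the question.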
Applying the function to the groupby object:
# current timestamp (pd.datetime is deprecated and unavailable in recent pandas releases)
now = pd.Timestamp.now()
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
This returns:
               first second         0                       1
first second
bar   one    0   bar    one  0.705488 2017-07-02 00:00:00.000
             1   bar    one  0.715645 2017-07-02 00:05:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    2   bar    two  0.194648 2017-07-02 00:10:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
baz   one    3   baz    one  0.129729 2017-07-02 00:15:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    4   baz    two  0.449889 2017-07-02 00:20:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
foo   one    5   foo    one  0.031531 2017-07-02 00:25:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    6   foo    two  0.320757 2017-07-02 00:30:00.000
             7   foo    two  0.876243 2017-07-02 00:35:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
qux   one    8   qux    one  0.443682 2017-07-02 00:40:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    9   qux    two  0.802774 2017-07-02 00:45:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
I can't understand why a new index level has been introduced. However, I can get rid of it and ultimately get what I want:
# Drop the extra (third) index level and keep only the two data columns
a = a.reset_index(level=2, drop=True)[[0, 1]]
                     0                       1
first second
bar   one     0.705488 2017-07-02 00:00:00.000
      one     0.715645 2017-07-02 00:05:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.194648 2017-07-02 00:10:00.000
      two          NaN 2017-07-02 02:05:06.442
baz   one     0.129729 2017-07-02 00:15:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.449889 2017-07-02 00:20:00.000
      two          NaN 2017-07-02 02:05:06.442
foo   one     0.031531 2017-07-02 00:25:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.320757 2017-07-02 00:30:00.000
      two     0.876243 2017-07-02 00:35:00.000
      two          NaN 2017-07-02 02:05:06.442
qux   one     0.443682 2017-07-02 00:40:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.802774 2017-07-02 00:45:00.000
      two          NaN 2017-07-02 02:05:06.442
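As for where the extra index level comes from: when the applied function returns a frame whose index differs from the group it received, `groupby(...).apply` places the group keys in front of each returned frame's own row labels, so the third level is simply the per-group integer label. The following is a small sketch of that behavior under stated assumptions: the frame `flat`, the fixed `now` value, and the helper `add_row` are hypothetical, and the data columns are selected explicitly with `[[0, 1]]` so the grouping columns are not passed to the function.

```python
import pandas as pd

# Hypothetical flat frame, similar to a.reset_index() in the question
flat = pd.DataFrame({
    'first': ['bar', 'bar', 'baz'],
    'second': ['one', 'one', 'two'],
    0: [0.1, 0.2, 0.3],
    1: pd.date_range('2017-07-02', periods=3, freq='5min'),
})
now = pd.Timestamp('2017-07-02 02:05:06')  # fixed stand-in for the current time


def add_row(g):
    g = g.copy()
    g.loc[g.index.max() + 1, 1] = now  # enlarge the group by one row
    return g


keys = ['first', 'second']
# Default group_keys=True: the keys are prepended to each group's row labels
with_keys = flat.groupby(keys)[[0, 1]].apply(add_row)
# group_keys=False: only the per-group row labels remain
without_keys = flat.groupby(keys, group_keys=False)[[0, 1]].apply(add_row)

print(with_keys.index.nlevels)     # 3
print(without_keys.index.nlevels)  # 1

# Dropping the innermost level leaves the (first, second) index
tidy = with_keys.droplevel(-1)
print(tidy)
```

With the column selection used here, the `group_keys=False` result no longer records which group a row belongs to, so keeping the keys and dropping the last level is the more useful variant for this task.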
Question:
I'm wondering if there's a more elegant, more idiomatic pandas way of doing this (appending a new row to each group and, although not mentioned here, conditionally populating the remaining fields of that new row besides the timestamp field).
Recommended answer
You can group by the index first, build the additional row you need for each group, then concat those rows back onto the original frame and sort the index.
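import datetime as dt

now = dt.datetime.now()
# One placeholder row per (first, second) group, reusing the group's index labels
new_rows = a.groupby(level=[0, 1]).first()
new_rows[0] = np.nan  # value column left empty
new_rows[1] = now     # timestamp column set to the current time
pd.concat([a, new_rows]).sort_index()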
Out[538]:
                     0                          1
first second
bar   one     0.587648 2017-07-02 00:00:00.000000
      one     0.974524 2017-07-02 00:05:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.555171 2017-07-02 00:10:00.000000
      two          NaN 2017-07-02 15:18:57.503371
baz   one     0.832874 2017-07-02 00:15:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.956891 2017-07-02 00:20:00.000000
      two          NaN 2017-07-02 15:18:57.503371
foo   one     0.872959 2017-07-02 00:25:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.056546 2017-07-02 00:30:00.000000
      two     0.359184 2017-07-02 00:35:00.000000
      two          NaN 2017-07-02 15:18:57.503371
qux   one     0.301327 2017-07-02 00:40:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.891815 2017-07-02 00:45:00.000000
      two          NaN 2017-07-02 15:18:57.503371
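The question also asks about conditionally populating the other fields of the new row. One possible way to extend the concat approach, shown here only as a sketch, is to compute a per-group aggregate and mask it with a condition before building the new rows. The small frame, the fixed `now` value, and the rule "carry the group maximum into the new row only if it exceeds 0.5" are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical, deterministic stand-in for the frame in the question
index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'one'), ('bar', 'two'), ('baz', 'one')],
    names=['first', 'second'])
a = pd.DataFrame({0: [0.1, 0.9, 0.3, 0.4],
                  1: pd.date_range('2017-07-02', periods=4, freq='5min')},
                 index=index)
now = pd.Timestamp('2017-07-02 02:05:06')  # fixed stand-in for the current time

# Per-group aggregate that drives the condition
group_max = a.groupby(level=['first', 'second'])[0].max()

# One new row per group: column 0 is filled only where the rule holds,
# column 1 always receives the timestamp
new_rows = pd.DataFrame({
    0: group_max.where(group_max > 0.5),  # NaN where the condition fails
    1: now,
})

result = pd.concat([a, new_rows]).sort_index()
print(result)
```

Because the new rows are built in one vectorized step, no Python function runs per group, and any other condition expressible on per-group aggregates can be substituted for the `where` mask.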