Applying a function over a groupby object to append a row to each group


Question

I have a fairly large dataset, but for reproducibility, let's say I have the following multi-indexed data frame:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

a
Out[68]: 
                     0                   1
first second                              
bar   one     0.705488 2017-07-02 00:00:00
      one     0.715645 2017-07-02 00:05:00
      two     0.194648 2017-07-02 00:10:00
baz   one     0.129729 2017-07-02 00:15:00
      two     0.449889 2017-07-02 00:20:00
foo   one     0.031531 2017-07-02 00:25:00
      two     0.320757 2017-07-02 00:30:00
      two     0.876243 2017-07-02 00:35:00
qux   one     0.443682 2017-07-02 00:40:00
      two     0.802774 2017-07-02 00:45:00

I want to append the current timestamp as a new row for each group identified by a first-second index combination (e.g. bar-one, bar-two, etc.).

What I do:

A function to append the timestamp to each group:

def myfunction(g, now):
    g.loc[g.shape[0], 1] = now # current timestamp
    return g

Applying the function to a groupby object:

# current timestamp (pd.datetime is not available in recent pandas; pd.Timestamp.now() is the equivalent)
now = pd.Timestamp.now()

a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))

This returns:

               first second         0                       1
first second                                                 
bar   one    0   bar    one  0.705488 2017-07-02 00:00:00.000
             1   bar    one  0.715645 2017-07-02 00:05:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    2   bar    two  0.194648 2017-07-02 00:10:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
baz   one    3   baz    one  0.129729 2017-07-02 00:15:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    4   baz    two  0.449889 2017-07-02 00:20:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
foo   one    5   foo    one  0.031531 2017-07-02 00:25:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    6   foo    two  0.320757 2017-07-02 00:30:00.000
             7   foo    two  0.876243 2017-07-02 00:35:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
qux   one    8   qux    one  0.443682 2017-07-02 00:40:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    9   qux    two  0.802774 2017-07-02 00:45:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442

I can't understand why a new index level has been introduced. However, I can get rid of it and ultimately get what I want:

# drop the extra (third) index level and keep only the two data columns
a = a.reset_index(level=2, drop=True)[[0, 1]]

                     0                       1
first second                                  
bar   one     0.705488 2017-07-02 00:00:00.000
      one     0.715645 2017-07-02 00:05:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.194648 2017-07-02 00:10:00.000
      two          NaN 2017-07-02 02:05:06.442
baz   one     0.129729 2017-07-02 00:15:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.449889 2017-07-02 00:20:00.000
      two          NaN 2017-07-02 02:05:06.442
foo   one     0.031531 2017-07-02 00:25:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.320757 2017-07-02 00:30:00.000
      two     0.876243 2017-07-02 00:35:00.000
      two          NaN 2017-07-02 02:05:06.442
qux   one     0.443682 2017-07-02 00:40:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.802774 2017-07-02 00:45:00.000
      two          NaN 2017-07-02 02:05:06.442
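
A note on the extra index level: `groupby(...).apply` prepends the group keys to the index of whatever each call returns, so the integer row labels left over from `reset_index()` end up as a third level beneath `first` and `second`. One way to avoid this, sketched below under the assumption that grouping directly on the index levels is acceptable, is to pass `group_keys=False` and have the function return rows that already carry the group's own index. The helper name `append_now` is hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

now = pd.Timestamp.now()

def append_now(g, now):
    # Hypothetical helper: build one extra row that reuses the group's own
    # (first, second) index entry, then stack it under the group's rows.
    extra = pd.DataFrame({0: np.nan, 1: now}, index=g.index[:1])
    return pd.concat([g, extra])

# group_keys=False stops apply from prepending the group keys again,
# so the result keeps the original two index levels.
out = a.groupby(level=['first', 'second'], group_keys=False).apply(append_now, now)
print(out)
```

With this variant no `reset_index` or column cleanup is needed afterwards, because each group's result is returned with the same two-level index it came in with.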

Question:

I'm wondering if there's a more elegant, more idiomatic pandas way of doing this (appending a new row to each group and, although not shown here, conditionally populating the fields of that new row other than the timestamp).

Recommended answer

You can group by the index levels first, build the additional row you need for each group, then concat it back and sort the DataFrame.

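import numpy as np
import pandas as pd

now = pd.Timestamp.now()  # take the timestamp once so every new row matches
# One extra row per (first, second) group: NaN in column 0, `now` in column 1.
# (In current pandas, returning a list from apply(axis=1) yields a Series of
# lists, so the rows are built with the DataFrame constructor instead.)
groups = a.groupby(level=[0, 1]).first().index
new_rows = pd.DataFrame({0: np.nan, 1: now}, index=groups)

pd.concat([a, new_rows]).sort_index()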

Out[538]: 
                     0                          1
first second                                     
bar   one     0.587648 2017-07-02 00:00:00.000000
      one     0.974524 2017-07-02 00:05:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.555171 2017-07-02 00:10:00.000000
      two          NaN 2017-07-02 15:18:57.503371
baz   one     0.832874 2017-07-02 00:15:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.956891 2017-07-02 00:20:00.000000
      two          NaN 2017-07-02 15:18:57.503371
foo   one     0.872959 2017-07-02 00:25:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.056546 2017-07-02 00:30:00.000000
      two     0.359184 2017-07-02 00:35:00.000000
      two          NaN 2017-07-02 15:18:57.503371
qux   one     0.301327 2017-07-02 00:40:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.891815 2017-07-02 00:45:00.000000
      two          NaN 2017-07-02 15:18:57.503371
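
The question also mentions conditionally populating the other fields of the new row. The same group-then-concat idea can be extended for that, as in the minimal sketch below. The fill rule used here (carry the group mean of column 0 into the new row only when the group has more than one row) is purely an assumption for illustration, and the deterministic values in column 0 replace the random ones only so the result is easy to check.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
a = pd.DataFrame(np.arange(10, dtype=float), index=index)  # 0.0 .. 9.0 for readability
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

now = pd.Timestamp.now()
grouped = a.groupby(level=['first', 'second'])

# Hypothetical rule: use the group mean of column 0 for groups with more than
# one row, and leave NaN for single-row groups.
sizes = grouped.size()
means = grouped[0].mean()
new_rows = pd.DataFrame({0: means.where(sizes > 1), 1: now})

result = pd.concat([a, new_rows]).sort_index()
print(result)
```

Any other per-group rule can be swapped in by replacing the `means.where(...)` expression with a Series indexed by the group keys. If the position of the new row inside each group matters, sorting on the timestamp column as well is probably safer than relying on `sort_index` alone.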
