Pandas GroupBy.apply method duplicates first group


Problem description

My first SO question: I am confused by the behavior of the apply method of groupby in pandas (0.12.0-4). It appears to apply the function TWICE to the first row of a data frame. For example:

>>> from pandas import Series, DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2

I first check that the groupby function works ok, and it seems to be fine:

>>> for group in df.groupby('class', group_keys=True):
...     print(group)
('A',   class  count
0     A      1)
('B',   class  count
1     B      0)
('C',   class  count
2     C      2)

Then I try to do something similar using apply on the groupby object and I get the first row output twice:

>>> def checkit(group):
...     print(group)
>>> df.groupby('class', group_keys=True).apply(checkit)
  class  count
0     A      1
  class  count
0     A      1
  class  count
1     B      0
  class  count
2     C      2

Any help would be appreciated! Thanks.

Edit: @Jeff provides the answer below. I am dense and did not understand it immediately, so here is a simple example to show that despite the double printout of the first group in the example above, the apply method operates only once on the first group and does not mutate the original data frame:

>>> def addone(group):
...     group['count'] += 1
...     return group

>>> df.groupby('class', group_keys=True).apply(addone)
>>> print(df)

  class  count
0     A      1
1     B      0
2     C      2

But by assigning the return of the method to a new object, we see that it works as expected:

>>> df2 = df.groupby('class', group_keys=True).apply(addone)
>>> print(df2)

  class  count
0     A      2
1     B      1
2     C      3

Solution

This is by design, as described here and here

The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. To do this, it calls the function (checkit in your case) twice on the first group.
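If you want to check how many times your own pandas version calls the function, a rough sketch like the one below can help. The names `record` and `calls` are made up for illustration and are not part of pandas. The number of calls depends on the pandas release: the 0.12 version in the question evaluates the first group an extra time, and as far as I know later releases no longer do, but treat that as an assumption and check on your own install.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

calls = []  # hypothetical log of every invocation of record()

def record(group):
    # Each group here holds exactly one row, so the first 'count'
    # value is enough to tell which group was passed in.
    calls.append(int(group['count'].iloc[0]))
    return group['count'].sum()

# Selecting the 'count' column explicitly keeps the grouping column
# out of the frames handed to record().
result = df.groupby('class')[['count']].apply(record)

# At least one call per group; any extra entry means pandas evaluated
# a group more than once to work out the result shape.
print(len(calls), calls)
```

The result itself contains one value per group no matter how many times the function ran, which is why the extra call is harmless for functions without side effects.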

Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
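To make the difference between the three alternatives concrete, here is a small sketch. The data frame has an extra 'A' row compared to the one in the question (my own change, so the output shapes are easier to tell apart), and the lambdas are only examples.

```python
import pandas as pd

# Same columns as in the question, plus a second 'A' row (an assumption
# made here so that a group has more than one member).
df = pd.DataFrame({'class': ['A', 'A', 'B', 'C'], 'count': [1, 3, 0, 2]})
counts = df.groupby('class')['count']

# aggregate: reduces each group to a single value.
totals = counts.aggregate('sum')

# transform: returns a result aligned with the original rows.
centered = counts.transform(lambda s: s - s.mean())

# filter: keeps or drops whole groups based on a boolean per group.
kept = df.groupby('class').filter(lambda g: g['count'].sum() > 0)

print(totals)
print(centered)
print(kept)
```

Here `totals` has one row per class, `centered` has as many rows as `df`, and `kept` is `df` without the groups whose counts sum to zero.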

However, if the function you are calling has no side effects, it most likely does not matter that it is called twice on the first group.
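If the function does have side effects (printing, logging, writing files), one possible workaround, which is my own suggestion and not something the answer prescribes, is to iterate over the groups explicitly. The loop body then runs exactly once per group. The `seen` list below is a hypothetical stand-in for whatever side effect you need.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

seen = []  # stand-in for a real side effect such as logging

# Iterating a GroupBy object yields (key, sub-frame) pairs.
for key, group in df.groupby('class'):
    seen.append(key)
    print(group)

print(seen)
```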
