Why do groupby operations behave differently

Problem description

When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.

Say we have a dataframe with the following information:

    Name   Type  ID
0  Book1  ebook   1
1  Book2  paper   2
2  Book3  paper   3
3  Book1  ebook   1
4  Book2  paper   2
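
For reference, a minimal snippet that rebuilds this frame (values taken directly from the table above):

import pandas as pd

# Example data as shown in the printout above
df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID":   [1, 2, 3, 1, 2],
})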

If we do

df.groupby(["Name", "Type"]).sum()  

we get a DataFrame:

             ID
Name  Type     
Book1 ebook   2
Book2 paper   4
Book3 paper   3

which contains a MultiIndex with the columns used in the groupby:

MultiIndex([('Book1', 'ebook'),
            ('Book2', 'paper'),
            ('Book3', 'paper')],
           names=['Name', 'Type'])

and one column called ID.

But if I apply a size() function, the result is a Series:

Name   Type 
Book1  ebook    2
Book2  paper    2
Book3  paper    1
dtype: int64

And finally, if I do a pct_change(), we only get the resulting DataFrame column:

    ID
0   NaN
1   NaN
2   NaN
3   0.0
4   0.0

TL;DR: I want to know why some functions return a Series while others return a DataFrame, as this confused me when dealing with different operations within the same DataFrame.

Recommended answer

The outputs are different because the aggregations are different, and it is mostly the aggregation that controls what is returned. Think of the array equivalent: the data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input:

import numpy as np
np.array([1,2,3]).sum()
#6

np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)

The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything on its own, there's no reason why the same groupby with a different operation needs to return the same type of output (see above).

gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
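
As an illustrative check (the exact repr varies across pandas versions), the groups attribute shows that the groupby has only recorded which rows belong to which group key; nothing has been computed yet:

gp.groups
# e.g. {('Book1', 'ebook'): [0, 3], ('Book2', 'paper'): [1, 4], ('Book3', 'paper'): [2]}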

The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.

gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
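
For comparison, a quick illustrative check: selecting a single column from the same grouping gives the Series flavour.

gp['ID']
#<pandas.core.groupby.generic.SeriesGroupBy object>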

So what happens when you aggregate?

With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses to a single value per group, the return will be a DataFrame whose index is the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have aggregated that too, necessitating the DataFrame output.

gp.sum()
#             ID
#Name  Type     
#Book1 ebook   2
#Book2 paper   4
#Book3 paper   3
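
To make the "multiple columns" point concrete, here is a hedged sketch with a hypothetical second numeric column (Pages is invented for illustration and is not in the original data); sum() now aggregates both columns at once, which is why a DataFrame is needed:

# Add an invented Pages column, then aggregate both numeric columns per group
df.assign(Pages=[100, 200, 300, 100, 200]).groupby(["Name", "Type"]).sum()
#              ID  Pages
#Name  Type
#Book1 ebook   2    200
#Book2 paper   4    400
#Book3 paper   3    600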

On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.

df.groupby(["Name", "Type"])['ID'].sum()
|------- SeriesGroupBy ----------|

#Name   Type 
#Book1  ebook    2
#Book2  paper    4
#Book3  paper    3
#Name: ID, dtype: int64
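
Conversely (a small illustrative sketch, not part of the original answer), selecting the column with a list keeps the DataFrame flavour, so the same aggregation comes back as a one-column DataFrame:

df.groupby(["Name", "Type"])[['ID']].sum()
#             ID
#Name  Type
#Book1 ebook   2
#Book2 paper   4
#Book3 paper   3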

For aggregations that return arrays (like cumsum, pct_change) a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys, because that would make little sense: typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple, as pandas handles all of the alignment:

df['ID_pct_change'] = gp.pct_change()

#    Name   Type  ID  ID_pct_change
#0  Book1  ebook   1            NaN  
#1  Book2  paper   2            NaN   
#2  Book3  paper   3            NaN   
#3  Book1  ebook   1            0.0  # Calculated from row 0 and aligned.
#4  Book2  paper   2            0.0

But what about size? That one is a bit weird. The size of a group is a scalar; it doesn't matter how many columns the group has or whether values in those columns are missing, so whether you pass it a DataFrameGroupBy or a SeriesGroupBy object is irrelevant. As a result, pandas will always return a Series. Again, being a group-level aggregation that returns a scalar, it makes sense to have the return indexed by the unique group keys.

gp.size()
#Name   Type 
#Book1  ebook    2
#Book2  paper    2
#Book3  paper    1
#dtype: int64
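
As a quick illustrative check, calling size on the SeriesGroupBy gives the same counts (the returned Series may carry the column name, depending on the pandas version):

gp['ID'].size()
# Same counts as gp.size() above.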

Finally, for completeness: though aggregations like sum return a single scalar value, it can often be useful to bring those values back to every row for that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring it back to the original DataFrame, the Series/DataFrame is indexed like the original input:

gp.transform('sum')
#   ID
#0   2    # Row 0 is Book1 ebook which has a group sum of 2
#1   4
#2   3
#3   2    # Row 3 is also Book1 ebook which has a group sum of 2
#4   4
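
Putting that to use, a hedged usage sketch (the column name ID_group_sum is invented here) that writes the per-group total back onto every row of the original frame; the result of transform is indexed like df, so plain assignment aligns correctly:

# Broadcast the group total back to every original row
df['ID_group_sum'] = gp['ID'].transform('sum')

df['ID_group_sum']
#0    2
#1    4
#2    3
#3    2
#4    4
#Name: ID_group_sum, dtype: int64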
