pandas groupby-apply behavior, returning a Series (inconsistent output type)

Question

I'm curious about the behavior of pandas groupby-apply when the applied function returns a Series.

When the Series are of different lengths, it returns a multi-indexed Series.

In [1]: import pandas as pd

In [2]: df1=pd.DataFrame({'state':list("AABBB"),
   ...:                 'city':list("vwxyz")})

In [3]: df1
Out[3]:
  city state
0    v     A
1    w     A
2    x     B
3    y     B
4    z     B

In [4]: def f(x):
   ...:         return pd.Series(x['city'].values,index=range(len(x)))
   ...:

In [5]: df1.groupby('state').apply(f)
Out[5]:
state
A      0    v
       1    w
B      0    x
       1    y
       2    z
dtype: object

This returns a Series object.

However, if every Series has the same length, it pivots the result into a DataFrame.

In [6]: df2=pd.DataFrame({'state':list("AAABBB"),
   ...:                 'city':list("uvwxyz")})

In [7]: df2
Out[7]:
  city state
0    u     A
1    v     A
2    w     A
3    x     B
4    y     B
5    z     B

In [8]: df2.groupby('state').apply(f)
Out[8]:
       0  1  2
state
A      u  v  w
B      x  y  z

Is this really the intended behavior? Are we meant to check the return type if we use apply this way? Or is there an option in apply that I'm overlooking?

In case you're curious, in my actual use case the returned Series will be the same length as the group. That seems like an ideal case for transform, except that I've found apply returning a Series to be an order of magnitude faster on a large dataset. That can be another topic.

Loosely based on Parfait's answer, we can certainly do this:

X=df.groupby('state').apply(f)
if not isinstance(X,pd.Series):
    X=X.stack()
X

That gives the same output type for either df=df1 or df=df2. I guess I'm just asking whether this is really the normal or preferred way to handle this.

Recommended answer

In essence, a dataframe consists of equal-length Series (technically, a dictionary-like container of Series objects). As stated in the pandas split-apply-combine docs, running a groupby() refers to one or more of the following:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Notice that this does not say a data frame is always produced, only a generalized data structure. So a groupby() operation can downcast to a Series or, if given a Series as input, can upcast to a dataframe.
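As a rough illustration of that point, here is a minimal sketch. The frame `df` and both lambdas are made up for this example, and the explicit `[['city']]` selection is my own assumption to keep the grouping column out of the applied frame. The type of the combined result follows what the applied function returns:

```python
import pandas as pd

df = pd.DataFrame({'state': list("AAABBB"), 'city': list("uvwxyz")})
grouped = df.groupby('state')[['city']]

# A scalar per group is combined into a Series indexed by state.
counts = grouped.apply(lambda g: len(g))

# A Series with the same labels for every group is combined into a
# DataFrame: one row per group, one column per label.
summary = grouped.apply(
    lambda g: pd.Series({'first': g['city'].iloc[0], 'n': len(g)})
)

print(type(counts).__name__)
print(type(summary).__name__)
```

So the "combine" step picks the container that fits the pieces it is handed, rather than always building a data frame.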

For your first dataframe, the groups are of unequal size, so the Series returned for each group have unequal index lengths and the "combine" step cannot line them up as the rows of a data frame. Because the per-group Series do not fit together that way, it yields a multi-index Series instead. You can see this by adding a print statement to the function: the state == A group has length 2 and the B group has length 3.

def f(x):
    print(x)
    return pd.Series(x['city'].values, index=range(len(x)))

s1 = df1.groupby('state').apply(f)

print(s1)
#   city state
# 0    v     A
# 1    w     A
#   city state
# 0    v     A
# 1    w     A
#   city state
# 2    x     B
# 3    y     B
# 4    z     B
# state   
# A      0    v
#        1    w
# B      0    x
#        1    y
#        2    z
# dtype: object

However, you can manipulate the multi-index Series by resetting its index, which turns the hierarchical levels into columns:

df = df1.groupby('state').apply(f).reset_index()
print(df)

#   state  level_1  0
# 0     A        0  v
# 1     A        1  w
# 2     B        0  x
# 3     B        1  y
# 4     B        2  z

But more relevant to your needs is unstack(), which pivots a level of the index labels and yields a data frame. Consider fillna() to fill the resulting None values.

df = df1.groupby('state').apply(f).unstack()
print(df)

#        0  1     2
# state            
# A      v  w  None
# B      x  y     z
