Naming returned columns in Pandas aggregate function?


Question


I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't seem to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.

This comes very close, but the data structure returned has nested column headings:

data.groupby("Country").agg(
        {"column1": {"foo": sum}, "column2": {"mean": np.mean, "std": np.std}})

(i.e., I want to take the mean and std of column2, but have those columns returned as "mean" and "std")

What am I missing?

Solution

For pandas >= 0.25

The functionality to name returned aggregate columns was reintroduced in pandas 0.25. The new syntax is .agg(new_col_name=('col_name', 'agg_func')). Detailed example from the PR that added it:

In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                    'height': [9.1, 6.0, 9.5, 34.0],
   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})
   ...:

In [3]: df
Out[3]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby('kind').agg(min_height=('height', 'min'), 
                               max_weight=('weight', 'max'))
Out[4]:
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
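Applied to the situation in the question, the same syntax might look like the sketch below. The data frame contents are hypothetical stand-ins for the asker's `data`; only the column names come from the question. `pd.NamedAgg` is the namedtuple form of the `(column, aggfunc)` pair and is shown once for comparison.

```python
import pandas as pd

# Hypothetical data standing in for the question's `data` frame.
data = pd.DataFrame({
    "Country": ["FR", "FR", "DE", "DE"],
    "column1": [1, 2, 3, 4],
    "column2": [10.0, 20.0, 30.0, 50.0],
})

# Each keyword becomes an output column name; the value says
# which input column to aggregate and with which function.
result = data.groupby("Country").agg(
    foo=("column1", "sum"),
    mean=("column2", "mean"),
    # Equivalent spelling using the NamedAgg namedtuple.
    std=pd.NamedAgg(column="column2", aggfunc="std"),
)

print(result)
```

The result has a single, flat level of column headings (foo, mean, std) rather than the nested headings the question complains about. This requires pandas 0.25 or later.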

Multiple lambda expressions can also be used with this syntax, and with the two-step rename syntax suggested earlier (see the section for pandas < 0.25 below), as per this PR. Again, copying from the example in the PR:

In [2]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})

In [3]: df.groupby("A").agg({'B': [lambda x: 0, lambda x: 1]})
Out[3]:
         B
  <lambda> <lambda 1>
A
a        0          1

and then .rename(), or in one go:

In [4]: df.groupby("A").agg(b=('B', lambda x: 0), c=('B', lambda x: 1))
Out[4]:
   b  c
A
a  0  1
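The two routes can be compared side by side. This is a minimal sketch assuming pandas 0.25 or later; the output names b_min and b_max are made up for illustration, and the two-step variant assigns the column labels directly instead of calling .rename(), because the auto-generated lambda labels differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a"], "B": [1, 2], "C": [3, 4]})

# Two steps: aggregate with a list of lambdas, then replace the
# nested (column, function-name) headings with flat labels.
two_step = df.groupby("A").agg({"B": [lambda x: x.min(), lambda x: x.max()]})
two_step.columns = ["b_min", "b_max"]  # hypothetical names

# One step: named aggregation gives each lambda its output name directly.
one_step = df.groupby("A").agg(
    b_min=("B", lambda x: x.min()),
    b_max=("B", lambda x: x.max()),
)

print(two_step)
print(one_step)
```

Both frames end up with the same flat columns; the one-step form avoids depending on how pandas labels anonymous functions.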


For pandas < 0.25

The currently accepted answer by unutbu describes a great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.

Series:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

import numpy as np
import pandas as pd

# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})

# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
    .agg({'B': 'sum', 'C': 'min'})
    .rename(columns={'B': 'foo', 'C': 'bar'})
)

Please see the 0.20 changelog for additional details.
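The rename approach above covers one function per column. When several functions are applied to one column, the result has the nested column headings mentioned in the question. One common way to flatten them, shown here as a sketch rather than something taken from the changelog, is to join the two levels of each heading; the "_" separator and the resulting names are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": range(5),
                   "C": range(5)})

# A list of functions per column yields two-level column headings:
# ('B', 'sum'), ('B', 'mean'), ('C', 'min')
nested = df.groupby("A").agg({"B": ["sum", "mean"], "C": ["min"]})

# Join each (column, function) pair into a single flat label.
flat = nested.copy()
flat.columns = ["_".join(col) for col in nested.columns]

print(flat)
```

After flattening, a regular .rename(columns=...) call can be used if other names are preferred.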

Update 2017-01-03 in response to @JunkMechanic's comment.

With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:

>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})

    B    
  max min
A        
1   2   0
2   4   3

Multiple functions can also be passed to a single column as a list:

>>> df.groupby('A').agg({'B': [np.min, np.max]})

     B     
  amin amax
A          
1    0    2
2    3    4

However, this does not work with lambda functions, since they are anonymous and all share the name <lambda>, which causes a name collision:

>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max()]})
SpecificationError: Function names must be unique, found multiple named <lambda>

To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:

>>> def my_min(x):
...     return x.min()

>>> def my_max(x):
...     return x.max()

>>> df.groupby('A').agg({'B': [my_min, my_max]})

       B       
  my_min my_max
A              
1      0      2
2      3      4
