Pandas: grouping and aggregation with multiple functions


Question

I have a pandas dataframe defined as follows:

import pandas as pd

headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation']
data = [
    ['A', 1, 'x', 1.40, 0.59],
    ['A', 1, 'y', 9.19, 0.52],
    ['A', 2, 'x', 8.82, 0.80],
    ['A', 2, 'y', 7.18, 0.41],
    ['B', 1, 'x', 1.38, 0.22],
    ['B', 1, 'y', 7.14, 0.10],
    ['B', 2, 'x', 9.12, 0.28],
    ['B', 2, 'y', 4.11, 0.97],
]
df = pd.DataFrame(data, columns=headers)

which looks like this in console output:

  Group  Element Case  Score  Evaluation
0     A        1    x   1.40        0.59
1     A        1    y   9.19        0.52
2     A        2    x   8.82        0.80
3     A        2    y   7.18        0.41
4     B        1    x   1.38        0.22
5     B        1    y   7.14        0.10
6     B        2    x   9.12        0.28
7     B        2    y   4.11        0.97



Problem

I'd like to perform a grouping-and-aggregation operation on df that will give me the following result dataframe:

  Group  Max_score_value  Max_score_element  Max_score_case  Min_evaluation
0     A             9.19                  1               y            0.41 
1     B             9.12                  2               x            0.10

To clarify in more detail: I'd like to group by the Group column, and then apply aggregation to get the following result columns:


  • Max_score_value: the group-maximum value from the Score column.
  • Max_score_element: the value from the Element column that corresponds to the group-maximum Score value.
  • Max_score_case: the value from the Case column that corresponds to the group-maximum Score value.
  • Min_evaluation: the group-minimum value from the Evaluation column.

I've come up with the following code for the grouping-and-aggregation:

result = (
    df.set_index(['Element', 'Case'])
    .groupby('Group')
    .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
    .reset_index()
)
print(result)

The output is:

  Group Score         Evaluation
          max  idxmax        min
0     A  9.19  (1, y)       0.41
1     B  9.12  (2, x)       0.10

As you can see, the basic data is there, but it is not yet in the format I need. This last step is the one I'm struggling with. Does anyone have a good idea for generating a result dataframe in the format I'm looking for?

Recommended answer

Starting from the result dataframe in the question, you can reach the required format in two steps:

# collapse the two-level column index into single-level names
result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]

# split the idxmax tuples (Element, Case) into two columns;
# drop(columns=...) replaces the positional axis argument, which recent pandas no longer accepts
result = result.assign(
    max_score_element=result.idxmax_Score.str[0],
    max_score_case=result.idxmax_Score.str[1],
).drop(columns='idxmax_Score')

result

#  Group  max_Score  min_Evaluation  max_score_element  max_score_case
#0     A       9.19            0.41                  1               y
#1     B       9.12            0.10                  2               x
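
If the exact column names and column order from the question are required, named aggregation can produce them without the flattening step. The following is a minimal sketch, assuming pandas 0.25 or later (where keyword-based named aggregation is available). The temporary column name idx is an arbitrary choice for illustration and is not part of the original answer.

```python
import pandas as pd

headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation']
data = [
    ['A', 1, 'x', 1.40, 0.59],
    ['A', 1, 'y', 9.19, 0.52],
    ['A', 2, 'x', 8.82, 0.80],
    ['A', 2, 'y', 7.18, 0.41],
    ['B', 1, 'x', 1.38, 0.22],
    ['B', 1, 'y', 7.14, 0.10],
    ['B', 2, 'x', 9.12, 0.28],
    ['B', 2, 'y', 4.11, 0.97],
]
df = pd.DataFrame(data, columns=headers)

# Named aggregation: each keyword becomes an output column.
# 'idx' is a temporary, illustrative name holding the (Element, Case)
# index label of the row with the group-maximum Score.
result = (
    df.set_index(['Element', 'Case'])
    .groupby('Group')
    .agg(
        Max_score_value=('Score', 'max'),
        idx=('Score', 'idxmax'),
        Min_evaluation=('Evaluation', 'min'),
    )
    .reset_index()
)

# Unpack the (Element, Case) tuples into two separate columns.
elements, cases = zip(*result.pop('idx'))
result['Max_score_element'] = list(elements)
result['Max_score_case'] = list(cases)

# Put the columns in the order requested in the question.
result = result[['Group', 'Max_score_value', 'Max_score_element',
                 'Max_score_case', 'Min_evaluation']]
print(result)
```

Naming the output columns inside agg avoids the two-level column index entirely, so only the tuple unpacking remains as a separate step.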






An alternative, similar to @tarashypka's idea, starts from the original df and uses join. It may not be as efficient, but it is less verbose:

(df.groupby('Group')
   .agg({'Score': 'idxmax', 'Evaluation': 'min'})
   .set_index('Score')
   .join(df.drop(columns='Evaluation'))
   .reset_index(drop=True))

#Evaluation  Group  Element   Case  Score
#0     0.41      A        1      y   9.19
#1     0.10      B        2      x   9.12
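
The join output above holds the right values but still uses the original column names. A minimal sketch of the remaining step, renaming and reordering to match the layout requested in the question, could look like this. The rename mapping and the target_columns list are written out here for illustration, taken from the question's desired result.

```python
import pandas as pd

headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation']
data = [
    ['A', 1, 'x', 1.40, 0.59],
    ['A', 1, 'y', 9.19, 0.52],
    ['A', 2, 'x', 8.82, 0.80],
    ['A', 2, 'y', 7.18, 0.41],
    ['B', 1, 'x', 1.38, 0.22],
    ['B', 1, 'y', 7.14, 0.10],
    ['B', 2, 'x', 9.12, 0.28],
    ['B', 2, 'y', 4.11, 0.97],
]
df = pd.DataFrame(data, columns=headers)

# Column layout requested in the question.
target_columns = ['Group', 'Max_score_value', 'Max_score_element',
                  'Max_score_case', 'Min_evaluation']

result = (
    df.groupby('Group')
    # idxmax returns the row label of the max Score per group,
    # min returns the smallest Evaluation per group.
    .agg({'Score': 'idxmax', 'Evaluation': 'min'})
    # Use the row label as the index so it can be joined back to df.
    .set_index('Score')
    .join(df.drop(columns='Evaluation'))
    .reset_index(drop=True)
    .rename(columns={
        'Score': 'Max_score_value',
        'Element': 'Max_score_element',
        'Case': 'Max_score_case',
        'Evaluation': 'Min_evaluation',
    })
    .loc[:, target_columns]
)
print(result)
```

This relies on df having a unique index, since idxmax returns index labels that are then matched against df in the join.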






Naive timing with the example data set:

%%timeit 
(df.groupby('Group')
 .agg({'Score': 'idxmax', 'Evaluation': 'min'})
 .set_index('Score')
 .join(df.drop(columns='Evaluation'))
 .reset_index(drop=True))
# 100 loops, best of 3: 3.47 ms per loop

%%timeit
result = (
    df.set_index(['Element', 'Case'])
    .groupby('Group')
    .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
    .reset_index()
)

result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]

result = result.assign(
    max_score_element=result.idxmax_Score.str[0],
    max_score_case=result.idxmax_Score.str[1],
).drop(columns='idxmax_Score')
# 100 loops, best of 3: 7.61 ms per loop
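
The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be made in a plain Python script with the standard timeit module. The function names via_join and via_flatten and the loop counts are illustrative choices, and absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

headers = ['Group', 'Element', 'Case', 'Score', 'Evaluation']
data = [
    ['A', 1, 'x', 1.40, 0.59],
    ['A', 1, 'y', 9.19, 0.52],
    ['A', 2, 'x', 8.82, 0.80],
    ['A', 2, 'y', 7.18, 0.41],
    ['B', 1, 'x', 1.38, 0.22],
    ['B', 1, 'y', 7.14, 0.10],
    ['B', 2, 'x', 9.12, 0.28],
    ['B', 2, 'y', 4.11, 0.97],
]
df = pd.DataFrame(data, columns=headers)


def via_join():
    """Join-based approach from the answer."""
    return (
        df.groupby('Group')
        .agg({'Score': 'idxmax', 'Evaluation': 'min'})
        .set_index('Score')
        .join(df.drop(columns='Evaluation'))
        .reset_index(drop=True)
    )


def via_flatten():
    """Aggregate with a two-level column index, then flatten and split."""
    result = (
        df.set_index(['Element', 'Case'])
        .groupby('Group')
        .agg({'Score': ['max', 'idxmax'], 'Evaluation': 'min'})
        .reset_index()
    )
    result.columns = [y + '_' + x if y != '' else x for x, y in result.columns]
    return result.assign(
        max_score_element=result.idxmax_Score.str[0],
        max_score_case=result.idxmax_Score.str[1],
    ).drop(columns='idxmax_Score')


# Best of 3 repeats, 20 calls each; report the per-call time in milliseconds.
timings = {}
for func in (via_join, via_flatten):
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    timings[func.__name__] = best
    print(f'{func.__name__}: {best * 1e3:.2f} ms per loop')
```

With only eight rows, both measurements are dominated by fixed pandas overhead, so the ranking may change on larger data.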
