Aggregation in Pandas


Question

  1. How can I perform aggregation with Pandas?
  2. No DataFrame after aggregation! What happened?
  3. How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?
  4. How can I aggregate counts?
  5. How can I create a new column filled by aggregated values?

I've seen these recurring questions asking about various facets of the pandas aggregate functionality. Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next instalment in a series of helpful user-guides:

Please note that this post is not meant to be a replacement for the documentation about aggregation and about groupby, so please read that as well!

Solution

Question 1

How can I perform aggregation with Pandas?

Expanded aggregation documentation.

Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has the same number of rows as the original, or fewer.

Some common aggregating functions are tabulated below:

Function      Description
mean()        Compute mean of groups
sum()         Compute sum of group values
size()        Compute group sizes
count()       Compute count of group
std()         Standard deviation of groups
var()         Compute variance of groups
sem()         Standard error of the mean of groups
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
nth()         Take nth value, or a subset if n is a list
min()         Compute min of group values
max()         Compute max of group values
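
As a rough illustration of a few of the functions in the table, the sketch below applies them to a small hypothetical frame (the names toy, key and val are made up for this example and are not part of the post's data). It assumes only that pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the examples in this post.
toy = pd.DataFrame({'key': ['x', 'x', 'y', 'y', 'y'],
                    'val': [1.0, 3.0, 2.0, None, 6.0]})

g = toy.groupby('key')['val']

# Each call returns one value per group, so the 5 input rows shrink to 2.
summary = pd.DataFrame({
    'size': g.size(),    # rows per group, missing values included
    'count': g.count(),  # non-missing values per group
    'mean': g.mean(),    # missing values are skipped
    'min': g.min(),
    'max': g.max(),
    'first': g.first(),  # first non-missing value per group
    'last': g.last(),    # last non-missing value per group
})
print(summary)
```

The point to notice is that every column of summary has one row per group, which is what "reducing the dimension" means here.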

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6),
                   'E' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D  E
0  foo    one  2  3  0
1  foo    two  4  1  0
2  bar  three  2  1  1
3  foo    two  1  0  3
4  bar    two  3  1  4
5  foo    one  2  1  0

Aggregation of a selected column with a Cython-implemented function:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

If no column is selected after groupby, the aggregate function is applied to every column that is not a grouping key (here the keys are the A and B columns):

df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

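You can also select only some columns for aggregation by passing a list after the groupby function (note the double brackets; selecting with a bare 'C','D' tuple no longer works in recent pandas versions):

df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()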
print (df3)
     A      B  C  D
0  bar  three  2  1
1  bar    two  3  1
2  foo    one  4  4
3  foo    two  5  1

The same results can be obtained with the function DataFrameGroupBy.agg:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

To apply multiple functions to one column, use a list of tuples, each holding the name of the new column and the aggregate function:

df4 = (df.groupby(['A', 'B'])['C']
         .agg([('average','mean'),('total','sum')])
         .reset_index())
print (df4)
     A      B  average  total
0  bar  three      2.0      2
1  bar    two      3.0      3
2  foo    one      2.0      4
3  foo    two      2.5      5

To apply multiple functions to every column, pass the same list of tuples without selecting a column:

df5 = (df.groupby(['A', 'B'])
         .agg([('average','mean'),('total','sum')]))

print (df5)
                C             D             E
          average total average total average total
A   B
bar three     2.0     2     1.0     1     1.0     1
    two       3.0     3     1.0     1     4.0     4
foo one       2.0     4     2.0     4     0.0     0
    two       2.5     5     0.5     1     1.5     3

The result then has a MultiIndex in its columns:

print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

To flatten the MultiIndex into ordinary column names, use map with join:

df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

Another solution is to pass a list of aggregate functions, then flatten the MultiIndex and use str.replace to get the other column names:

df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

df5.columns = (df5.columns.map('_'.join)
                  .str.replace('sum','total')
                  .str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

To specify a separate aggregate function for each column, pass a dictionary:

df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D':'mean'})
         .rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
     A      B  C_total  D_average
0  bar  three        2        1.0
1  bar    two        3        1.0
2  foo    one        4        2.0
3  foo    two        5        0.5

You can pass a custom function too:

def func(x):
    return x.iat[0] + x.iat[-1]

df7 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D': func})
         .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
     A      B  C_total  D_sum_first_and_last
0  bar  three        2                     2
1  bar    two        3                     2
2  foo    one        4                     4
3  foo    two        5                     1
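
The dictionary-plus-rename pattern above can also be written with pandas' named aggregation (available since pandas 0.25), where each keyword is the output column name and each value is a (column, function) tuple. The sketch below is an alternative under that version assumption, not the approach used in the original answer; the C and D values are copied from the printed frame so the block does not depend on the random seed, and the variable name named is made up for this example.

```python
import pandas as pd

# Values copied from the df printed earlier, so no random seed is needed.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C': [2, 4, 2, 1, 3, 2],
                   'D': [3, 1, 1, 0, 1, 1]})

# Named aggregation: output_name=(input_column, function).
# This sets the final column names directly, so no rename step is needed.
named = (df.groupby(['A', 'B'], as_index=False)
           .agg(C_total=('C', 'sum'),
                D_average=('D', 'mean')))
print(named)
```

The result should match df6 above: columns A, B, C_total and D_average, one row per (A, B) pair.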

Question 2

No DataFrame after aggregation! What happened?

Aggregation by two or more columns:

df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A    B
bar  three    2
     two      3
foo  one      4
     two      5
Name: C, dtype: int32

First check the Index and type of a Pandas object:

print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
           names=['A', 'B'])

print (type(df1))
<class 'pandas.core.series.Series'>

There are two ways to turn the MultiIndex of the Series into columns:

  • add parameter as_index=False

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

  • use Series.reset_index

df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5


If you group by one column:

df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar    5
foo    9
Name: C, dtype: int32

... you get a Series with an Index:

print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')

print (type(df2))
<class 'pandas.core.series.Series'>

And the solution is the same as for the MultiIndex Series:

df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
     A  C
0  bar  5
1  foo  9

df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
     A  C
0  bar  5
1  foo  9
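
A related point, sketched below under the assumption of a reasonably recent pandas: whether you get a Series or a DataFrame depends on how the column is selected. Selecting with a single label gives a Series, while selecting with a one-element list gives a DataFrame. The variable names as_series, as_frame, via_to_frame and flat are made up for this example, and the C values are copied from the printed frame.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# Single label after groupby -> SeriesGroupBy -> result is a Series.
as_series = df.groupby('A')['C'].sum()

# List of labels (double brackets) -> DataFrameGroupBy -> result is a DataFrame.
as_frame = df.groupby('A')[['C']].sum()

# A Series can also be converted after the fact.
via_to_frame = as_series.to_frame()

# In both DataFrame cases A is still the index; reset_index moves it to a column.
flat = as_frame.reset_index()

print(type(as_series))
print(type(as_frame))
print(flat)
```

Checking type(...) and .index as in the steps above is usually the quickest way to see which of these shapes you have.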

Question 3

How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                   'D' : [1,2,3,2,3,1,2]})
print (df)
   A      B      C  D
0  a    one  three  1
1  c    two    one  2
2  b  three    two  3
3  b    two    two  2
4  a    two  three  3
5  c    one    two  1
6  b  three    one  2

Instead of an aggregation function, it is possible to pass list, tuple, set for converting the column:

df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

An alternative is to use GroupBy.apply:

df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

For converting to strings with a separator, use .join only if it is a string column:

df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
   A                B
0  a          one,two
1  b  three,two,three
2  c          two,one

If it is a numeric column, use a lambda function with astype for converting to strings:

df3 = (df.groupby('A')['D']
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

Another solution is converting to strings before groupby:

df3 = (df.assign(D = df['D'].astype(str))
         .groupby('A')['D']
         .agg(','.join).reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

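To convert all columns, don't select any columns after groupby. In the output below there is no column D because of the automatic exclusion of 'nuisance' columns: every numeric column is excluded, since ','.join cannot process it. (This silent exclusion was deprecated in later pandas releases, and pandas 2.0 and newer are expected to raise a TypeError here instead.)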

df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
   A                B            C
0  a          one,two  three,three
1  b  three,two,three  two,two,one
2  c          two,one      one,two

So it's necessary to convert all columns into strings, and then get all columns:

df5 = (df.groupby('A')
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df5)
   A                B            C      D
0  a          one,two  three,three    1,3
1  b  three,two,three  two,two,one  3,2,2
2  c          two,one      one,two    2,1
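
Two small variations on the string aggregations above, as a sketch: collecting each group into tuples, and joining only the distinct values in sorted order. The names as_tuples and unique_joined are made up for this example; the data is the A and B columns of the frame used in this question.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three']})

# tuple works like list: every group's values are collected in row order.
as_tuples = df.groupby('A')['B'].agg(tuple).reset_index()

# Deduplicate with set, sort for a stable order, then join with a separator.
unique_joined = (df.groupby('A')['B']
                   .agg(lambda x: ','.join(sorted(set(x))))
                   .reset_index())

print(as_tuples)
print(unique_joined)
```

Sorting before joining is a design choice made here because sets are unordered; without it the joined string could differ between runs.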

Question 4

How can I aggregate counts?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                   'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
   A      B      C    D
0  a    one  three  NaN
1  c    two    NaN  2.0
2  b  three    NaN  3.0
3  b    two    two  2.0
4  a    two  three  3.0
5  c    one    two  NaN
6  b  three    one  2.0

Use the function GroupBy.size for the size of each group (missing values are included):

df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
   A  COUNT
0  a      2
1  b      3
2  c      2

Function GroupBy.count excludes missing values:

df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
   A  COUNT
0  a      2
1  b      2
2  c      1

Use the same function without selecting a column to count the non-missing values in multiple columns:

df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
   A  B_COUNT  C_COUNT  D_COUNT
0  a        2        2        1
1  b        3        2        3
2  c        2        1        1

A related function is Series.value_counts. It returns a Series containing the counts of unique values in descending order, so that the first element is the most frequently occurring one. It excludes NaN values by default.

df4 = (df['A'].value_counts()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df4)
   A  COUNT
0  b      3
1  a      2
2  c      2

If you want the same output as with groupby + size, add Series.sort_index:

df5 = (df['A'].value_counts()
              .sort_index()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df5)
   A  COUNT
0  a      2
1  b      3
2  c      2
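
To see the difference between size and count side by side, the sketch below computes both for column C in one agg call and derives the number of missing values per group. It also shows value_counts with dropna=False, which keeps NaN as its own category. The names counts and with_nan and the column name missing are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'C': ['three', np.nan, np.nan, 'two', 'three', 'two', 'one']})

# 'size' counts all rows per group, 'count' only the non-missing ones.
counts = df.groupby('A')['C'].agg(['size', 'count']).reset_index()

# Use bracket access: counts.size would hit the DataFrame.size attribute.
counts['missing'] = counts['size'] - counts['count']
print(counts)

# value_counts drops NaN by default; dropna=False counts it as well.
with_nan = df['C'].value_counts(dropna=False)
print(with_nan)
```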

Question 5

How can I create a new column filled by aggregated values?

Method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

See the Pandas documentation for more information.

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                    'B' : ['one', 'two', 'three','two', 'two', 'one'],
                    'C' : np.random.randint(5, size=6),
                    'D' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D
0  foo    one  2  3
1  foo    two  4  1
2  bar  three  2  1
3  foo    two  1  0
4  bar    two  3  1
5  foo    one  2  1


df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')


df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')

print (df)

     A      B  C  D  C1  C2  C3  D3  C4  D4
0  foo    one  2  3   9   4   9   5   4   4
1  foo    two  4  1   9   5   9   5   5   1
2  bar  three  2  1   5   2   5   2   2   1
3  foo    two  1  0   9   5   9   5   5   1
4  bar    two  3  1   5   3   5   2   3   1
5  foo    one  2  1   9   4   9   5   4   4
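
Because the transformed result is aligned row by row with the original frame, it can be combined directly with the original columns. The sketch below contrasts the length of an aggregated result with a transformed one and then computes each row's share of its group total. The column names C_share and C_minus_group_mean are made up for this example, and the C values are copied from the printed frame.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# Aggregation: one value per group (2 rows).
agg_result = df.groupby('A')['C'].sum()

# Transformation: the group value repeated for every original row (6 rows).
group_total = df.groupby('A')['C'].transform('sum')

# Same index as df, so element-wise arithmetic lines up without a merge.
df['C_share'] = df['C'] / group_total
df['C_minus_group_mean'] = df['C'] - df.groupby('A')['C'].transform('mean')

print(len(agg_result), len(group_total))
print(df)
```

The alternative would be to aggregate and then merge the result back on A; transform avoids that extra join.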
