pandas DataFrame: aggregate values within blocks of repeating IDs

Question

Given a DataFrame with an ID column and corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?

Example DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
    )

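Note that there are only two unique IDs, so a simple groupby('id') won't work: it would merge every 'a' row into one group and every 'b' row into another, regardless of position. Also, the IDs don't alternate/repeat in a regular manner. What I came up with was to recreate the index so that it represents the blocks between ID changes: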

# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])

# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1
    idx.append(i)

# set as index:
df = df.set_index(np.array(idx))

# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0    5.0
# 1    3.0
# 2    2.0
# 3    1.0
# 4    1.0
# 5    3.0

This re-creation of the index feels sort-of not how you'd do this in pandas. What did I miss? Is there a better way to do this?

Recommended answer

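You need a helper Series: compare each id with the previous one (the shifted values) for inequality using ne, then take the cumulative sum, so that every block of consecutive equal IDs gets its own number. Pass that Series to groupby. To keep the id in the result, pass the id column together with the helper in a list, then remove the first level of the resulting MultiIndex with reset_index(level=0, drop=True) and convert the remaining index into an id column with a second reset_index():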

print (df['id'].ne(df['id'].shift()).cumsum())
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    4
11    5
12    6
13    6
14    6
Name: id, dtype: int32
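To see why this expression labels the blocks, it can help to lay the intermediate steps side by side. The following is only an illustrative sketch on a shorter, made-up Series (the names ids and steps are not from the original post):

```python
import pandas as pd

# Hypothetical short example, not the DataFrame from the question
ids = pd.Series(['a', 'a', 'b', 'b', 'a'], name='id')

steps = pd.DataFrame({
    'id': ids,
    # previous row's id; the first row has no predecessor, so it is missing
    'prev': ids.shift(),
    # True wherever the id differs from the previous row (always True in row 0)
    'changed': ids.ne(ids.shift()),
    # running count of changes = block number
    'block': ids.ne(ids.shift()).cumsum(),
})
print(steps)
```

Each True in changed starts a new block, and the cumulative sum turns those start markers into a block label that stays constant until the next change.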

df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df1)
  id    v
0  a  5.0
1  b  3.0
2  a  2.0
3  b  1.0
4  a  1.0
5  b  3.0

Another idea is to use GroupBy.agg with a dictionary and aggregate the id column with GroupBy.first:

df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
         .agg({'id':'first', 'v':'sum'}))
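If this is needed in more than one place, the same idea could be wrapped in a small helper. The sketch below is one possible variant, not code from the answer: the function name aggregate_blocks and its parameters are hypothetical, and it renames the helper Series and drops it with reset_index(drop=True) instead of relying on as_index=False.

```python
import numpy as np
import pandas as pd

def aggregate_blocks(frame, key='id', value='v', how='sum'):
    """Aggregate `value` within runs of consecutive equal `key` entries.

    Hypothetical helper for illustration only.
    """
    # Block label: increases by one every time `key` differs from the previous row
    block = frame[key].ne(frame[key].shift()).cumsum().rename('block')
    return (frame.groupby(block)
                 .agg({key: 'first', value: how})
                 .reset_index(drop=True))

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

result = aggregate_blocks(df)
print(result)
```

On the example DataFrame this should give the same six rows as the first solution (a 5.0, b 3.0, a 2.0, b 1.0, a 1.0, b 3.0). Passing a different how, for example 'mean', would apply that aggregation per block instead.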
