pandas DataFrame: aggregate values within blocks of repeating IDs
Question
Given a DataFrame with an ID column and corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?
Example DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
'v': np.ones(15)}
)
Note that there are only two unique IDs, so a simple groupby('id') won't work. Also, the IDs don't alternate or repeat in a regular manner. What I came up with was to re-create the index so that it represents the blocks of consecutive identical IDs:
# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])
# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1  # a new block starts here
    idx.append(i)
# set as index:
df = df.set_index(np.array(idx))
# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0 5.0
# 1 3.0
# 2 2.0
# 3 1.0
# 4 1.0
# 5 3.0
Re-creating the index like this doesn't feel like the way you'd do this in pandas. What did I miss? Is there a better way to do this?
Recommended answer
Create a helper Series: compare each id with the previous row's value (shift) for inequality using ne, then take the cumulative sum (cumsum), so that every block of consecutive equal IDs gets its own label. Pass this Series to groupby. The id column can be passed along with it in a list; afterwards, remove the first level of the resulting MultiIndex with reset_index(level=0, drop=True), and then convert the remaining index into the id column with a second reset_index():
print (df['id'].ne(df['id'].shift()).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 4
11 5
12 6
13 6
14 6
Name: id, dtype: int32
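To make it easier to see why this helper Series labels the blocks, here is a small sketch that splits the one-liner into its intermediate steps. The variable names (prev, changed, block, steps) are only illustrative and not part of the original answer.

```python
import pandas as pd

# Same id sequence as in the question's example DataFrame.
ids = pd.Series(list("aaaaabbbaababbb"), name="id")

prev = ids.shift()        # previous row's id; NaN for the first row
changed = ids.ne(prev)    # True wherever a new block starts (the first row always is)
block = changed.cumsum()  # running count of block starts, used as the block label

# Side-by-side view of the intermediate steps.
steps = pd.DataFrame({"id": ids, "prev": prev, "changed": changed, "block": block})
print(steps)
```

Because the first row is compared against NaN, it always counts as a block start, which is why the labels begin at 1 rather than 0.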
df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
id v
0 a 5.0
1 b 3.0
2 a 2.0
3 b 1.0
4 a 1.0
5 b 3.0
Another idea is to use GroupBy.agg with a dictionary, aggregating the id column with GroupBy.first:
df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
.agg({'id':'first', 'v':'sum'}))
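As a possible variant of the same idea, the sketch below uses named aggregation (available since pandas 0.25) instead of a dictionary. Renaming the helper Series to block is an assumption made here so that the grouping key cannot clash with the id column; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the question's example DataFrame so the block is self-contained.
df = pd.DataFrame({"id": list("aaaaabbbaababbb"), "v": np.ones(15)})

# Label each run of consecutive equal ids; "block" is a hypothetical name
# chosen so the key does not collide with the existing "id" column.
block = df["id"].ne(df["id"].shift()).cumsum().rename("block")

df1 = (
    df.groupby(block)
      .agg(id=("id", "first"), v=("v", "sum"))  # first id and summed v per block
      .reset_index(drop=True)                   # discard the block labels
)
print(df1)
```

Dropping the block labels at the end leaves a plain 0-based index with one row per block, in the order the blocks appear in the original frame.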