Pandas groupby and aggregation output should include all the original columns (including the ones not aggregated on)
Question
I have the following data frame and want to:
- Group the records by month
- Sum QTY_SOLD and NET_AMT for each unique UPC_ID (per month)
- Include the rest of the columns in the resulting dataframe as well
The way I thought I could do this is to first create a month column to aggregate the D_DATE values, then sum QTY_SOLD by UPC_ID.
Script:
# Convert date to date time object
df['D_DATE'] = pd.to_datetime(df['D_DATE'])
# Create aggregated months column
df['month'] = df['D_DATE'].apply(dt.date.strftime, args=('%Y.%m',))
# Group by month and sum up quantity sold by UPC_ID
df = df.groupby(['month', 'UPC_ID'])['QTY_SOLD'].sum()
Current data frame:
UPC_ID | UPC_DSC | D_DATE | QTY_SOLD | NET_AMT
----------------------------------------------
111 desc1 2/26/2017 2 10 (2 x $5)
222 desc2 2/26/2017 3 15
333 desc3 2/26/2017 1 4
111 desc1 3/1/2017 1 5
111 desc1 3/3/2017 4 20
Desired output:
MONTH | UPC_ID | QTY_SOLD | NET_AMT | UPC_DSC
----------------------------------------------
2017-2 111 2 10 etc...
2017-2 222 3 15
2017-2 333 1 4
2017-3 111 5 25
Actual output:
MONTH | UPC_ID
----------------------------------------------
2017-2 111 2
222 3
333 1
2017-3 111 5
...
Questions:
- How do I include the month for each row?
- How do I include the rest of the columns of the dataframe?
- How do I also sum NET_AMT in addition to QTY_SOLD?
Recommended answer
agg with a dict of functions
Create a dict of functions and pass it to agg. You'll also need as_index=False to prevent the group columns from becoming the index in your output.
f = {'NET_AMT': 'sum', 'QTY_SOLD': 'sum', 'UPC_DSC': 'first'}
df.groupby(['month', 'UPC_ID'], as_index=False).agg(f)
month UPC_ID UPC_DSC NET_AMT QTY_SOLD
0 2017.02 111 desc1 10 2
1 2017.02 222 desc2 15 3
2 2017.02 333 desc3 4 1
3 2017.03 111 desc1 25 5
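For anyone who wants to run this end to end, here is a minimal self-contained sketch of the dict-of-functions approach. The sample rows are copied from the question's table; building the month column with `.dt.strftime` and the explicit `format=` argument are my own choices, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question's "current data frame".
df = pd.DataFrame({
    'UPC_ID': [111, 222, 333, 111, 111],
    'UPC_DSC': ['desc1', 'desc2', 'desc3', 'desc1', 'desc1'],
    'D_DATE': ['2/26/2017', '2/26/2017', '2/26/2017', '3/1/2017', '3/3/2017'],
    'QTY_SOLD': [2, 3, 1, 1, 4],
    'NET_AMT': [10, 15, 4, 5, 20],
})

# Parse the dates, then derive a year.month label to group on.
df['D_DATE'] = pd.to_datetime(df['D_DATE'], format='%m/%d/%Y')
df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

# One aggregation per column: sum the amounts, keep the first description.
f = {'NET_AMT': 'sum', 'QTY_SOLD': 'sum', 'UPC_DSC': 'first'}

# as_index=False keeps 'month' and 'UPC_ID' as ordinary columns on every row.
result = df.groupby(['month', 'UPC_ID'], as_index=False).agg(f)
print(result)
```

Any column left out of the dict (here D_DATE) is dropped from the result, so every column you want to keep needs an entry.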
Blanket sum
Just call sum without any column names. This handles the numeric columns. For UPC_DSC, you'll need to handle it separately.
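g = df.groupby(['month', 'UPC_ID'])
# Restrict the sum to numeric columns; newer pandas no longer skips the others silently
i = g.sum(numeric_only=True)
# Take the first description seen in each group
j = g[['UPC_DSC']].first()
# axis must be passed by keyword in recent pandas versions
pd.concat([i, j], axis=1).reset_index()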
month UPC_ID QTY_SOLD NET_AMT UPC_DSC
0 2017.02 111 2 10 desc1
1 2017.02 222 3 15 desc2
2 2017.02 333 1 4 desc3
3 2017.03 111 5 25 desc1
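If the frame has many columns, writing the dict by hand gets tedious. One possible way to combine the two ideas above is to build the dict from the dtypes: 'sum' for numeric columns and 'first' for everything else. This is only a sketch; `build_agg_spec` is a hypothetical helper name of mine, and the sum/first rule is an assumption that may not suit every column.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def build_agg_spec(frame, keys):
    """Hypothetical helper: 'sum' for numeric columns, 'first' for the rest."""
    return {
        col: 'sum' if is_numeric_dtype(frame[col]) else 'first'
        for col in frame.columns
        if col not in keys  # grouping keys are not aggregated
    }


# Sample rows copied from the question's table.
df = pd.DataFrame({
    'UPC_ID': [111, 222, 333, 111, 111],
    'UPC_DSC': ['desc1', 'desc2', 'desc3', 'desc1', 'desc1'],
    'D_DATE': ['2/26/2017', '2/26/2017', '2/26/2017', '3/1/2017', '3/3/2017'],
    'QTY_SOLD': [2, 3, 1, 1, 4],
    'NET_AMT': [10, 15, 4, 5, 20],
})
df['D_DATE'] = pd.to_datetime(df['D_DATE'], format='%m/%d/%Y')
df['month'] = df['D_DATE'].dt.strftime('%Y.%m')

keys = ['month', 'UPC_ID']
spec = build_agg_spec(df, keys)
out = df.groupby(keys, as_index=False).agg(spec)
print(out)
```

Note that 'first' returns the first non-null value in each group, so it is only meaningful for columns such as UPC_DSC that are constant within a group. For D_DATE it simply keeps the earliest row's date as it appears in the input order.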