Pandas-GroupBy,然后在原始表上合并 [英] Pandas - GroupBy and then Merge on original table

查看:331
本文介绍了Pandas-GroupBy,然后在原始表上合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试编写一个函数来对Pandas中的数据框进行汇总并执行各种统计计算,然后将其合并到原始数据框,但是,我遇到了问题.这与SQL中的代码等效:

I'm trying to write a function to aggregate and perform various stats calcuations on a dataframe in Pandas and then merge it to the original dataframe however, I'm running to issues. This is code equivalent in SQL:

SELECT EID,
       PCODE,
       SUM(PVALUE) AS PVALUE,
       SUM(SQRT(SC*EXP(SC-1))) AS SC,
       SUM(SI) AS SI,
       SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE 

然后加入原始表:

SELECT *
FROM foo_bar_grp INNER JOIN 
foo_bar ON foo_bar.EID = foo_bar_grp.EID 
        AND foo_bar.PCODE = foo_bar_grp.PCODE

以下是步骤:加载数据 IN:>>

pol_dict = {'PID':[1,1,2,2],
             'EID':[123,123,123,123],
             'PCODE':['GU','GR','GU','GR'],
             'PVALUE':[100,50,150,300],
             'SI':[400,40,140,140],
             'SC':[230,23,213,213],
             'EE':[10000,10000,2000,30000],
             }


pol_df = DataFrame(pol_dict)

pol_df

OUT:>>

   EID    EE PCODE  PID  PVALUE   SC   SI
0  123  10000    GU    1     100  230  400
1  123  10000    GR    1      50   23   40
2  123   2000    GU    2     150  213  140
3  123  30000    GR    2     300  213  140

第2步:对数据进行计算和分组:

我的熊猫代码如下:

#create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])

#generate acc level aggregate
acc_df = po_grouped_df.agg({
    'PVALUE' : np.sum,
    'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
    'SC' : np.sum,
    'EE' : np.sum
})

在我想加入原始表之前,该方法工作正常:

This works fine until I want to join on the original table:

IN:>>

po_account_df = pd.merge(acc_df, po_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))

OUT:>> KeyError:您没有名为EID的商品

OUT:>> KeyError: u'no item named EID'

由于某些原因,分组后的数据框无法联接回原始表.我已经研究了尝试将groupby列转换为实际列的方法,但这似乎行不通.

For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.

请注意,最终目标是能够找到每一列(PVALUE,SI,SC,EE)IE的百分比:

Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE) IE:

pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)

谢谢!

推荐答案

默认情况下,groupby输出将分组列作为索引,而不是列,这就是合并失败的原因.

By default, groupby output has the grouping columns as indicies, not columns, which is why the merge is failing.

有两种不同的处理方式,最简单的方法是在定义groupby对象时使用as_index参数.

There are a couple different ways to handle it, probably the easiest is using the as_index parameter when you define the groupby object.

po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)

然后,您的合并应该会按预期进行.

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]: 
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50   
1  123    GR     236   40000  1.805222e+31         350  30000        300   
2  123    GU     443   12000  8.765549e+87         250  10000        100   
3  123    GU     443   12000  8.765549e+87         250   2000        150   

   SC_Po  SI_Po  
0     23     40  
1    213    140  
2    230    400  
3    213    140  

这篇关于Pandas-GroupBy,然后在原始表上合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆