What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?


Question

I'm having trouble transitioning from R to pandas. In R, the dplyr package makes it easy to group by a column and compute multiple summaries.

Please help improve my existing Python pandas code for multiple aggregations:

import pandas as pd
data = pd.DataFrame(
    {'col1':[1,1,1,1,1,2,2,2,2,2],
    'col2':[1,2,3,4,5,6,7,8,9,0],
     'col3':[-1,-2,-3,-4,-5,-6,-7,-8,-9,0]
    }
)
result = []
for k,v in data.groupby('col1'):
    result.append([k, max(v['col2']), min(v['col3'])])
print(pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg']))

Problems:

  • too verbose
  • probably can be optimized and made more efficient. (I rewrote a for-loop groupby implementation into groupby.agg and the performance improvement was huge.)

In R, the equivalent code is:

data %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))


UPDATE: @ayhan solved my question. Here is a follow-up question that I will post here instead of as a comment:

Q2) What is the equivalent of group_by().summarize(newcolumn=max(col2 * col3)), i.e. an aggregation/summarization where the function is a compound function of 2+ columns?

Recommended answer

The equivalent of

df %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

is

df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})

which returns

      col2  col3
col1            
1        5    -5
2        9    -9

The returned object is a pandas.DataFrame with an index called col1 and columns named col2 and col3. By default, when you group your data, pandas sets the grouping column(s) as the index for efficient access and modification. However, if you don't want that, there are two ways to keep col1 as a column.

  • Pass as_index=False:

df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})

  • Call reset_index:

    df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
    

    Both yield

       col1  col2  col3
    0     1     5    -5
    1     2     9    -9
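As a quick check of the claim that both options give the same frame, the following sketch rebuilds the question's data (the name df and the helper variable spec are chosen here for illustration) and compares the two results.

```python
import pandas as pd

# Same data as in the question, stored under the name df used in the answer.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

spec = {'col2': 'max', 'col3': 'min'}

# Option 1: do not move the grouping column into the index at all.
via_flag = df.groupby('col1', as_index=False).agg(spec)

# Option 2: aggregate with col1 as the index, then turn it back into a column.
via_reset = df.groupby('col1').agg(spec).reset_index()

print(via_flag)
print(via_flag.equals(via_reset))
```

Either form works; as_index=False avoids building the index in the first place, while reset_index is convenient when the grouped result already exists.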
    

    You can also pass multiple functions to groupby.agg.

    agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'], 
                                     'col3': ['size', 'std', 'mean', 'max']})
    

    This also returns a DataFrame, but now it has a MultiIndex for the columns.

         col2               col3                   
          max min       std size       std mean max
    col1                                           
    1       5   1  1.581139    5  1.581139   -3  -1
    2       9   0  3.535534    5  3.535534   -6   0
    

    A MultiIndex is very handy for selection and grouping. Here are some examples:

    agg_df['col2']  # select the second column
          max  min       std
    col1                    
    1       5    1  1.581139
    2       9    0  3.535534
    
    agg_df[('col2', 'max')]  # select the maximum of the second column
    Out: 
    col1
    1    5
    2    9
    Name: (col2, max), dtype: int64
    
    agg_df.xs('max', axis=1, level=1)  # select the maximum of all columns
    Out: 
          col2  col3
    col1            
    1        5    -1
    2        9     0
    

    Earlier (before version 0.20.0) it was possible to use dictionaries for renaming the columns in the agg call. For example,

    df.groupby('col1')['col2'].agg({'max_col2': 'max'})
    

    would return the maximum of the second column as max_col2:

          max_col2
    col1          
    1            5
    2            9
    

    However, that usage was deprecated in favor of the rename method:

    df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})
    
          col2_max
    col1          
    1            5
    2            9
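Newer pandas releases (0.25 and later) also offer named aggregation, which is not covered in the answer above but maps closely onto dplyr's summarize(col2_agg=max(col2), col3_agg=min(col3)). A minimal sketch, assuming pandas 0.25 or later and the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Named aggregation: each keyword is an output column name, and each value
# is a (source column, aggregation function) pair.
result = df.groupby('col1').agg(
    col2_agg=('col2', 'max'),
    col3_agg=('col3', 'min'),
)
print(result)
```

This names the output columns in the same call, so no separate rename step is needed.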
    

    This can get verbose for a DataFrame like agg_df defined above. In that case, you can flatten the column levels by joining them:

    agg_df.columns = ['_'.join(col) for col in agg_df.columns]
    
          col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
    col1                                                                        
    1            5         1  1.581139          5  1.581139         -3        -1
    2            9         0  3.535534          5  3.535534         -6         0
    

    For operations like group_by().summarize(newcolumn=max(col2 * col3)), you can still use agg by first adding a new column with assign.

    df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max') 
    
          col2  col3  new_col
    col1                     
    1        5    -1       -1
    2        9     0        0
    

    This returns the maximum for both the old and the new columns, but as always you can select just the one you need.

    df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')
    
    col1
    1   -1
    2    0
    Name: new_col, dtype: int64
    

    With groupby.apply this would be shorter:

    df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())
    
    col1
    1   -1
    2    0
    dtype: int64
    

    However, groupby.apply treats this as a custom function, so it is not vectorized. The functions we have passed to agg so far ('min', 'max', 'mean', 'size', etc.) are vectorized; the strings are aliases for those optimized functions. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function. You will not see the same efficiency when you use custom functions.
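The two answers to Q2 can be checked against each other. The sketch below is an illustration rather than a benchmark: it computes the product column once up front for the vectorized path, and for the apply path it selects col2 and col3 explicitly before calling apply (a small variation on the one-liner above, chosen here so the lambda only sees the non-grouping columns).

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Vectorized path: build the compound column once, then use the built-in max.
via_assign = (
    df.assign(new_col=df['col2'] * df['col3'])
      .groupby('col1')['new_col']
      .max()
)

# Custom-function path: the lambda runs once per group in Python.
via_apply = (
    df.groupby('col1')[['col2', 'col3']]
      .apply(lambda g: (g['col2'] * g['col3']).max())
)

print(via_assign.tolist())
print(via_apply.tolist())
```

Both give the same per-group maxima; the difference is only in how the work is done, which matters more as the number of groups grows.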

    Lastly, as of version 0.20, agg can be used on DataFrames directly, without having to group first. See the DataFrame.agg documentation for examples.
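A short sketch of DataFrame.agg without grouping, using the question's data (the variable names summary and per_col are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# A list of functions gives one row per function and one column per input column.
summary = df.agg(['min', 'max'])
print(summary)

# A dict with one function per column gives a Series indexed by column name.
per_col = df.agg({'col2': 'max', 'col3': 'min'})
print(per_col)
```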
