What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?
Question
I'm having issues transitioning to pandas from R, where the dplyr package can easily group by and perform multiple summarizations.
Please help improve my existing Python pandas code for multiple aggregations:
import pandas as pd

data = pd.DataFrame(
    {'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
     'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0]
    }
)
result = []
# Loop over each group and collect one summary row per group
for k, v in data.groupby('col1'):
    result.append([k, max(v['col2']), min(v['col3'])])
print(pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg']))
Issues:
- too verbose
- can probably be optimized and made more efficient. (I rewrote a for-loop groupby implementation into groupby.agg and the performance improvement was huge.)
In R, the equivalent code is:
data %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))
UPDATE: @ayhan solved my question. Here is a follow-up question that I will post here instead of as a comment:
Q2) What is the equivalent of groupby().summarize(newcolumn=max(col2 * col3)), i.e. an aggregation/summarization where the function is a compound function of 2+ columns?
Recommended answer
The equivalent of
df %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))
is
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})
which returns
      col2  col3
col1
1        5    -5
2        9    -9
The returned object is a pandas.DataFrame with an index called col1 and columns named col2 and col3. By default, when you group your data, pandas sets the grouping column(s) as the index for efficient access and modification. However, if you don't want that, there are two ways to make col1 a regular column.
- Pass as_index=False:
df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})
- Call reset_index:
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
Both yield
   col1  col2  col3
0     1     5    -5
1     2     9    -9
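As a quick check that the two options agree, here is a minimal sketch. It assumes df refers to the sample data from the question (the question itself calls it data); the names spec, flat_a and flat_b are only illustrative.

```python
import pandas as pd

# Sample data from the question; `df` is the name the answer uses for it.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

spec = {'col2': 'max', 'col3': 'min'}

# Option 1: keep the grouping column as a regular column from the start.
flat_a = df.groupby('col1', as_index=False).agg(spec)

# Option 2: aggregate with col1 as the index, then move it back to a column.
flat_b = df.groupby('col1').agg(spec).reset_index()

# Compare the two results cell by cell.
print((flat_a.values == flat_b.values).all())
```

Which one to use is mostly a matter of taste; reset_index is handy when the grouped result is already at hand.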
You can also pass multiple functions to groupby.agg.
agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'],
'col3': ['size', 'std', 'mean', 'max']})
This also returns a DataFrame but now it has a MultiIndex for columns.
     col2                col3
      max min       std  size       std mean max
col1
1       5   1  1.581139     5  1.581139   -3  -1
2       9   0  3.535534     5  3.535534   -6   0
MultiIndex is very handy for selection and grouping. Here are some examples:
agg_df['col2']  # select all aggregates of col2
Out:
      max  min       std
col1
1       5    1  1.581139
2       9    0  3.535534
agg_df[('col2', 'max')]  # select the maximum of col2
Out:
col1
1    5
2    9
Name: (col2, max), dtype: int64
agg_df.xs('max', axis=1, level=1)  # select the maximum of all columns
Out:
      col2  col3
col1
1        5    -1
2        9     0
Previously (before version 0.20.0), it was possible to use dictionaries to rename the columns in the agg call. For example
df.groupby('col1')['col2'].agg({'max_col2': 'max'})
would return the maximum of the second column as max_col2:
max_col2
col1
1 5
2 9
However, it was deprecated in favor of the rename method:
df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})
col2_max
col1
1 5
2 9
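In pandas 0.25 and later, agg also accepts keyword arguments of the form new_name=(column, function), usually called named aggregation. This maps most directly onto dplyr's summarize. A minimal sketch, assuming pandas >= 0.25 and the sample data from the question (summary is just an illustrative name):

```python
import pandas as pd

# Sample data from the question; `df` is the name the answer uses for it.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Each keyword names an output column; the tuple gives the
# input column and the aggregation to apply to it.
summary = df.groupby('col1').agg(
    col2_agg=('col2', 'max'),
    col3_agg=('col3', 'min'),
)
print(summary)
```

The output columns come back already named col2_agg and col3_agg, so no separate rename step is needed.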
It can get verbose for a DataFrame like agg_df defined above. In that case, you can flatten the column levels by joining each pair of level names into a single name:
agg_df.columns = ['_'.join(col) for col in agg_df.columns]
      col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
col1
1            5         1  1.581139          5  1.581139         -3        -1
2            9         0  3.535534          5  3.535534         -6         0
For operations like groupby().summarize(newcolumn=max(col2 * col3)), you can still use agg by first adding a new column with assign.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max')
col2 col3 new_col
col1
1 5 -1 -1
2 9 0 0
This returns maximum for old and new columns but as always you can slice that.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')
col1
1 -1
2 0
Name: new_col, dtype: int64
Using groupby.apply would be shorter:
df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())
col1
1 -1
2 0
dtype: int64
However, groupby.apply treats this as a custom function, so it is not vectorized. The functions we have passed to agg so far ('max', 'min', 'std', 'size', 'mean') are string aliases for optimized, vectorized implementations. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function (recent pandas releases emit a FutureWarning for the min and np.min forms and recommend the string alias). You will not see the same efficiency when you use custom functions.
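To make the trade-off concrete, the following sketch computes the same per-group maximum of col2 * col3 both ways and prints the results side by side. It assumes the sample data from the question; via_apply and via_agg are illustrative names, and the columns are selected explicitly before apply so that the grouping column is not passed to the lambda.

```python
import pandas as pd

# Sample data from the question; `df` is the name the answer uses for it.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Custom function: the Python lambda runs once per group.
via_apply = df.groupby('col1')[['col2', 'col3']].apply(
    lambda g: (g['col2'] * g['col3']).max()
)

# Built-in aggregation: compute the product once for the whole frame,
# then let groupby use its optimized max.
via_agg = (
    df.assign(new_col=df['col2'] * df['col3'])
      .groupby('col1')['new_col']
      .max()
)

print(via_apply.tolist(), via_agg.tolist())
```

On ten rows the difference in speed is negligible; it tends to matter only when there are many groups.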
Lastly, as of version 0.20, agg can be used on DataFrames directly, without having to group first. See the DataFrame.agg documentation for examples.
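A minimal sketch of agg without a groupby, again assuming the sample data from the question (per_column and stats are illustrative names):

```python
import pandas as pd

# Sample data from the question; `df` is the name the answer uses for it.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# A dict maps each column to its own function; the result is a Series
# indexed by column name.
per_column = df.agg({'col2': 'max', 'col3': 'min'})

# A list applies every function to every selected column; the result is
# a DataFrame with one row per function.
stats = df[['col2', 'col3']].agg(['min', 'max'])

print(per_column)
print(stats)
```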