Pandas Dataframe GroupBy - Displaying Group Statistics
Problem description
For the Pandas dataframe:
import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe

df = pd.DataFrame({
    'id': range(1, N + 1),
    'code': [random.choice(codes) for i in range(1, N + 1)],
    'colour': [random.choice(colours) for i in range(1, N + 1)],
    'texture': [random.choice(textures) for i in range(1, N + 1)],
    'size': [random.randint(1, 100) for i in range(1, N + 1)]
}, columns=['id', 'code', 'colour', 'texture', 'size'])
I run the line below to get the aggregated sizes grouped by code and colour pairs:
grouped = df.groupby(['code', 'colour']).agg({'size': np.sum}).reset_index()
>> grouped
>>     code colour  size
>> 0    one  black   987
>> 1    one  white   972
>> 2  three  black   972
>> 3  three  white   488
>> 4    two  black  1162
>> 5    two  white  1158
>> [6 rows x 3 columns]
In addition to the aggregated (np.sum) sizes, I want to get separate columns for:
i. the average value (np.mean) per group,
ii. the id of the row with the max size for a given group,
iii. how many times the group occurred (e.g. code=one, colour=black, 12 times).
Question: What is the fastest way to do this? Would I have to use apply() and a custom function?
Solution
You can pass a list of functions to be applied to the group, e.g.:
grouped = df.groupby(['code', 'colour'])['size'].agg(['sum', 'mean', 'size', 'idxmax']).reset_index()
Since idxmax is the index label of the row holding the maximum size in each group, you need to look the ids up on the original dataframe:
grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].values
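As a rough end-to-end sketch of the list-of-functions approach, the snippet below builds a frame like the one in the question and computes all requested statistics in a single agg call. It assumes pandas 0.25 or later (for keyword-style named aggregation); the fixed random seed and the output column names total, average, n_rows and max_idx are illustrative choices, not part of the original answer.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed (an assumption) so repeated runs give the same frame

codes = ["one", "two", "three"]
colours = ["black", "white"]
N = 100

df = pd.DataFrame({
    "id": range(1, N + 1),
    "code": [random.choice(codes) for _ in range(N)],
    "colour": [random.choice(colours) for _ in range(N)],
    "size": [random.randint(1, 100) for _ in range(N)],
})

# One agg call per group; each keyword becomes an output column name.
stats = (
    df.groupby(["code", "colour"])["size"]
      .agg(total="sum", average="mean", n_rows="size", max_idx="idxmax")
      .reset_index()
)

# idxmax yields index labels of df, so .loc can fetch the matching ids.
# .to_numpy() drops df's index so the ids are assigned by position.
stats["max_row_id"] = df.loc[stats["max_idx"], "id"].to_numpy()

print(stats)
```

If several rows in a group share the maximum size, idxmax reports the first of them, so max_row_id is one of the tied ids rather than all of them.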
NOTE: I selected the 'size' column because all the functions apply to that column. If you want a different set of functions for different columns, you can use agg with a dictionary that maps each column to a list of functions, e.g. agg({'size': ['sum', 'mean', 'idxmax']}). This results in MultiIndex columns, which means that when getting the IDs for the maximum size in each group you need to do:
grouped['max_row_id'] = df.loc[grouped['size']['idxmax'], 'id'].values
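To illustrate the dictionary form and the MultiIndex columns it produces, here is a small sketch on a hand-made frame (the five rows are hypothetical data chosen so the result is easy to verify by eye). Flattening the two column levels into names like size_sum is one possible convention, not something the original answer prescribes.

```python
import pandas as pd

# Small hypothetical frame: two groups with known sums and maxima.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "code": ["one", "one", "two", "two", "two"],
    "colour": ["black", "black", "white", "white", "white"],
    "size": [10, 30, 5, 50, 20],
})

grouped = df.groupby(["code", "colour"]).agg({"size": ["sum", "mean", "idxmax"]})

# Columns are two-level here: ('size', 'sum'), ('size', 'mean'), ('size', 'idxmax').
# Joining the levels gives flat names that are easier to work with.
grouped.columns = ["_".join(col) for col in grouped.columns]
grouped = grouped.reset_index()

# Look up the id of the row that holds each group's maximum size.
grouped["max_row_id"] = df.loc[grouped["size_idxmax"], "id"].to_numpy()

print(grouped)
```

Flattening before reset_index keeps the group keys as plain column names instead of tuples with an empty second level.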