Pandas Dataframe GroupBy - Displaying Group Statistics


Problem description

For the Pandas dataframe:

import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({'id': range(1, N + 1),
                   'code': [random.choice(codes) for i in range(1, N + 1)],
                   'colour': [random.choice(colours) for i in range(1, N + 1)],
                   'texture': [random.choice(textures) for i in range(1, N + 1)],
                   'size': [random.randint(1, 100) for i in range(1, N + 1)]
                   }, columns=['id', 'code', 'colour', 'texture', 'size'])

I run the line below to get the aggregated sizes grouped by code and colour pairs:

grouped = df.groupby(['code', 'colour']).agg({'size' : np.sum}).reset_index()
>> grouped
>>     code colour  size
>> 0    one  black   987
>> 1    one  white   972
>> 2  three  black   972
>> 3  three  white   488
>> 4    two  black  1162
>> 5    two  white  1158
>> [6 rows x 3 columns]

In addition to the aggregated (np.sum) sizes, I want to get separate columns for:

i. the average value (np.mean) per group

ii. the id of the row with the max size for a given group

iii. how many times the group occurred (e.g. code=one, colour=black, 12 times)

Question: What is the fastest way to do this? Would I have to use apply() and a custom function?

Solution

You can pass a list of functions to be applied to the group, e.g.:

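grouped = df.groupby(['code', 'colour'])['size'].agg(['sum', 'mean', 'size', 'idxmax']).reset_index()

Since idxmax returns the index label of the row holding the maximum in each group, you will need to look those labels up on the original dataframe to get the ids. (Older pandas versions used np.argmax and df.ix for this; .ix has since been removed, so idxmax and .loc are used here.)

grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].values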
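For reference, here is a minimal end-to-end sketch of the same idea using named aggregation, assuming a reasonably recent pandas (0.25 or later). The output column names (total, mean, count, idxmax) and the fixed random seed are illustrative choices, not part of the original answer.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed so the sketch is reproducible

codes = ["one", "two", "three"]
colours = ["black", "white"]
N = 100
df = pd.DataFrame({
    "id": range(1, N + 1),
    "code": [random.choice(codes) for _ in range(N)],
    "colour": [random.choice(colours) for _ in range(N)],
    "size": [random.randint(1, 100) for _ in range(N)],
})

# Each keyword becomes an output column; the value names the aggregation.
grouped = (
    df.groupby(["code", "colour"])["size"]
      .agg(total="sum", mean="mean", count="size", idxmax="idxmax")
      .reset_index()
)

# idxmax holds index labels of df, so map them back to the 'id' column.
grouped["max_row_id"] = df.loc[grouped["idxmax"], "id"].to_numpy()

print(grouped)
```

Everything here is built-in groupby machinery, so no apply() with a custom function is needed for the three statistics asked about.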

NOTE: I selected the 'size' column because all the functions apply to that column. If you want to apply a different set of functions to different columns, you can pass agg a dictionary that maps each column to a list of functions, e.g. agg({'size': ['sum', 'mean', 'idxmax']}). This results in MultiIndex columns, which means that when getting the ids for the maximum size in each group you need to do:

grouped['max_row_id'] = df.loc[grouped['size']['idxmax'], 'id'].values
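The MultiIndex columns produced by a dictionary of lists can be awkward to work with, so one option is to flatten them before the lookup. The sketch below assumes a small randomly generated frame and joins the two column levels with an underscore; the flattened names (size_sum, size_idxmax, and so on) and the extra count on the 'id' column are illustrative assumptions.

```python
import random

import pandas as pd

random.seed(1)  # assumption: fixed seed so the sketch is reproducible

df = pd.DataFrame({
    "id": range(1, 51),
    "code": [random.choice(["one", "two", "three"]) for _ in range(50)],
    "colour": [random.choice(["black", "white"]) for _ in range(50)],
    "size": [random.randint(1, 100) for _ in range(50)],
})

# A different list of functions per column yields two-level column labels,
# e.g. ('size', 'sum') and ('id', 'count').
stats = df.groupby(["code", "colour"]).agg(
    {"size": ["sum", "mean", "idxmax"], "id": ["count"]}
)

# Join the two levels into flat names such as 'size_sum'.
stats.columns = ["_".join(col) for col in stats.columns]
stats = stats.reset_index()

# Look up the id of the row with the largest size in each group.
stats["max_row_id"] = df.loc[stats["size_idxmax"], "id"].to_numpy()

print(stats)
```

Flattening is optional; indexing with a tuple such as stats[('size', 'idxmax')] on the unflattened frame is equivalent.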
