如何按范围分组数据? [英] How to group data by ranges?

查看:248
本文介绍了如何按范围分组数据?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下熊猫数据框(只是一小部分):

GROUP   AVG_PERCENT_EVAL_1  AVG_PERCENT_NEGATIVE    AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA   19                  11.000000               25.000000           163.000000
AAAAA   22                  2.000000                146.364198          332.761317
AAAAA   23                  0.500000                44.068225           302.708639
AAAAA   24                  1.000000                122.672215          322.359795
AAAAA   26                  1.000000                143.594896          317.940989
BBBBB   18                  1.000000                121.225692          319.292226
BBBBB   19                  1.000000                40.054707           201.096152
BBBBB   21                  0.333333                29.221458           207.142059
BBBBB   27                  2.000000                103.796290          313.685358

我需要使AVG_PERCENT_EVAL_1更加连续,这意味着,而不是确切的值,例如181920等,我想放置范围,例如18-2021-23等,直到40岁左右.

应按GROUP对数据进行分组,然后应将每个相应范围的AVG_PERCENT_NEGATIVEAVG_TOTAL_WAIT_TIMEAVG_TOTAL_SERVICE_TIME平均.

重要提示:让我们以范围18-20为例.组AAA具有对应的条目,其中AVG_PERCENT_EVAL_1等于19,而组BBBBB具有两个处于此范围内的条目-1819. AVG_PERCENT_NEGATIVEAVG_TOTAL_WAIT_TIMEAVG_TOTAL_SERVICE_TIME的值应为以下值:

GROUP   AVG_PERCENT_RANGE   AVG_PERCENT_NEGATIVE    AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA   18-20               11.00                   25.00               163.000000
BBBBB   18-20               1.00                    80.64               260,19

我知道如何按具体列对数据进行分组,然后使用agg计算平均值"或计数".但是,在这种情况下,我不知道如何为AVG_PERCENT_EVAL_1创建范围.另外,我不知道如何定义缺少的条目不应被视为0.例如,在上面的示例中,没有有关AAAAAAVG_PERCENT_EVAL_1等于1820的信息,因此我只想取19 的值而没有求平均值18200值.

解决方案

步骤:

1)使用的包容性,而right=False将使右端点成为开放间隔")".

2)使用返回的类别,根据需要重新标记它们.

3)Peform groupby GROUP 和新计算的合并范围作为分组键,在从其中删除 AVG_PERCENT_EVAL_1 后汇总所有当前列的均值. /p>


合并部分:

step=3
kwargs = dict(include_lowest=True, right=False)
bins = pd.cut(df.AVG_PERCENT_EVAL_1, bins=np.arange(18,40+step,step), **kwargs)
labels = [(str(int(cat[1:3])) + "-" + str(int(cat[5:7])-1)) for cat in bins.cat.categories]
bins.cat.categories = labels

分配和groupby.agg():

df = df.assign(AVG_PERCENT_RANGE=bins).drop("AVG_PERCENT_EVAL_1", axis=1)
df.groupby(['GROUP', 'AVG_PERCENT_RANGE'], as_index=False).agg('mean')

I have the following pandas dataframe (it is just a small extract):

GROUP   AVG_PERCENT_EVAL_1  AVG_PERCENT_NEGATIVE    AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA   19                  11.000000               25.000000           163.000000
AAAAA   22                  2.000000                146.364198          332.761317
AAAAA   23                  0.500000                44.068225           302.708639
AAAAA   24                  1.000000                122.672215          322.359795
AAAAA   26                  1.000000                143.594896          317.940989
BBBBB   18                  1.000000                121.225692          319.292226
BBBBB   19                  1.000000                40.054707           201.096152
BBBBB   21                  0.333333                29.221458           207.142059
BBBBB   27                  2.000000                103.796290          313.685358

I need to make AVG_PERCENT_EVAL_1 more continuous, which means that, instead of exact values e.g. 18, 19, 20, etc., I want to put ranges e.g. 18-20,21-23, and so on till around 40.

The data should be grouped by GROUP and then AVG_PERCENT_NEGATIVE, AVG_TOTAL_WAIT_TIME and AVG_TOTAL_SERVICE_TIME should be averaged for each corresponding range.

IMPORTANT: Let's take the range 18-20. The group AAA has corresponding entry with AVG_PERCENT_EVAL_1 equal to 19, while the group BBBBB has two entries falling in this range - 18 and 19. The values of AVG_PERCENT_NEGATIVE, AVG_TOTAL_WAIT_TIME and AVG_TOTAL_SERVICE_TIME should be these ones:

GROUP   AVG_PERCENT_RANGE   AVG_PERCENT_NEGATIVE    AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA   18-20               11.00                   25.00               163.000000
BBBBB   18-20               1.00                    80.64               260,19

I know how to group data by concrete columns and then calculate 'mean' or 'count' using agg. However, in this case I don't know how to create ranges for AVG_PERCENT_EVAL_1. Also, I don't know how to define that lacking entries should not be considered as 0. For instance, in the above example there is no information about AAAAA with AVG_PERCENT_EVAL_1 equal to 18 and 20, therefore I just want to take the values for 19 without averaging with 0 values for 18 and 20.

解决方案

Steps:

1) Bin AVG_PERCENT_EVAL_1 into appropriate labels using pd.cut() by specifying a bin sequence.

Specifying include_lowest=True would take care of inclusiveness of the left endpoint "[" whereas right=False would make the right endpoint an open interval ")".

2) Using the returned categories, re-label them as per desired requirements.

3) Peform groupby making GROUP and the newly computed binned ranges as the grouped key, aggregate the means of all present columns after dropping AVG_PERCENT_EVAL_1 from them.


binning portion:

step=3
kwargs = dict(include_lowest=True, right=False)
bins = pd.cut(df.AVG_PERCENT_EVAL_1, bins=np.arange(18,40+step,step), **kwargs)
labels = [(str(int(cat[1:3])) + "-" + str(int(cat[5:7])-1)) for cat in bins.cat.categories]
bins.cat.categories = labels

assign and groupby.agg():

df = df.assign(AVG_PERCENT_RANGE=bins).drop("AVG_PERCENT_EVAL_1", axis=1)
df.groupby(['GROUP', 'AVG_PERCENT_RANGE'], as_index=False).agg('mean')

这篇关于如何按范围分组数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆