Pandas Pivot_Table : Percentage of row calculation for non-numeric values

Question

This is my data in the DataFrame "df":

Document    Name    Time
SPS2315511  A   1 HOUR
SPS2315512  B   1 - 2 HOUR
SPS2315513  C   2 - 3 HOUR
SPS2315514  C   1 HOUR
SPS2315515  B   1 HOUR
SPS2315516  A   2 - 3 HOUR
SPS2315517  A   1 - 2 HOUR
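To make the example reproducible, here is a minimal sketch that builds the same DataFrame by hand. The values are copied from the table above; only the construction itself is added for illustration.

```python
import pandas as pd

# Sample data copied row by row from the table above.
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})

print(df)
```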

I am using the code below, which gives me a summary of counts in the pivot table:

import pandas as pd

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"],
                       aggfunc=len,  # count the documents in each cell
                       margins=True, dropna=True)

But what I want is the percentage-of-row calculation, as in an Excel pivot table when you right-click the pivot and select "Show Values As -> % of Row Total". Since my Document column is non-numeric, I was not able to get it.

Count of Document   Column Labels

Name    1 HOUR  1 - 2 HOUR  2 - 3 HOUR  Grand Total
A   33.33%  33.33%  33.33%  100.00%
B   50.00%  50.00%  0.00%   100.00%
C   50.00%  0.00%   50.00%  100.00%
Grand Total 42.86%  28.57%  28.57%  100.00%

Can anyone please help me figure out a way to get this result?

I am trying to work from the pivot table output, which gives me the row totals, rather than from the original DataFrame, and what I want is "% of row total". Most importantly, all of my data are non-numeric values.

Recommended answer

The possible duplicate noted by @maxymoo is pretty close to a solution, but I'll go ahead and write it up as an answer since there are a couple of differences that are not completely straightforward.

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"], 
                       aggfunc=len, margins=True, 
                       dropna=True, fill_value=0)

       Document                      
Time 1 - 2 HOUR 1 HOUR 2 - 3 HOUR All
Name                                 
A             1      1          1   3
B             1      1          0   2
C             0      1          1   2
All           2      3          2   7

The main tweak there is to add fill_value=0 because what you really want there is a count value of zero, not a NaN.
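To see what fill_value=0 changes, here is a small sketch comparing the two calls. It assumes the sample data from the question, and it passes values as a plain string (rather than a list) and leaves out margins so that the columns stay single-level and easy to index; both are simplifications for illustration only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})

# Without fill_value, Name/Time combinations that never occur show up as NaN.
without_fill = pd.pivot_table(df, values="Document", index="Name",
                              columns="Time", aggfunc=len)

# With fill_value=0, the same cells hold a count of zero instead.
with_fill = pd.pivot_table(df, values="Document", index="Name",
                           columns="Time", aggfunc=len, fill_value=0)

print(without_fill)
print(with_fill)
```

Name B has no "2 - 3 HOUR" document, so that cell is the one to compare between the two printouts.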

Then you can basically use the solution @maxymoo linked to, but you need to use iloc or similar because the table columns are a little complicated now (the pivot table result has MultiIndex columns).
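As a rough illustration of why positional indexing is convenient here, the sketch below rebuilds the table from the question's sample data and selects the row totals two ways: by position with iloc and by the full column label. The variable names row_totals_by_position and row_totals_by_label are made up for this example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"],
                       aggfunc=len, margins=True,
                       dropna=True, fill_value=0)

# Because values was passed as a list, the columns have two levels:
# the value name on top and the Time labels (plus "All") underneath.
print(table.columns.tolist())

# The margins column is the last one, so it can be picked by position...
row_totals_by_position = table.iloc[:, -1]
# ...or by its full two-level label.
row_totals_by_label = table[("Document", "All")]

# Divide every row by its own total (axis=0 aligns on the row index).
table2 = table.div(row_totals_by_position, axis=0)
print(table2)
```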

table2 = table.div( table.iloc[:,-1], axis=0 )

       Document                         
Time 1 - 2 HOUR    1 HOUR 2 - 3 HOUR All
Name                                    
A      0.333333  0.333333   0.333333   1
B      0.500000  0.500000   0.000000   1
C      0.000000  0.500000   0.500000   1
All    0.285714  0.428571   0.285714   1

You've still got some minor formatting work to do there (flip first and second columns and convert to %), but those are the numbers you are looking for.
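One possible way to finish that formatting is sketched below, assuming the sample data from the question. The column order list and the names counts, pct and formatted are made up for this example, and the top "Document" level is dropped first to keep the labels simple.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"],
                       aggfunc=len, margins=True,
                       dropna=True, fill_value=0)

# Drop the top "Document" level so the columns are just the Time labels.
counts = table["Document"]

# Share of each row total.
pct = counts.div(counts["All"], axis=0)

# Put the columns in the order used in the question's Excel output.
order = ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "All"]
pct = pct[order]

# Render every cell as a percentage string with two decimals.
formatted = pct.apply(lambda col: col.map("{:.2%}".format))
print(formatted)
```

Note that formatted holds strings, so it is only suitable for display; keep pct around if further arithmetic is needed.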

Btw, it's not necessary here, but you might want to think about converting 'Time' to an ordered categorical variable, which would be one way to solve the column ordering problem (I think), but may or may not be worth the bother depending on what else you are doing with the data.
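A minimal sketch of the ordered-categorical idea follows, assuming the sample data from the question. The category list time_order is an assumption about the intended ordering, and margins are left out to keep the sketch small, so the row totals are computed with sum instead.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})

# Assumed ordering of the time buckets, shortest first.
time_order = ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR"]
df["Time"] = pd.Categorical(df["Time"], categories=time_order, ordered=True)

# The pivot columns now follow the category order instead of sorting
# alphabetically. Margins are omitted here for simplicity.
counts = pd.pivot_table(df, values="Document", index="Name",
                        columns="Time", aggfunc=len,
                        fill_value=0, observed=False)

# Row percentages, using explicit row sums in place of the "All" column.
pct = counts.div(counts.sum(axis=1), axis=0)
print(pct)
```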
