带有花式堆叠的Pandas数据透视表 [英] Pandas Pivot Table with Fancy Stacking

查看:87
本文介绍了带有花式堆叠的Pandas数据透视表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出以下数据帧:

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
results[['Contractor','President','Item 1','Item 2']]

results

    Contractor  President   Item 1  Item 2
0        1          1       1       1
1        1          0       1       0
2        0          0       0       0
3        0          0       0       1
4        0          1       1       0
5        1          1       NaN     1

,以供参考(请参见下文):

and this for reference to items (see below):

    Position    Item(s)
0   Contractor  1
1   President   1,2

...我想透视数据以产生此结果:

...I would like to pivot the data to produce this:

    Position    Overall%
0   Contractor  100
1   President   80

...基于此逻辑:

由于总统关心项目1和2,因此需要考虑5个数字:项目1的(1和1)和项目2的(1、0和1). /strong>为4,项目总数为5(不计算"NaN" ),占80%.

Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (do not count 'NaN'), which gives 80%.

由于承包商仅关注第1项,因此需要考虑2个数字:1和1-不应计算"NaN"-(分别来自感兴趣的行).因此,总和是计数中的2,即2,即100%

Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%

提前谢谢!

推荐答案

import numpy as np
import pandas as pd

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
reference =  pd.DataFrame({'Position':['Contractor','President'],
                           'Item(s)':[(1,), (1,2)]})

longref = pd.DataFrame([('Item {}'.format(item), row['Position']) 
                        for index, row in reference.iterrows() 
                        for item in row['Item(s)']], columns=['Item', 'Position'])
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
melted = pd.melt(melted, id_vars=['Position'], 
                 value_vars=['Item 1','Item 2'], var_name='Item')
merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)

收益

     Position  Overall%
0  Contractor     100.0
1   President      80.0


说明: Hadley Wickham 上有一篇文章 ( PDF )凸显了优势 使数据整洁".主要原则是每一行应 代表一个观察",每一列代表一些因素或变量.


Explanation: There is an article by Hadley Wickham (PDF) propounding the advantages of making data "tidy". The main tenet is that each row should represent an "observation" and each column represent some factor or variable.

经常发现表达计算所需的工具 一旦数据整齐,它将很自然地就位. 这个问题的困难主要来自数据不整洁.

It frequently turns out that the tools you will need to express your calculation will fall into place quite naturally once the data is tidy. The difficulty of this problem largely comes from the data not being tidy.

考虑results:

In [405]: results
Out[405]: 
   Contractor  Item 1  Item 2  President
0           1     1.0       1          1
1           1     1.0       0          0
2           0     0.0       0          0
3           0     0.0       1          0
4           0     1.0       0          1
5           1     NaN       1          1

与其在ContractorPresident上没有单独的列,不如在Position上增加一列,因为Position是变量,并且每个观察或行可以为Position一个值会更好. >-ContractorPresident. 同样,应将Item 1Item 2合并为单个列Item:

Instead of having separate columns for Contractor and President, it would be nicer to have one column called Position, since Position is the variable, and each observation or row can have one value for Position -- either a Contractor or a President. Similarly, Item 1 and Item 2 should be coalesced into a single column Item:

In [416]: melted
Out[416]: 
      Position    Item  value
0   Contractor  Item 1    1.0
1   Contractor  Item 1    1.0
2   Contractor  Item 1    NaN
3    President  Item 1    1.0
4    President  Item 1    1.0
5    President  Item 1    NaN
6   Contractor  Item 2    1.0
7   Contractor  Item 2    0.0
8   Contractor  Item 2    1.0
9    President  Item 2    1.0
10   President  Item 2    0.0
11   President  Item 2    1.0

melted包含与results相同的信息,但格式整齐. value列包含results[['Item 1', 'Item 2']]中的值.每行对应一个观察值",其中results['Contractor']或result ['President']`等于1,因为计算逻辑仅需要这些值.

melted contains the same information as results, but in a tidy format. The value column contains the values in results[['Item 1', 'Item 2']]. Each row corresponds to an "observation" where either results['Contractor'] or result['President']` equals 1, since the calculation's logic only requires these values.

类似地,代替

In [407]: reference
Out[407]: 
  Item(s)    Position
0    (1,)  Contractor
1  (1, 2)   President

拥有一个列为ItemPosition的DataFrame会比较整齐:

it would be tidier to have a DataFrame whose columns are Item and Position:

In [408]: longref
Out[408]: 
     Item    Position
0  Item 1  Contractor
1  Item 1   President
2  Item 2   President

一旦您以meltedlongref的形式获得了整齐的数据, 计算所需的结果非常简单:

Once you have the tidy version of your data in the form of melted and longref, calculating the desired result is fairly straight-forward:

merged = pd.merge(longref, melted, how='left')
#      Item    Position  value
# 0  Item 1  Contractor    1.0
# 1  Item 1  Contractor    1.0
# 2  Item 1  Contractor    NaN
# 3  Item 1   President    1.0
# 4  Item 1   President    1.0
# 5  Item 1   President    NaN
# 6  Item 2   President    1.0
# 7  Item 2   President    0.0
# 8  Item 2   President    1.0

grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()


如何整理reference制作longref :


How to tidy-up reference to make longref:

只需遍历reference的行,并为每一行遍历项的元组以构建新的DataFrame,longref:

Just iterate through the rows of reference and for each row iterate through the tuple of items to build the new DataFrame, longref:

longref = pd.DataFrame([('Item {}'.format(item), row['Position']) 
                        for index, row in reference.iterrows() 
                        for item in row['Item(s)']], columns=['Item', 'Position'])


如何整理results以制作melted :


How to tidy-up results to make melted:

可以通过两次调用pd.melt来完成. pd.melt将宽"格式转换为长"格式DataFrame.它可以将多个列合并为一个列.例如,要将承包商"和总裁"列合并为一个职位"列,可以使用:

It can be done with two calls to pd.melt. pd.melt converts "wide" format to "long" format DataFrames. It can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into a single Position column you could use:

melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
#     Item 1  Item 2    Position  value
# 0      1.0       1  Contractor      1
# 1      1.0       0  Contractor      1
# 5      NaN       1  Contractor      1
# 6      1.0       1   President      1
# 10     1.0       0   President      1
# 11     NaN       1   President      1

并类似地,要将Item 1Item 2列合并为单个Item列,请使用:

and similarly, to coalesce the Item 1 and Item 2 columns into a single Item column, use:

melted = pd.melt(melted, id_vars=['Position'], 
                 value_vars=['Item 1','Item 2'], var_name='Item')
#       Position    Item  value
# 0   Contractor  Item 1    1.0
# 1   Contractor  Item 1    1.0
# 2   Contractor  Item 1    NaN
# 3    President  Item 1    1.0
# 4    President  Item 1    1.0
# 5    President  Item 1    NaN
# 6   Contractor  Item 2    1.0
# 7   Contractor  Item 2    0.0
# 8   Contractor  Item 2    1.0
# 9    President  Item 2    1.0
# 10   President  Item 2    0.0
# 11   President  Item 2    1.0

这篇关于带有花式堆叠的Pandas数据透视表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆