带有花式堆叠的Pandas数据透视表 [英] Pandas Pivot Table with Fancy Stacking
问题描述
给出以下数据帧:
results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
'President':[1,0,0,0,1,1],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
results[['Contractor','President','Item 1','Item 2']]
results
Contractor President Item 1 Item 2
0 1 1 1 1
1 1 0 1 0
2 0 0 0 0
3 0 0 0 1
4 0 1 1 0
5 1 1 NaN 1
,以供参考(请参见下文):
and this for reference to items (see below):
Position Item(s)
0 Contractor 1
1 President 1,2
...我想透视数据以产生此结果:
...I would like to pivot the data to produce this:
Position Overall%
0 Contractor 100
1 President 80
...基于此逻辑:
由于总统关心项目1和2,因此需要考虑5个数字:项目1的(1和1)和项目2的(1、0和1). /strong>为4,项目总数为5(不计算"NaN" ),占80%.
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (do not count 'NaN'), which gives 80%.
由于承包商仅关注第1项,因此需要考虑2个数字:1和1-不应计算"NaN"-(分别来自感兴趣的行).因此,总和是计数中的2,即2,即100%
Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%
提前谢谢!
推荐答案
import numpy as np
import pandas as pd
results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
'President':[1,0,0,0,1,1],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
reference = pd.DataFrame({'Position':['Contractor','President'],
'Item(s)':[(1,), (1,2)]})
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)
收益
Position Overall%
0 Contractor 100.0
1 President 80.0
说明: Hadley Wickham 上有一篇文章 ( PDF )凸显了优势 使数据整洁".主要原则是每一行应 代表一个观察",每一列代表一些因素或变量.
Explanation: There is an article by Hadley Wickham (PDF) propounding the advantages of making data "tidy". The main tenet is that each row should represent an "observation" and each column represent some factor or variable.
经常发现表达计算所需的工具 一旦数据整齐,它将很自然地就位. 这个问题的困难主要来自数据不整洁.
It frequently turns out that the tools you will need to express your calculation will fall into place quite naturally once the data is tidy. The difficulty of this problem largely comes from the data not being tidy.
考虑results
:
In [405]: results
Out[405]:
Contractor Item 1 Item 2 President
0 1 1.0 1 1
1 1 1.0 0 0
2 0 0.0 0 0
3 0 0.0 1 0
4 0 1.0 0 1
5 1 NaN 1 1
与其在Contractor
和President
上没有单独的列,不如在Position
上增加一列,因为Position
是变量,并且每个观察或行可以为Position
一个值会更好. >-Contractor
或President
.
同样,应将Item 1
和Item 2
合并为单个列Item
:
Instead of having separate columns for Contractor
and President
, it would be nicer to have one column called Position
, since Position
is the variable, and each observation or row can have one value for Position
-- either a Contractor
or a President
.
Similarly, Item 1
and Item 2
should be coalesced into a single column Item
:
In [416]: melted
Out[416]:
Position Item value
0 Contractor Item 1 1.0
1 Contractor Item 1 1.0
2 Contractor Item 1 NaN
3 President Item 1 1.0
4 President Item 1 1.0
5 President Item 1 NaN
6 Contractor Item 2 1.0
7 Contractor Item 2 0.0
8 Contractor Item 2 1.0
9 President Item 2 1.0
10 President Item 2 0.0
11 President Item 2 1.0
melted
包含与results
相同的信息,但格式整齐. value
列包含results[['Item 1', 'Item 2']]
中的值.每行对应一个观察值",其中results['Contractor']
或result ['President']`等于1,因为计算逻辑仅需要这些值.
melted
contains the same information as results
, but in a tidy format. The value
column contains the values in results[['Item 1', 'Item 2']]
. Each row corresponds to an "observation" where either results['Contractor']
or result['President']` equals 1, since the calculation's logic only requires these values.
类似地,代替
In [407]: reference
Out[407]:
Item(s) Position
0 (1,) Contractor
1 (1, 2) President
拥有一个列为Item
和Position
的DataFrame会比较整齐:
it would be tidier to have a DataFrame whose columns are Item
and Position
:
In [408]: longref
Out[408]:
Item Position
0 Item 1 Contractor
1 Item 1 President
2 Item 2 President
一旦您以melted
和longref
的形式获得了整齐的数据,
计算所需的结果非常简单:
Once you have the tidy version of your data in the form of melted
and longref
,
calculating the desired result is fairly straight-forward:
merged = pd.merge(longref, melted, how='left')
# Item Position value
# 0 Item 1 Contractor 1.0
# 1 Item 1 Contractor 1.0
# 2 Item 1 Contractor NaN
# 3 Item 1 President 1.0
# 4 Item 1 President 1.0
# 5 Item 1 President NaN
# 6 Item 2 President 1.0
# 7 Item 2 President 0.0
# 8 Item 2 President 1.0
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
如何整理reference
制作longref
:
How to tidy-up reference
to make longref
:
只需遍历reference
的行,并为每一行遍历项的元组以构建新的DataFrame,longref
:
Just iterate through the rows of reference
and for each row iterate through the tuple of items to build the new DataFrame, longref
:
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
如何整理results
以制作melted
:
How to tidy-up results
to make melted
:
可以通过两次调用pd.melt
来完成. pd.melt
将宽"格式转换为长"格式DataFrame.它可以将多个列合并为一个列.例如,要将承包商"和总裁"列合并为一个职位"列,可以使用:
It can be done with two calls to pd.melt
. pd.melt
converts "wide" format to "long" format DataFrames. It can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into a single Position column you could use:
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
# Item 1 Item 2 Position value
# 0 1.0 1 Contractor 1
# 1 1.0 0 Contractor 1
# 5 NaN 1 Contractor 1
# 6 1.0 1 President 1
# 10 1.0 0 President 1
# 11 NaN 1 President 1
并类似地,要将Item 1
和Item 2
列合并为单个Item
列,请使用:
and similarly, to coalesce the Item 1
and Item 2
columns into a single Item
column, use:
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
# Position Item value
# 0 Contractor Item 1 1.0
# 1 Contractor Item 1 1.0
# 2 Contractor Item 1 NaN
# 3 President Item 1 1.0
# 4 President Item 1 1.0
# 5 President Item 1 NaN
# 6 Contractor Item 2 1.0
# 7 Contractor Item 2 0.0
# 8 Contractor Item 2 1.0
# 9 President Item 2 1.0
# 10 President Item 2 0.0
# 11 President Item 2 1.0
这篇关于带有花式堆叠的Pandas数据透视表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!