如何透视数据框? [英] How to pivot a dataframe?
问题描述
- 什么是数据透视?
- 我该如何旋转?
- 这是枢轴吗?
- 长格式到宽格式?
我已经看到很多有关数据透视表的问题.即使他们不知道他们在询问数据透视表,通常也是如此.几乎不可能写出涵盖枢纽各个方面的规范问答.
I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...
...但是我要尝试一下.
... But I'm going to give it a go.
现有问题和答案的问题在于,问题通常集中在OP难以推广的细微差别上,以便使用许多现有的良好答案.但是,没有一个答案试图给出全面的解释(因为这是一项艰巨的任务)
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)
从我的
- 在此问题中,OP与枢轴的输出有关.即列的外观. OP希望它看起来像R.这对熊猫用户不是很有帮助.
- 另一个不错的问题,但答案集中在一种方法上,即
pd.DataFrame.pivot
- Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
因此,每当有人搜索pivot
时,他们都会得到零星的结果,很可能无法回答他们的特定问题.
So whenever someone searches for pivot
they get sporadic results that are likely not going to answer their specific question.
您可能会注意到,我显眼地命名了我的列和相关的列值,以与我在下面的答案中的透视方式相对应.
You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
问题
-
为什么我会得到
ValueError: Index contains duplicate entries, cannot reshape
如何旋转df
以使col
值是列,row
值是索引,而val0
的均值是值?
How do I pivot df
such that the col
values are columns, row
values are the index, and mean of val0
are the values?
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
如何旋转df
,以使col
值是列,row
值是索引,val0
的均值是值,而缺少的值是0
?>
How do I pivot df
such that the col
values are columns, row
values are the index, mean of val0
are the values, and missing values are 0
?
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
我可以得到mean
以外的其他东西吗,例如sum
?
Can I get something other than mean
, like maybe sum
?
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
我可以一次完成多个聚合吗?
Can I do more that one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
我可以汇总多个值列吗?
Can I aggregate over multiple value columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
可以细分为多列吗?
Can Subdivide by multiple columns?
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
或
Or
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
我可以汇总列和行一起出现的频率,也就是交叉制表"吗?
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
如何通过仅旋转两列来将DataFrame从长转换为宽?已知,
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,
np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
df2
A B
0 a 0
1 a 11
2 a 2
3 a 11
4 b 10
5 b 10
6 b 14
7 c 7
预期效果应该类似于
a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
在pivot
来自
1 2
1 1 2
a 2 1 1
b 2 1 0
c 1 0 0
收件人
1|1 2|1 2|2
a 2 1 1
b 2 1 0
c 1 0 0
推荐答案
我们首先回答第一个问题:
We start by answering the first question:
为什么我会得到
ValueError: Index contains duplicate entries, cannot reshape
之所以会发生这种情况,是因为熊猫试图为具有重复条目的columns
或index
对象重新编制索引.有多种方法可以执行数据透视.当有人要求重复输入密钥时,其中某些方法不太适合.例如.考虑pd.DataFrame.pivot
.我知道有重复的条目共享row
和col
值:
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
所以当我pivot
使用
df.pivot(index='row', columns='col', values='val0')
我收到上面提到的错误.实际上,当我尝试使用以下命令执行相同的任务时,会出现相同的错误:
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
这是我们可以用来透视的成语列表
Here is a list of idioms we can use to pivot
-
pd.DataFrame.groupby
+pd.DataFrame.unstack
- 用于执行几乎所有类型的数据透视的良好通用方法
- 您指定一组将构成枢轴行级别和列级别的所有列.通过选择要聚合的其余列以及要执行聚合的功能,可以做到这一点.最后,
unstack
您要在列索引中显示的级别.
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
-
groupby
的美化版本,具有更直观的API.对于许多人来说,这是首选方法.这是开发人员想要的方法. - 指定行级别,列级别,要聚合的值以及执行聚合的功能.
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- 某些人方便直观(包括我自己).无法处理重复的分组密钥.
- 类似于
groupby
范例,我们指定最终将成为行或列级别的所有列,并将其设置为索引.然后,我们在列中unstack
我们想要的级别.如果其余索引级别或列级别都不唯一,则此方法将失败.
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
- 与
set_index
非常相似,因为它共享重复的密钥限制.该API也非常有限. 仅使用 - 类似于
pivot_table
方法,我们选择要在其上旋转的行,列和值.但是,我们无法聚合,并且如果行或列都不唯一,则此方法将失败.
index
,columns
,values
的标量值.
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- 这是
pivot_table
的专用版本,它以最纯粹的形式是执行多项任务的最直观的方式.
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- 这是一种非常先进的技术,它虽然晦涩难懂,但速度却很快.并非在所有情况下都可以使用它,但是只要可以使用它并且感觉舒适,您就会获得性能上的回报.
- 我用它来巧妙地进行交叉制表.
示例
对于每个后续答案和问题,我将使用pd.DataFrame.pivot_table
进行回答.然后,我将提供替代方法来执行相同的任务.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
如何旋转
df
以使col
值是列,row
值是索引,val0
的均值是值,而缺少的值是0
?
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
-
pd.DataFrame.pivot_table
-
默认情况下未设置
-
fill_value
.我倾向于适当地设置它.在这种情况下,我将其设置为0
.请注意,我跳过了问题2 ,因为它与没有fill_value
的答案相同
-
aggfunc='mean'
是默认设置,我不必设置它.我将其包括在内是为了明确.
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='mean') col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
-
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
我可以得到
mean
以外的其他东西,例如sum
吗?
Can I get something other than
mean
, like maybesum
?
-
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='sum') col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
-
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
-
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='sum').fillna(0)
-
pd.DataFrame.pivot_table
我一次可以做一次以上聚合吗?
Can I do more that one aggregation at a time?
请注意,对于pivot_table
和crosstab
,我需要传递可调用对象列表.另一方面,groupby.agg
能够为有限数量的特殊功能使用字符串. groupby.agg
也会采用我们传递给其他对象的相同的可调用对象,但是利用字符串函数名称通常会更有效,因为可以提高效率.
Notice that for pivot_table
and crosstab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
我可以汇总多个值列吗?
Can I aggregate over multiple value columns?
-
pd.DataFrame.pivot_table
我们通过了values=['val0', 'val1']
,但我们可以完全忽略它pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table( values=['val0', 'val1'], index='row', columns='col', fill_value=0, aggfunc='mean') val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
-
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
-
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
-
pd.DataFrame.groupby
df.groupby( ['row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
-
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index=['key', 'row'], columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
-
pd.DataFrame.groupby
df.groupby( ['key', 'row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
-
pd.DataFrame.set_index
,因为键集对于行和列都是唯一的 pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index( ['key', 'row', 'item', 'col'] )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
-
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size') col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
-
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
-
pd.crosstab
pd.crosstab(df['row'], df['col'])
-
pd.factorize
+np.bincount
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r` # for column `'row'` i, r = pd.factorize(df['row'].values) # get integer factorization `j` and unique values `c` # for column `'col'` j, c = pd.factorize(df['col'].values) # `n` will be the number of rows # `m` will be the number of columns n, m = r.size, c.size # `i * m + j` is a clever way of counting the # factorization bins assuming a flat array of length # `n * m`. Which is why we subsequently reshape as `(n, m)` b = np.bincount(i * m + j, minlength=n * m).reshape(n, m) # BTW, whenever I read this, I think 'Bean, Rice, and Cheese' pd.DataFrame(b, r, c) col3 col2 col0 col1 col4 row3 2 0 0 1 0 row2 1 2 1 0 2 row0 1 0 1 2 1 row4 2 2 0 1 1
-
pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col'])) col0 col1 col2 col3 col4 row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
可以细分为多列吗?
Can Subdivide by multiple columns?
可以细分为多列吗?
Can Subdivide by multiple columns?
我可以汇总列和行一起出现的频率,又称为交叉表"吗?
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
如何通过仅旋转两个将DataFrame从长转换为宽 列?
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?
第一步是为每行分配一个数字-该数字将成为透视结果中该值的行索引.使用
GroupBy.cumcount
:The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using
GroupBy.cumcount
:df2.insert(0, 'count', df.groupby('A').cumcount()) df2 count A B 0 0 a 0 1 1 a 11 2 2 a 2 3 3 a 11 4 0 b 10 5 1 b 10 6 2 b 14 7 0 c 7
The second step is to use the newly created column as the index to call
DataFrame.pivot
.df2.pivot(*df) # df.pivot(index='count', columns='A', values='B') A a b c count 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
问题11
如何在
pivot
如果
columns
使用字符串join
键入object
df.columns = df.columns.map('|'.join)
其他
format
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
这篇关于如何透视数据框?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!