How to pivot a dataframe?

Views: 86 | Posted: 2020/11/21 0:08:52 | Tags: python pandas group-by pivot pandas-groupby


Question

What is pivot?
How do I pivot?
Is this a pivot?
Long format to wide format?

I've seen a lot of questions that ask about pivot tables. Even if the askers don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...

... But I'm going to give it a go.

The problem with existing questions and answers is that the question is often focused on a nuance that the OP has trouble generalizing, which keeps them from using the many good answers that already exist. However, none of those answers attempts a comprehensive explanation (because it's a daunting task).

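A few examples from my search:

How to pivot a dataframe in Pandas?

Good question and answer. But the answer only addresses the specific question, with little explanation.

pandas pivot table to data frame

In this question, the OP is concerned with the output of the pivot, namely how the columns look. The OP wanted it to look like R. This isn't very helpful for pandas users.

pandas pivoting a dataframe, duplicate rows

Another decent question, but the answer focuses on one method, namely pd.DataFrame.pivot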

So whenever someone searches for pivot, they get sporadic results that are likely not going to answer their specific question.

You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

import numpy as np
import pandas as pd
# np.char.add is the public name for numpy.core.defchararray.add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    np.char.add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Questions

Why do I get ValueError: Index contains duplicate entries, cannot reshape

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

 col   col0   col1   col2   col3  col4
 row                                  
 row0  0.77  0.605    NaN  0.860  0.65
 row2  0.13    NaN  0.395  0.500  0.25
 row3   NaN  0.310    NaN  0.545   NaN
 row4   NaN  0.100  0.395  0.760  0.24

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

 col   col0   col1   col2   col3  col4
 row                                  
 row0  0.77  0.605  0.000  0.860  0.65
 row2  0.13  0.000  0.395  0.500  0.25
 row3  0.00  0.310  0.000  0.545  0.00
 row4  0.00  0.100  0.395  0.760  0.24

Can I get something other than mean, like maybe sum?

 col   col0  col1  col2  col3  col4
 row                               
 row0  0.77  1.21  0.00  0.86  0.65
 row2  0.13  0.00  0.79  0.50  0.50
 row3  0.00  0.31  0.00  1.09  0.00
 row4  0.00  0.10  0.79  1.52  0.24

Can I do more than one aggregation at a time?

        sum                          mean                           
 col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
 row                                                                
 row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
 row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
 row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
 row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24

Can I aggregate over multiple value columns?

       val0                             val1                          
 col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
 row                                                                  
 row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
 row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
 row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
 row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

Can I subdivide by multiple columns?

 item item0             item1                         item2                   
 col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
 row                                                                          
 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
 row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
 row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
 row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

 item      item0             item1                         item2                  
 col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
 key  row                                                                         
 key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
      row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
      row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
      row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
 key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
      row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
      row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
      row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
      row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
      row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

 col   col0  col1  col2  col3  col4
 row                               
 row0     1     2     0     1     1
 row2     1     0     2     1     2
 row3     0     1     0     2     0
 row4     0     1     2     2     1

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})        
df2        
   A   B
0  a   0
1  a  11
2  a   2
3  a  11
4  b  10
5  b  10
6  b  14
7  c   7

The expected output should look something like

      a     b    c
0   0.0  10.0  7.0
1  11.0  10.0  NaN
2   2.0  14.0  NaN
3  11.0   NaN  NaN

How do I flatten the multiple index to a single index after pivot, so that the result looks like the following?

   1|1  2|1  2|2               
a    2    1    1
b    2    1    0
c    1    0    0

Answer

We start by answering the first question:

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are several methods that can perform a pivot. Some of them are not well suited to cases where the keys being pivoted on contain duplicates. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()
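
As a minimal sketch of the failure and one way around it, the snippet below uses a small hypothetical frame named toy (not the df from the question) in which one (row, col) pair occurs twice. It lists the offending rows, shows that pivot raises ValueError, and shows that an aggregating method such as pivot_table resolves the ambiguity.

```python
import pandas as pd

# Hypothetical toy data: the pair ('r0', 'c0') appears twice.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

# All rows whose (row, col) pair occurs more than once.
dupes = toy[toy.duplicated(['row', 'col'], keep=False)]
print(dupes)

# pivot cannot decide which of the two values goes into cell (r0, c0).
try:
    toy.pivot(index='row', columns='col', values='val')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)

# Aggregating resolves the ambiguity: the duplicated cell becomes a mean.
wide = toy.pivot_table(index='row', columns='col', values='val', aggfunc='mean')
print(wide)
```

Inspecting the duplicated rows first is usually worthwhile, because it tells you whether aggregating them is meaningful or whether the duplicates are a data problem.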

Here is a list of idioms we can use to pivot

pd.DataFrame.groupby + pd.DataFrame.unstack

Good general approach for doing just about any type of pivot.

You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to use for the aggregation. Finally, you unstack the levels that you want to be in the column index.
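
The three steps can be sketched as follows. The frame toy and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; ('r0', 'c0') occurs twice.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0, 7.0],
})

wide = (
    toy.groupby(['row', 'col'])['val']  # 1. group by every row and column key
       .mean()                          # 2. aggregate the remaining column
       .unstack('col')                  # 3. move the 'col' level into the columns
)
print(wide)
```

Combinations that never occur together, here ('r0', 'c1'), come out as NaN unless a fill_value is passed to unstack.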

pd.DataFrame.pivot_table

A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach the developers intended.

Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
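
A small sketch of how the four arguments line up, again on a hypothetical frame named toy. The groupby version is included only for side-by-side comparison.

```python
import pandas as pd

# Hypothetical toy data; ('r0', 'c0') occurs twice and gets averaged.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0, 7.0],
})

via_pivot_table = toy.pivot_table(
    index='row',     # row level
    columns='col',   # column level
    values='val',    # values to aggregate
    aggfunc='mean',  # aggregation function
)

# The same table built with the groupby idiom, for comparison.
via_groupby = toy.groupby(['row', 'col'])['val'].mean().unstack('col')

print(via_pivot_table)
print(via_groupby)
```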

pd.DataFrame.set_index + pd.DataFrame.unstack

Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.

Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
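
A minimal sketch of both outcomes, using hypothetical frames named unique and dupes: the idiom works when every (row, col) pair is unique and raises ValueError once a pair repeats.

```python
import pandas as pd

# Hypothetical toy data in which every (row, col) pair is unique.
unique = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c1', 'c0'],
    'val': [1.0, 2.0, 3.0],
})
wide = unique.set_index(['row', 'col'])['val'].unstack('col')
print(wide)

# Repeat the first row so that ('r0', 'c0') is no longer unique.
dupes = pd.concat([unique, unique.iloc[[0]]], ignore_index=True)
try:
    dupes.set_index(['row', 'col'])['val'].unstack('col')
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print(err)
```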

pd.DataFrame.pivot

Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. Historically it only took scalar values for index, columns, values (newer pandas releases also accept lists of labels).

Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate, and if either rows or columns are not unique, this method will fail.

pd.crosstab

This is a specialized version of pivot_table and, in its purest form, is the most intuitive way to perform several tasks.

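pd.factorize + np.bincount

This is a highly advanced technique that is very obscure but very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.

pd.get_dummies + pd.DataFrame.dot

I use this for cleverly performing cross tabulation.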

Examples

For each subsequent question, I'm going to answer it using pd.DataFrame.pivot_table first. Then I'll provide alternatives that perform the same task.

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

pd.DataFrame.pivot_table

fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2, as it's the same as this answer without the fill_value.

aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc='mean')

col   col0   col1   col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc='mean').fillna(0)

Can I get something other than mean, like maybe sum?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc='sum')

col   col0  col1  col2  col3  col4
row
row0  0.77  1.21  0.00  0.86  0.65
row2  0.13  0.00  0.79  0.50  0.50
row3  0.00  0.31  0.00  1.09  0.00
row4  0.00  0.10  0.79  1.52  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc='sum').fillna(0)

Can I do more than one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to use the string function names, since pandas can dispatch them to optimized implementations.
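
To illustrate the point about string names versus callables, here is a small sketch on a hypothetical frame named toy. The output column names total and average are made up for this example. Both calls produce the same numbers; only the way the aggregation is specified differs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0, 7.0],
})
grouped = toy.groupby(['row', 'col'])['val']

# String names: pandas maps these to its built-in groupby aggregations.
by_name = grouped.agg(total='sum', average='mean')

# Arbitrary callables: same result, computed by calling Python functions per group.
by_callable = grouped.agg(
    total=lambda s: s.sum(),
    average=lambda s: s.mean(),
)

print(by_name)
print(np.allclose(by_name.to_numpy(), by_callable.to_numpy()))
```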

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc=[np.size, np.mean])

      size                          mean
col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
row
row0     1     2     0     1     1  0.77  0.605  0.000  0.860  0.65
row2     1     0     2     1     2  0.13  0.000  0.395  0.500  0.25
row3     0     1     0     2     0  0.00  0.310  0.000  0.545  0.00
row4     0     1     2     2     1  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)

pd.crosstab

pd.crosstab(
    index=df['row'], columns=df['col'],
    values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

Can I aggregate over multiple value columns?

pd.DataFrame.pivot_table

We pass values=['val0', 'val1'], but we could've left that off completely.

df.pivot_table(
    values=['val0', 'val1'], index='row', columns='col',
    fill_value=0, aggfunc='mean')

      val0                             val1
col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
row
row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

pd.DataFrame.groupby

df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)

Can I subdivide by multiple columns?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index='row', columns=['item', 'col'],
    fill_value=0, aggfunc='mean')

item item0             item1                         item2
col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
row
row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

pd.DataFrame.groupby

df.groupby(
    ['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)

Can I subdivide by multiple columns?

pd.DataFrame.pivot_table

df.pivot_table(
    values='val0', index=['key', 'row'], columns=['item', 'col'],
    fill_value=0, aggfunc='mean')

item      item0             item1                         item2
col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
key  row
key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
     row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
     row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
     row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
     row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
     row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
     row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
     row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
     row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

pd.DataFrame.groupby

df.groupby(
    ['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)

pd.DataFrame.set_index

This works here because the set of keys is unique for both rows and columns.

df.set_index(
    ['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

pd.DataFrame.pivot_table

df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

col   col0  col1  col2  col3  col4
row
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1

pd.DataFrame.groupby

df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)

pd.crosstab

pd.crosstab(df['row'], df['col'])

pd.factorize + np.bincount

# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)

      col3  col2  col0  col1  col4
row3     2     0     0     1     0
row2     1     2     1     0     2
row0     1     0     1     2     1
row4     2     2     0     1     1
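
As a sanity check on the bin arithmetic, the sketch below repeats the technique on a small hypothetical frame named toy and compares the result with pd.crosstab. factorize numbers labels in order of first appearance, so the rows and columns are sorted before comparing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'row': ['r1', 'r0', 'r1', 'r0', 'r1'],
    'col': ['c0', 'c1', 'c0', 'c0', 'c1'],
})

i, r = pd.factorize(toy['row'])  # integer codes and unique row labels
j, c = pd.factorize(toy['col'])  # integer codes and unique column labels
n, m = r.size, c.size

# Each (i, j) pair maps to the flat bin i * m + j of an n-by-m grid.
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
fast = pd.DataFrame(counts, index=r, columns=c)

# crosstab sorts its labels, so sort before comparing.
fast_sorted = fast.sort_index().sort_index(axis=1)
reference = pd.crosstab(toy['row'], toy['col'])
print((fast_sorted.to_numpy() == reference.to_numpy()).all())
```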

pd.get_dummies

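# dtype=int keeps the indicators numeric; recent pandas versions return booleans by default
pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))

      col0  col1  col2  col3  col4
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1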

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

The first step is to assign a number to each row. This number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

df2.insert(0, 'count', df2.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

The second step is to use the newly created column as the index to call DataFrame.pivot.

df2.pivot(index='count', columns='A', values='B')

A         a     b    c
count
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN
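
The two steps can also be chained without modifying the original frame. This is a self-contained sketch in which the B values are copied by hand from the df2 printed in the question rather than regenerated from the random seed. Using assign instead of insert is a choice made for this example.

```python
import pandas as pd

# Values copied from the df2 shown in the question.
df2 = pd.DataFrame({
    'A': list('aaaabbbc'),
    'B': [0, 11, 2, 11, 10, 10, 14, 7],
})

wide = (
    df2.assign(count=df2.groupby('A').cumcount())  # position of each row within its group
       .pivot(index='count', columns='A', values='B')
)
print(wide)
```

Because cumcount restarts at 0 for each value of A, every (count, A) pair is unique, which is what lets pivot succeed here without any aggregation.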

Question 11

How do I flatten the multiple index to a single index after pivot?

If the columns are of object type, use string join:

df.columns = df.columns.map('|'.join)

Otherwise, use format:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
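
A minimal sketch of both variants on hypothetical data. The frame toy and the column names k1 and k2 are made up for this example. join only works when every level value is a string, while format converts each value for you.

```python
import pandas as pd

# Hypothetical toy data with string-valued column keys.
toy = pd.DataFrame({
    'row': ['a', 'a', 'b'],
    'k1': ['1', '2', '1'],
    'k2': ['1', '1', '1'],
    'val': [2, 1, 2],
})
wide = toy.pivot_table(index='row', columns=['k1', 'k2'],
                       values='val', aggfunc='sum', fill_value=0)

# String levels: each column label is a tuple of strings, so join works.
flat = wide.copy()
flat.columns = flat.columns.map('|'.join)
print(list(flat.columns))

# Integer levels: join would raise TypeError, so format the tuple instead.
toy_int = toy.astype({'k1': int, 'k2': int})
wide_int = toy_int.pivot_table(index='row', columns=['k1', 'k2'],
                               values='val', aggfunc='sum', fill_value=0)
flat_int = wide_int.copy()
flat_int.columns = flat_int.columns.map('{0[0]}|{0[1]}'.format)
print(list(flat_int.columns))
```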
