How to pivot a dataframe?


Question

  • What is pivot?
  • How do I pivot?
  • Is this a pivot?
  • Long format to wide format?

I've seen a lot of questions that ask about pivot tables. Even when the askers don't know they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...

... But I'm going to give it a go.

The problem with existing questions and answers is that the question often focuses on a nuance that the OP has trouble generalizing, which keeps them from using the many good answers that already exist. However, none of those answers attempts a comprehensive explanation (because it's a daunting task).

A few examples from my search for pivot:

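  1. How to pivot a dataframe in Pandas?
    • Good question and answer, but the answer comes with little explanation.
  2. pandas pivot table to data frame
    • In this question, the OP is concerned with the output of the pivot, namely how the columns look. The OP wanted it to look like R. This isn't very helpful for pandas users.
  3. pandas pivoting a dataframe, duplicate rows
    • Another decent question, but the answer focuses on one method, namely pd.DataFrame.pivot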
            So whenever someone searches for pivot, they get sporadic results that are likely not going to answer their specific question.

            You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

            import numpy as np
            import pandas as pd
            from numpy.char import add  # public path; numpy.core is a private module in newer NumPy
            
            np.random.seed([3,1415])
            n = 20
            
            cols = np.array(['key', 'row', 'item', 'col'])
            arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
            
            df = pd.DataFrame(
                add(cols, arr1), columns=cols
            ).join(
                pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
            )
            print(df)
            
                 key   row   item   col  val0  val1
            0   key0  row3  item1  col3  0.81  0.04
            1   key1  row2  item1  col2  0.44  0.07
            2   key1  row0  item1  col0  0.77  0.01
            3   key0  row4  item0  col2  0.15  0.59
            4   key1  row0  item2  col1  0.81  0.64
            5   key1  row2  item2  col4  0.13  0.88
            6   key2  row4  item1  col3  0.88  0.39
            7   key1  row4  item1  col1  0.10  0.07
            8   key1  row0  item2  col4  0.65  0.02
            9   key1  row2  item0  col2  0.35  0.61
            10  key2  row0  item2  col1  0.40  0.85
            11  key2  row4  item1  col2  0.64  0.25
            12  key0  row2  item2  col3  0.50  0.44
            13  key0  row4  item1  col4  0.24  0.46
            14  key1  row3  item2  col3  0.28  0.11
            15  key0  row3  item1  col1  0.31  0.23
            16  key0  row0  item2  col3  0.86  0.01
            17  key0  row4  item0  col3  0.64  0.21
            18  key2  row2  item2  col0  0.13  0.45
            19  key0  row2  item0  col4  0.37  0.70
            

            Questions

            1. Why do I get ValueError: Index contains duplicate entries, cannot reshape?

            2. How do I pivot df such that the col values are columns, row values are the index, and the mean of val0 are the values?

             col   col0   col1   col2   col3  col4
             row                                  
             row0  0.77  0.605    NaN  0.860  0.65
             row2  0.13    NaN  0.395  0.500  0.25
             row3   NaN  0.310    NaN  0.545   NaN
             row4   NaN  0.100  0.395  0.760  0.24
            

            3. How do I pivot df such that the col values are columns, row values are the index, the mean of val0 are the values, and missing values are 0?

             col   col0   col1   col2   col3  col4
             row                                  
             row0  0.77  0.605  0.000  0.860  0.65
             row2  0.13  0.000  0.395  0.500  0.25
             row3  0.00  0.310  0.000  0.545  0.00
             row4  0.00  0.100  0.395  0.760  0.24
            

            4. Can I get something other than mean, like maybe sum?

             col   col0  col1  col2  col3  col4
             row                               
             row0  0.77  1.21  0.00  0.86  0.65
             row2  0.13  0.00  0.79  0.50  0.50
             row3  0.00  0.31  0.00  1.09  0.00
             row4  0.00  0.10  0.79  1.52  0.24
            

            5. Can I do more than one aggregation at a time?

                    sum                          mean                           
             col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
             row                                                                
             row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
             row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
             row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
             row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
            

            6. Can I aggregate over multiple value columns?

                   val0                             val1                          
             col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
             row                                                                  
             row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
             row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
             row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
             row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
            

            7. Can I subdivide by multiple columns?

             item item0             item1                         item2                   
             col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
             row                                                                          
             row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
             row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
             row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
             row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
            

            8. Or

             item      item0             item1                         item2                  
             col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
             key  row                                                                         
             key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
                  row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
                  row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
                  row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
             key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
                  row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
                  row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
                  row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
             key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
                  row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
                  row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
            

            9. Can I aggregate the frequency with which the columns and rows occur together, aka "cross tabulation"?

             col   col0  col1  col2  col3  col4
             row                               
             row0     1     2     0     1     1
             row2     1     0     2     1     2
             row3     0     1     0     2     0
             row4     0     1     2     2     1
            

            10. How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

            np.random.seed([3, 1415])
            df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})        
            df2        
               A   B
            0  a   0
            1  a  11
            2  a   2
            3  a  11
            4  b  10
            5  b  10
            6  b  14
            7  c   7
            

            The expected result should look something like

                  a     b    c
            0   0.0  10.0  7.0
            1  11.0  10.0  NaN
            2   2.0  14.0  NaN
            3  11.0   NaN  NaN
            

            11. How do I flatten the multiple index to a single index after pivot?

            From

               1  2   
               1  1  2        
            a  2  1  1
            b  2  1  0
            c  1  0  0
            

            To

               1|1  2|1  2|2               
            a    2    1    1
            b    2    1    0
            c    1    0    0
            

            Answer

            We start by answering the first question:

            Why do I get ValueError: Index contains duplicate entries, cannot reshape?

            This occurs because pandas is attempting to reindex either a columns or an index object that has duplicate entries. Several methods can perform a pivot, and some of them are not well suited to cases where the keys being pivoted on contain duplicates. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

            df.duplicated(['row', 'col']).any()
            
            True
            

            So when I pivot using

            df.pivot(index='row', columns='col', values='val0')
            

            I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

            df.set_index(['row', 'col'])['val0'].unstack()
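
The failure described above can be reproduced on a much smaller frame. The sketch below is only an illustration: the three-row `toy` frame and its values are made up for the example and are not the `df` from the setup. It shows `pivot` raising on a repeated (row, col) pair and an aggregating call resolving it.

```python
import pandas as pd

# Hypothetical toy frame (not the df from the setup):
# the pair ('row0', 'col1') appears twice.
toy = pd.DataFrame({
    'row': ['row0', 'row0', 'row2'],
    'col': ['col1', 'col1', 'col0'],
    'val0': [0.81, 0.40, 0.13],
})

# Same check as in the text: are there repeated (row, col) pairs?
has_duplicates = bool(toy.duplicated(['row', 'col']).any())
print(has_duplicates)

# pivot cannot decide which of the two values belongs in the cell, so it raises.
pivot_error = None
try:
    toy.pivot(index='row', columns='col', values='val0')
except ValueError as err:
    pivot_error = str(err)
print(pivot_error)

# Aggregating the duplicates (here with the mean) removes the ambiguity.
resolved = toy.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean')
print(resolved)
```

Any method that aggregates before reshaping sidesteps the error in the same way; the mean is just one possible choice of aggregation.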
            

            Here is a list of idioms we can use to pivot:

            1. pd.DataFrame.groupby + pd.DataFrame.unstack
              • Good general approach for doing just about any type of pivot.
              • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to use for the aggregation. Finally, you unstack the levels that you want to be in the column index.
            2. pd.DataFrame.pivot_table
              • A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach intended by the developers.
              • Specify row levels, column levels, values to be aggregated, and function(s) to perform aggregations.
            3. pd.DataFrame.set_index + pd.DataFrame.unstack
              • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
              • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
            4. pd.DataFrame.pivot
              • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
              • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate, and if either rows or columns are not unique, this method will fail.
            5. pd.crosstab
              • This is a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
            6. A highly advanced technique that is a bit obscure but very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable with it, you will reap the performance rewards.
            7. A technique I use to cleverly perform cross tabulation.
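
To make the comparison between the named idioms concrete, here is a minimal sketch that runs five of them on the same input. The `toy` frame is an assumption made for the example; its (row, col) pairs are deliberately unique so that `set_index` + `unstack` and `pivot` also succeed.

```python
import pandas as pd

# Hypothetical toy frame with unique (row, col) pairs, so every idiom works.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c1', 'c0', 'c1'],
    'val': [1.0, 2.0, 3.0, 4.0],
})

# groupby + unstack: group on every key, aggregate, move 'col' to the columns.
via_groupby = toy.groupby(['row', 'col'])['val'].mean().unstack()

# pivot_table: the same operation expressed with index/columns/values.
via_pivot_table = toy.pivot_table(
    values='val', index='row', columns='col', aggfunc='mean')

# set_index + unstack: no aggregation, so the keys must be unique.
via_set_index = toy.set_index(['row', 'col'])['val'].unstack()

# pivot: also no aggregation, same uniqueness requirement.
via_pivot = toy.pivot(index='row', columns='col', values='val')

# crosstab: takes the key and value Series directly.
via_crosstab = pd.crosstab(
    index=toy['row'], columns=toy['col'],
    values=toy['val'], aggfunc='mean')

results = [via_groupby, via_pivot_table, via_set_index, via_pivot, via_crosstab]
for result in results:
    print(result)
```

With unique keys all five calls should produce the same 2x2 table; the differences only show up once the keys repeat or several aggregations are needed.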


            Examples

            For each subsequent question, I will answer using pd.DataFrame.pivot_table and then provide alternatives that perform the same task.

            How do I pivot df such that the col values are columns, row values are the index, the mean of val0 are the values, and missing values are 0?

            • pd.DataFrame.pivot_table

              • fill_value is not set by default. I tend to set it appropriately; in this case I set it to 0. Notice that I skipped question 2, as it's the same as this answer without the fill_value.
              • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

              df.pivot_table(
                  values='val0', index='row', columns='col',
                  fill_value=0, aggfunc='mean')
              
              col   col0   col1   col2   col3  col4
              row                                  
              row0  0.77  0.605  0.000  0.860  0.65
              row2  0.13  0.000  0.395  0.500  0.25
              row3  0.00  0.310  0.000  0.545  0.00
              row4  0.00  0.100  0.395  0.760  0.24
              

            • pd.DataFrame.groupby

              df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
              

            • pd.crosstab

              pd.crosstab(
                  index=df['row'], columns=df['col'],
                  values=df['val0'], aggfunc='mean').fillna(0)
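
The effect of fill_value is easiest to see on a frame where one (row, col) combination never occurs. The following is a small sketch under that assumption; the `toy` frame is invented for the illustration and is not the `df` from the setup.

```python
import pandas as pd

# Hypothetical toy frame: the combination ('row2', 'col1') never occurs.
toy = pd.DataFrame({
    'row': ['row0', 'row0', 'row2'],
    'col': ['col0', 'col1', 'col0'],
    'val0': [0.77, 0.81, 0.13],
})

# Without fill_value the missing combination shows up as NaN.
with_nan = toy.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean')

# With fill_value=0 the same cell holds 0 instead.
filled = toy.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)

# The groupby route fills the hole while unstacking.
filled_groupby = toy.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

print(with_nan)
print(filled)
print(filled_groupby)
```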
              

            • Can I get something other than mean, like maybe sum?

              • pd.DataFrame.pivot_table

                df.pivot_table(
                    values='val0', index='row', columns='col',
                    fill_value=0, aggfunc='sum')
                
                col   col0  col1  col2  col3  col4
                row                               
                row0  0.77  1.21  0.00  0.86  0.65
                row2  0.13  0.00  0.79  0.50  0.50
                row3  0.00  0.31  0.00  1.09  0.00
                row4  0.00  0.10  0.79  1.52  0.24
                

              • pd.DataFrame.groupby

                df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
                

              • pd.crosstab

                pd.crosstab(
                    index=df['row'], columns=df['col'],
                    values=df['val0'], aggfunc='sum').fillna(0)
                

              • Can I do more than one aggregation at a time?

                Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to use the string function names.
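
As a minimal sketch of the string-name form of groupby.agg mentioned above, the example below computes two aggregations in one pass and then unstacks them. The `toy` frame is hypothetical and exists only to keep the output small.

```python
import pandas as pd

# Hypothetical toy frame: ('row0', 'col1') occurs twice, ('row2', 'col0') once.
toy = pd.DataFrame({
    'row': ['row0', 'row0', 'row2'],
    'col': ['col1', 'col1', 'col0'],
    'val0': [0.81, 0.40, 0.13],
})

# String names select the built-in groupby implementations of each aggregation.
by_name = toy.groupby(['row', 'col'])['val0'].agg(['size', 'mean'])

# Unstacking 'col' gives one block of columns per aggregation.
wide = by_name.unstack(fill_value=0)

print(by_name)
print(wide)
```

The resulting columns form a two-level index, with the aggregation name on the outer level and the col values on the inner level.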

                • pd.DataFrame.pivot_table

                df.pivot_table(
                    values='val0', index='row', columns='col',
                    fill_value=0, aggfunc=[np.size, np.mean])
                
                     size                      mean                           
                col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
                row                                                           
                row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
                row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
                row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
                row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
                

              • pd.DataFrame.groupby

                df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
                

              • pd.crosstab

                pd.crosstab(
                    index=df['row'], columns=df['col'],
                    values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
                

              • Can I aggregate over multiple value columns?

                  • pd.DataFrame.pivot_table: we pass values=['val0', 'val1'], but we could have left that off completely

                    df.pivot_table(
                        values=['val0', 'val1'], index='row', columns='col',
                        fill_value=0, aggfunc='mean')
                    
                          val0                             val1                          
                    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
                    row                                                                  
                    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
                    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
                    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
                    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
                    

                  • pd.DataFrame.groupby

                    df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
                    

                  • Can I subdivide by multiple columns?

                    • pd.DataFrame.pivot_table

                      df.pivot_table(
                          values='val0', index='row', columns=['item', 'col'],
                          fill_value=0, aggfunc='mean')
                      
                      item item0             item1                         item2                   
                      col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
                      row                                                                          
                      row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
                      row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
                      row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
                      row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
                      

                    • pd.DataFrame.groupby

                      df.groupby(
                          ['row', 'item', 'col']
                      )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
                      

                    • Can I subdivide by multiple columns?

                      • pd.DataFrame.pivot_table

                        df.pivot_table(
                            values='val0', index=['key', 'row'], columns=['item', 'col'],
                            fill_value=0, aggfunc='mean')
                        
                        item      item0             item1                         item2                  
                        col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
                        key  row                                                                         
                        key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
                             row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
                             row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
                             row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
                        key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
                             row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
                             row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
                             row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
                        key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
                             row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
                             row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
                        

                      • pd.DataFrame.groupby

                        df.groupby(
                            ['key', 'row', 'item', 'col']
                        )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
                        

                      • pd.DataFrame.set_index, because the set of keys is unique for both rows and columns

                        df.set_index(
                            ['key', 'row', 'item', 'col']
                        )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
                        

                      • Can I aggregate the frequency with which the columns and rows occur together, aka "cross tabulation"?
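
One way to count how often each (row, col) pair occurs is sketched below. This is an illustration rather than the answer's own code; the `toy` frame is hypothetical, and only `pd.crosstab` and `groupby(...).size()` are used.

```python
import pandas as pd

# Hypothetical toy frame: ('row0', 'col1') occurs twice, ('row2', 'col0') once.
toy = pd.DataFrame({
    'row': ['row0', 'row0', 'row2'],
    'col': ['col1', 'col1', 'col0'],
})

# crosstab counts co-occurrences when no values/aggfunc are given.
counts = pd.crosstab(index=toy['row'], columns=toy['col'])

# The groupby equivalent: count the group sizes, then unstack 'col'.
counts_groupby = toy.groupby(['row', 'col']).size().unstack(fill_value=0)

print(counts)
print(counts_groupby)
```

Both results have row values as the index and col values as the columns, with 0 for pairs that never occur together.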
