How to pivot a dataframe


Problem description



      • What is pivot?
      • How do I pivot?
      • Is this a pivot?
      • Long format to wide format?

      I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....

      ... But I'm going to give it a go.


      The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task).

      Here are a few examples from my Google search:

      1. How to pivot a dataframe in Pandas?
        • Good question and answer. But the answer only answers the specific question with little explanation.
      2. pandas pivot table to data frame
        • In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
      3. pandas pivoting a dataframe, duplicate rows
        • Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

      So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.


      Setup

      You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with which column names go where to get the results you're looking for.

      import numpy as np
      import pandas as pd
      
      # Seed the generator so the random frame below is reproducible
      np.random.seed([3,1415])
      n = 20
      
      cols = np.array(['key', 'row', 'item', 'col'])
      arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
      
      # np.char.add concatenates the strings element-wise, producing
      # labels such as 'key0', 'row3', 'item1', 'col3'
      df = pd.DataFrame(
          np.char.add(cols, arr1), columns=cols
      ).join(
          pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
      )
      print(df)
      
           key   row   item   col  val0  val1
      0   key0  row3  item1  col3  0.81  0.04
      1   key1  row2  item1  col2  0.44  0.07
      2   key1  row0  item1  col0  0.77  0.01
      3   key0  row4  item0  col2  0.15  0.59
      4   key1  row0  item2  col1  0.81  0.64
      5   key1  row2  item2  col4  0.13  0.88
      6   key2  row4  item1  col3  0.88  0.39
      7   key1  row4  item1  col1  0.10  0.07
      8   key1  row0  item2  col4  0.65  0.02
      9   key1  row2  item0  col2  0.35  0.61
      10  key2  row0  item2  col1  0.40  0.85
      11  key2  row4  item1  col2  0.64  0.25
      12  key0  row2  item2  col3  0.50  0.44
      13  key0  row4  item1  col4  0.24  0.46
      14  key1  row3  item2  col3  0.28  0.11
      15  key0  row3  item1  col1  0.31  0.23
      16  key0  row0  item2  col3  0.86  0.01
      17  key0  row4  item0  col3  0.64  0.21
      18  key2  row2  item2  col0  0.13  0.45
      19  key0  row2  item0  col4  0.37  0.70
      

      Question(s)

      1. Why do I get ValueError: Index contains duplicate entries, cannot reshape?

      2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

        col   col0   col1   col2   col3  col4
        row                                  
        row0  0.77  0.605    NaN  0.860  0.65
        row2  0.13    NaN  0.395  0.500  0.25
        row3   NaN  0.310    NaN  0.545   NaN
        row4   NaN  0.100  0.395  0.760  0.24
        

      3. How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

        col   col0   col1   col2   col3  col4
        row                                  
        row0  0.77  0.605  0.000  0.860  0.65
        row2  0.13  0.000  0.395  0.500  0.25
        row3  0.00  0.310  0.000  0.545  0.00
        row4  0.00  0.100  0.395  0.760  0.24
        

      4. Can I get something other than mean, like maybe sum?

        col   col0  col1  col2  col3  col4
        row                               
        row0  0.77  1.21  0.00  0.86  0.65
        row2  0.13  0.00  0.79  0.50  0.50
        row3  0.00  0.31  0.00  1.09  0.00
        row4  0.00  0.10  0.79  1.52  0.24
        

      5. Can I do more than one aggregation at a time?

               sum                          mean                           
        col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
        row                                                                
        row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
        row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
        row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
        row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
        

      6. Can I aggregate over multiple value columns?

              val0                             val1                          
        col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
        row                                                                  
        row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
        row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
        row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
        row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
        

      7. Can I subdivide by multiple columns?

        item item0             item1                         item2                   
        col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
        row                                                                          
        row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
        row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
        row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
        row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
        

      8. Or

        item      item0             item1                         item2                  
        col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
        key  row                                                                         
        key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
             row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
             row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
             row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
        key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
             row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
             row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
             row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
        key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
             row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
             row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
        

      9. Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

        col   col0  col1  col2  col3  col4
        row                               
        row0     1     2     0     1     1
        row2     1     0     2     1     2
        row3     0     1     0     2     0
        row4     0     1     2     2     1
        

      Solution

      We start by answering the first question:

      Question 1

      Why do I get ValueError: Index contains duplicate entries, cannot reshape?

      This occurs because pandas is attempting to reindex either a columns or an index object that contains duplicate entries. Several methods can perform a pivot, and some of them are not well suited to data with duplicates of the keys they are asked to pivot on. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

      df.duplicated(['row', 'col']).any()
      
      True
      

      So when I pivot using

      df.pivot(index='row', columns='col', values='val0')
      

      I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

      df.set_index(['row', 'col'])['val0'].unstack()
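
The failure is easier to see on a tiny frame. The following is a minimal sketch, assuming a hypothetical three-row frame named `toy` (not part of the setup above) in which one (row, col) pair occurs twice. It shows `pivot` raising the error and an aggregating method resolving the duplicates instead.

```python
import pandas as pd

# Hypothetical toy data: the ('r0', 'c0') pair appears twice.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0],
})

# True when at least one (row, col) combination is repeated.
has_dupes = toy.duplicated(['row', 'col']).any()

# pivot cannot decide which of the two values to keep, so it raises.
try:
    toy.pivot(index='row', columns='col', values='val0')
    pivot_error = None
except ValueError as err:
    pivot_error = str(err)

# pivot_table aggregates the duplicates (here with the mean) instead.
resolved = toy.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean')

print(has_dupes)
print(pivot_error)
print(resolved)
```

Checking `duplicated` on the intended index and column keys before pivoting is a cheap way to decide whether a non-aggregating method can be used at all.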
      

      Here is a list of idioms we can use to pivot:

      1. pd.DataFrame.groupby + pd.DataFrame.unstack
        • Good general approach for doing just about any type of pivot
        • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
      2. pd.DataFrame.pivot_table
        • A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach intended by the developers.
        • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
      3. pd.DataFrame.set_index + pd.DataFrame.unstack
        • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
        • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
      4. pd.DataFrame.pivot
        • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. When this was written it only took scalar values for index, columns, values (newer pandas releases also accept lists for index and columns).
        • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
      5. pd.crosstab
        • This is a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
      6. pd.factorize + np.bincount
        • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
      7. pd.get_dummies + pd.DataFrame.dot
        • I use this for cleverly performing cross tabulation.
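
To make the relationship between the first two idioms concrete, here is a small sketch, assuming a hypothetical frame named `toy` rather than the `df` from the setup. It computes the same mean pivot once with groupby + unstack and once with pivot_table.

```python
import pandas as pd

# Hypothetical toy data with one duplicated (row, col) pair.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c0', 'c1', 'c0'],
    'val0': [1.0, 3.0, 5.0, 7.0],
})

# Idiom 1: group on every future row/column level, aggregate,
# then move the 'col' level into the columns.
via_groupby = toy.groupby(['row', 'col'])['val0'].mean().unstack()

# Idiom 2: state rows, columns, values and aggregation in one call.
via_pivot_table = toy.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean')

print(via_groupby)
print(via_pivot_table)
```

Both results have `row` as the index and the `col` values as columns; the combination that never occurs, ('r0', 'c1'), is NaN in each.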


      Examples

      What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

      Question 3

      How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

      • pd.DataFrame.pivot_table

        • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value.
        • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

          df.pivot_table(
              values='val0', index='row', columns='col',
              fill_value=0, aggfunc='mean')
          
          col   col0   col1   col2   col3  col4
          row                                  
          row0  0.77  0.605  0.000  0.860  0.65
          row2  0.13  0.000  0.395  0.500  0.25
          row3  0.00  0.310  0.000  0.545  0.00
          row4  0.00  0.100  0.395  0.760  0.24
          

      • pd.DataFrame.groupby

        df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
        

      • pd.crosstab

        pd.crosstab(
            index=df['row'], columns=df['col'],
            values=df['val0'], aggfunc='mean').fillna(0)
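
The difference between question 2 and question 3 comes down to `fill_value`. A minimal sketch, assuming a hypothetical frame named `toy` in which some (row, col) combinations never occur:

```python
import pandas as pd

# Hypothetical toy data: ('r0', 'c1') and ('r1', 'c0') never occur.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0],
})

# Question 2 style: missing combinations stay NaN.
with_nan = toy.pivot_table(values='val0', index='row', columns='col')

# Question 3 style: missing combinations are replaced by 0.
with_zero = toy.pivot_table(
    values='val0', index='row', columns='col', fill_value=0)

print(with_nan)
print(with_zero)
```

Whether 0 is an appropriate fill depends on the aggregation: it is natural for sums and counts, but for a mean it can be mistaken for a real observation.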
        


      Question 4

      Can I get something other than mean, like maybe sum?

      • pd.DataFrame.pivot_table

        df.pivot_table(
            values='val0', index='row', columns='col',
            fill_value=0, aggfunc='sum')
        
        col   col0  col1  col2  col3  col4
        row                               
        row0  0.77  1.21  0.00  0.86  0.65
        row2  0.13  0.00  0.79  0.50  0.50
        row3  0.00  0.31  0.00  1.09  0.00
        row4  0.00  0.10  0.79  1.52  0.24
        

      • pd.DataFrame.groupby

        df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
        

      • pd.crosstab

        pd.crosstab(
            index=df['row'], columns=df['col'],
            values=df['val0'], aggfunc='sum').fillna(0)
        


      Question 5

      Can I do more than one aggregation at a time?

      Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to use the string function names.

      • pd.DataFrame.pivot_table

        df.pivot_table(
            values='val0', index='row', columns='col',
            fill_value=0, aggfunc=[np.size, np.mean])
        
             size                      mean                           
        col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
        row                                                           
        row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
        row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
        row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
        row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
        

      • pd.DataFrame.groupby

        df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
        

      • pd.crosstab

        pd.crosstab(
            index=df['row'], columns=df['col'],
            values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
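
As a hedged aside, in the pandas versions I am aware of, pivot_table also accepts a list of string function names, which avoids passing NumPy callables. The sketch below assumes a hypothetical frame named `toy` and uses sum and mean as the two aggregations.

```python
import pandas as pd

# Hypothetical toy data with one duplicated (row, col) pair.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0],
})

# groupby.agg with string names, then move 'col' into the columns.
by_groupby = (
    toy.groupby(['row', 'col'])['val0']
    .agg(['sum', 'mean'])
    .unstack(fill_value=0)
)

# pivot_table with the same string names.
by_pivot_table = toy.pivot_table(
    values='val0', index='row', columns='col',
    fill_value=0, aggfunc=['sum', 'mean'])

print(by_groupby)
print(by_pivot_table)
```

In both results the columns form a two-level index, with the aggregation name on the outer level and the `col` value on the inner level, so a single cell is addressed as `result.loc['r0', ('sum', 'c0')]`.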
        


      Question 6

      Can I aggregate over multiple value columns?

      • pd.DataFrame.pivot_table: we pass values=['val0', 'val1'], but we could have left that off completely

        df.pivot_table(
            values=['val0', 'val1'], index='row', columns='col',
            fill_value=0, aggfunc='mean')
        
              val0                             val1                          
        col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
        row                                                                  
        row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
        row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
        row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
        row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
        

      • pd.DataFrame.groupby

        df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
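
When several value columns are selected from a groupby, recent pandas releases expect a list of column names (double brackets). A minimal sketch on a hypothetical frame named `toy` with two value columns:

```python
import pandas as pd

# Hypothetical toy data with two value columns.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0],
    'val1': [10.0, 30.0, 50.0],
})

# Select both value columns with a list, aggregate, then unstack 'col'.
wide = (
    toy.groupby(['row', 'col'])[['val0', 'val1']]
    .mean()
    .unstack(fill_value=0)
)

print(wide)
```

The columns of `wide` carry the value column name on the outer level and the `col` value on the inner level, mirroring the pivot_table output above.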
        


      Question 7

      Can I subdivide by multiple columns?

      • pd.DataFrame.pivot_table

        df.pivot_table(
            values='val0', index='row', columns=['item', 'col'],
            fill_value=0, aggfunc='mean')
        
        item item0             item1                         item2                   
        col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
        row                                                                          
        row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
        row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
        row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
        row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
        

      • pd.DataFrame.groupby

        df.groupby(
            ['row', 'item', 'col']
        )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
        


      Question 8

      Can I subdivide by multiple columns?

      • pd.DataFrame.pivot_table

        df.pivot_table(
            values='val0', index=['key', 'row'], columns=['item', 'col'],
            fill_value=0, aggfunc='mean')
        
        item      item0             item1                         item2                  
        col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
        key  row                                                                         
        key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
             row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
             row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
             row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
        key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
             row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
             row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
             row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
        key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
             row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
             row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
        

      • pd.DataFrame.groupby

        df.groupby(
            ['key', 'row', 'item', 'col']
        )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
        

      • pd.DataFrame.set_index, because the set of keys is unique for both rows and columns

        df.set_index(
            ['key', 'row', 'item', 'col']
        )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
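
The set_index route only works because no key combination repeats, so it can help to check that first. A minimal sketch, assuming a hypothetical frame named `toy` whose (key, row, col) combinations are all distinct:

```python
import pandas as pd

# Hypothetical toy data: every (key, row, col) combination is distinct.
toy = pd.DataFrame({
    'key': ['k0', 'k0', 'k1'],
    'row': ['r0', 'r0', 'r0'],
    'col': ['c0', 'c1', 'c0'],
    'val0': [1.0, 2.0, 3.0],
})

keys = ['key', 'row', 'col']

# No aggregation happens below, so the keys must be unique.
is_unique = not toy.duplicated(keys).any()

# Move every key into the index, then pivot 'col' into the columns.
wide = (
    toy.set_index(keys)['val0']
    .unstack('col')
    .fillna(0)
    .sort_index(axis=1)
)

print(is_unique)
print(wide)
```

If `is_unique` were False, the unstack call would raise the duplicate-entries error from question 1, and one of the aggregating idioms would be needed instead.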
        


      Question 9

      Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

      • pd.DataFrame.pivot_table

        df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
        
        col   col0  col1  col2  col3  col4
        row                               
        row0     1     2     0     1     1
        row2     1     0     2     1     2
        row3     0     1     0     2     0
        row4     0     1     2     2     1
        

      • pd.DataFrame.groupby

        df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
        

      • pd.crosstab

        pd.crosstab(df['row'], df['col'])
        

      • pd.factorize + np.bincount

        # get integer factorization `i` and unique values `r`
        # for column `'row'`
        i, r = pd.factorize(df['row'].values)
        # get integer factorization `j` and unique values `c`
        # for column `'col'`
        j, c = pd.factorize(df['col'].values)
        # `n` will be the number of rows
        # `m` will be the number of columns
        n, m = r.size, c.size
        # `i * m + j` is a clever way of counting the 
        # factorization bins assuming a flat array of length
        # `n * m`.  Which is why we subsequently reshape as `(n, m)`
        b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
        # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
        pd.DataFrame(b, r, c)
        
              col3  col2  col0  col1  col4
        row3     2     0     0     1     0
        row2     1     2     1     0     2
        row0     1     0     1     2     1
        row4     2     2     0     1     1
        

      • pd.get_dummies

        pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))
        
              col0  col1  col2  col3  col4
        row0     1     2     0     1     1
        row2     1     0     2     1     2
        row3     0     1     0     2     0
        row4     0     1     2     2     1
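
The flat-index arithmetic behind the factorize + bincount idiom is easier to follow with numbers small enough to check by hand. The sketch below assumes a hypothetical four-row frame named `toy`, and compares the result with pd.crosstab and with the get_dummies product.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ('r1', 'c0') occurs twice.
toy = pd.DataFrame({
    'row': ['r1', 'r0', 'r1', 'r1'],
    'col': ['c0', 'c1', 'c0', 'c1'],
})

# Integer codes in order of first appearance, plus the unique labels.
i, r = pd.factorize(toy['row'].values)   # codes [0, 1, 0, 0]
j, c = pd.factorize(toy['col'].values)   # codes [0, 1, 0, 1]
n, m = r.size, c.size

# Cell (i, j) of an n-by-m table sits at position i * m + j
# once the table is flattened row by row.
flat = i * m + j                         # [0, 3, 0, 1]
counts = np.bincount(flat, minlength=n * m).reshape(n, m)
table = pd.DataFrame(counts, index=r, columns=c)

# The same counts from the two other idioms, for comparison.
by_crosstab = pd.crosstab(toy['row'], toy['col'])
by_dummies = pd.get_dummies(toy['row'], dtype=int).T.dot(
    pd.get_dummies(toy['col'], dtype=int))

print(table)
print(by_crosstab)
print(by_dummies)
```

Note that factorize keeps labels in order of first appearance, which is why the bincount table lists `r1` before `r0`, while crosstab and get_dummies sort the labels. Sorting the index and columns of `table` makes the three results line up.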
        
