如何旋转数据框? [英] How can I pivot a dataframe?

查看：32 发布时间：2021/11/30 22:59:26 python pandas group-by pivot pandas-groupby

本文介绍了如何旋转数据框?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

什么是枢轴?
我如何旋转?
这是一个支点吗?
长格式到宽格式?

我看过很多关于数据透视表的问题.即使他们不知道他们在询问数据透视表，他们通常也是.几乎不可能写出一个规范的问题和答案来涵盖旋转的所有方面......

I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...

...但我要试一试.

现有问题和答案的问题在于，问题通常集中在 OP 难以概括的细微差别上，以便使用许多现有的好答案.但是，没有一个答案试图给出全面的解释(因为这是一项艰巨的任务)

The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)

看看我的Google 搜索

如何在 Pandas 中透视数据框?

很好的问答.但答案只回答了具体问题，几乎没有解释.

pandas 数据透视表到数据框

在这个问题中，OP 与枢轴的输出有关.即列的外观.OP 希望它看起来像 R.这对 Pandas 用户不是很有帮助.

pandas 旋转数据框，重复行

另一个不错的问题，但答案集中在一种方法上，即 pd.DataFrame.pivot

因此，每当有人搜索 pivot 时，他们都会得到零星的结果，而这些结果可能不会回答他们的具体问题.

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.

您可能会注意到，我明显地命名了我的列和相关列值，以与我将在下面的答案中旋转的方式相对应.

You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

问题

为什么会出现ValueError:索引包含重复条目，无法重塑

我如何旋转 df 使得 col 值为列，row 值为索引，val0 是值吗?

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

 col   col0   col1   col2   col3  col4
 row
 row0  0.77  0.605    NaN  0.860  0.65
 row2  0.13    NaN  0.395  0.500  0.25
 row3   NaN  0.310    NaN  0.545   NaN
 row4   NaN  0.100  0.395  0.760  0.24

我如何旋转 df 使得 col 值为列，row 值为索引，val0 是值，缺失值是 0?

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

 col   col0   col1   col2   col3  col4
 row
 row0  0.77  0.605  0.000  0.860  0.65
 row2  0.13  0.000  0.395  0.500  0.25
 row3  0.00  0.310  0.000  0.545  0.00
 row4  0.00  0.100  0.395  0.760  0.24

我能得到除 mean 以外的其他东西，比如 sum 吗?

Can I get something other than mean, like maybe sum?

 col   col0  col1  col2  col3  col4
 row
 row0  0.77  1.21  0.00  0.86  0.65
 row2  0.13  0.00  0.79  0.50  0.50
 row3  0.00  0.31  0.00  1.09  0.00
 row4  0.00  0.10  0.79  1.52  0.24

我可以一次做多个聚合吗?

Can I do more that one aggregation at a time?

        sum                          mean
 col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
 row
 row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
 row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
 row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
 row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24

我可以聚合多个值列吗?

Can I aggregate over multiple value columns?

       val0                             val1
 col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
 row
 row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
 row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
 row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
 row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

可以按多列细分吗?

Can Subdivide by multiple columns?

 item item0             item1                         item2
 col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
 row
 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
 row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
 row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
 row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

 item      item0             item1                         item2
 col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
 key  row
 key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
      row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
      row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
      row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
 key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
      row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
      row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
      row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
      row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
      row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

我可以汇总列和行一起出现的频率，也就是交叉表"吗?

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

 col   col0  col1  col2  col3  col4
 row
 row0     1     2     0     1     1
 row2     1     0     2     1     2
 row3     0     1     0     2     0
 row4     0     1     2     2     1

如何通过仅以两列为中心将 DataFrame 从长转换为宽?鉴于，

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
df2
   A   B
0  a   0
1  a  11
2  a   2
3  a  11
4  b  10
5  b  10
6  b  14
7  c   7

预期应该看起来像

      a     b    c
0   0.0  10.0  7.0
1  11.0  10.0  NaN
2   2.0  14.0  NaN
3  11.0   NaN  NaN

如何在 pivot 之后将多个索引展平为单个索引?

How do I flatten the multiple index to single index after pivot?

来自

到

   1|1  2|1  2|2
a    2    1    1
b    2    1    0
c    1    0    0

示例

对于每个后续的答案和问题，我要做的是使用 pd.DataFrame.pivot_table.然后我将提供执行相同任务的替代方案.

Examples

What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

我如何旋转 df 使得 col 值为列，row 值为索引，val0<的平均值/code> 是值，缺失值是 0?



How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?
pd.DataFrame.pivot_table
fill_value 默认未设置.我倾向于适当地设置它.在这种情况下，我将其设置为 0.请注意，我跳过了问题 2，因为它与没有 fill_value

fill_value is not set by default.  I tend to set it appropriately.  In this case I set it to 0.  Notice I skipped question 2 as it's the same as this answer without the fill_value
aggfunc='mean' 是默认值，我不需要设置它.我包括它是为了明确.
aggfunc='mean' is the default and I didn't have to set it.  I included it to be explicit.
    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='mean')

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24


pd.DataFrame.groupby
  df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

pd.crosstab
  pd.crosstab(
      index=df['row'], columns=df['col'],
      values=df['val0'], aggfunc='mean').fillna(0)


我能得到除 mean 以外的东西，比如 sum 吗?

Can I get something other than mean, like maybe sum?
pd.DataFrame.pivot_table
  df.pivot_table(
      values='val0', index='row', columns='col',
      fill_value=0, aggfunc='sum')

  col   col0  col1  col2  col3  col4
  row
  row0  0.77  1.21  0.00  0.86  0.65
  row2  0.13  0.00  0.79  0.50  0.50
  row3  0.00  0.31  0.00  1.09  0.00
  row4  0.00  0.10  0.79  1.52  0.24

pd.DataFrame.groupby
  df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)

pd.crosstab
  pd.crosstab(
      index=df['row'], columns=df['col'],
      values=df['val0'], aggfunc='sum').fillna(0)


我可以一次做多个聚合吗?

Can I do more that one aggregation at a time?
请注意，对于 pivot_table 和 crosstab，我需要传递可调用列表.另一方面，groupby.agg 能够为有限数量的特殊函数获取字符串.groupby.agg 也会采用我们传递给其他函数的相同可调用对象，但利用字符串函数名称通常更有效，因为可以提高效率.
Notice that for pivot_table and crosstab I needed to pass list of callables.  On the other hand, groupby.agg is able to take strings for a limited number of special functions.  groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
  df.pivot_table(
      values='val0', index='row', columns='col',
      fill_value=0, aggfunc=[np.size, np.mean])

       size                      mean
  col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
  row
  row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
  row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
  row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
  row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby
  df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)

pd.crosstab
  pd.crosstab(
      index=df['row'], columns=df['col'],
      values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')


我可以聚合多个值列吗?

Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table 我们传递了 values=['val0', 'val1'] 但我们可以完全忽略它


pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely
  df.pivot_table(
      values=['val0', 'val1'], index='row', columns='col',
      fill_value=0, aggfunc='mean')

        val0                             val1
  col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
  row
  row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
  row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
  row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
  row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

pd.DataFrame.groupby
  df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)


可以按多列细分吗?
pd.DataFrame.pivot_table
  df.pivot_table(
      values='val0', index='row', columns=['item', 'col'],
      fill_value=0, aggfunc='mean')

  item item0             item1                         item2
  col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
  row
  row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
  row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
  row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
  row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

pd.DataFrame.groupby
  df.groupby(
      ['row', 'item', 'col']
  )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)


可以按多列细分吗?
pd.DataFrame.pivot_table
  df.pivot_table(
      values='val0', index=['key', 'row'], columns=['item', 'col'],
      fill_value=0, aggfunc='mean')

  item      item0             item1                         item2
  col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
  key  row
  key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
       row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
       row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
       row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
  key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
       row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
       row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
       row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
       row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
       row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

pd.DataFrame.groupby
  df.groupby(
      ['key', 'row', 'item', 'col']
  )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

pd.DataFrame.set_index 因为键集对于行和列都是唯一的


pd.DataFrame.set_index because the set of keys are unique for both rows and columns
  df.set_index(
      ['key', 'row', 'item', 'col']
  )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)


我可以汇总列和行一起出现的频率，也就是交叉表"吗?

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
  df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

      col   col0  col1  col2  col3  col4
  row
  row0     1     2     0     1     1
  row2     1     0     2     1     2
  row3     0     1     0     2     0
  row4     0     1     2     2     1

pd.DataFrame.groupby
  df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)

pd.crosstab
  pd.crosstab(df['row'], df['col'])

pd.factorize + np.bincount
  # get integer factorization `i` and unique values `r`
  # for column `'row'`
  i, r = pd.factorize(df['row'].values)
  # get integer factorization `j` and unique values `c`
  # for column `'col'`
  j, c = pd.factorize(df['col'].values)
  # `n` will be the number of rows
  # `m` will be the number of columns
  n, m = r.size, c.size
  # `i * m + j` is a clever way of counting the
  # factorization bins assuming a flat array of length
  # `n * m`.  Which is why we subsequently reshape as `(n, m)`
  b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
  # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
  pd.DataFrame(b, r, c)

        col3  col2  col0  col1  col4
  row3     2     0     0     1     0
  row2     1     2     1     0     2
  row0     1     0     1     2     1
  row4     2     2     0     1     1

pd.get_dummies
  pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))

        col0  col1  col2  col3  col4
  row0     1     2     0     1     1
  row2     1     0     2     1     2
  row3     0     1     0     2     0
  row4     0     1     2     2     1


如何通过仅以两个为轴将 DataFrame 从长转换为宽列?

How do I convert a DataFrame from long to wide by pivoting on ONLY two
columns?
DataFrame.pivot
第一步是为每一行分配一个数字——这个数字将是该值在旋转结果中的行索引.这是使用  完成的GroupBy.cumcount:
The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:
  df2.insert(0, 'count', df2.groupby('A').cumcount())
  df2

     count  A   B
  0      0  a   0
  1      1  a  11
  2      2  a   2
  3      3  a  11
  4      0  b  10
  5      1  b  10
  6      2  b  14
  7      0  c   7

第二步是使用新创建的列作为索引调用DataFrame.pivot.
The second step is to use the newly created column as the index to call DataFrame.pivot.
  df2.pivot(*df2)
  # df2.pivot(index='count', columns='A', values='B')

  A         a     b    c
  count
  0       0.0  10.0  7.0
  1      11.0  10.0  NaN
  2       2.0  14.0  NaN
  3      11.0   NaN  NaN

DataFrame.pivot_table
而 DataFrame.pivot 只接受列，DataFrame.pivot_table 也接受数组，所以 GroupBy.cumcount 可以作为 index 直接传递代码> 无需创建显式列.
Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.
  df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')

  A         a     b    c
  0       0.0  10.0  7.0
  1      11.0  10.0  NaN
  2       2.0  14.0  NaN
  3      11.0   NaN  NaN


如何在pivot
如果 columns 用字符串 join
df.columns = df.columns.map('|'.join)

else 格式
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)


                        这篇关于如何旋转数据框?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

如何旋转数据框? [英] How can I pivot a dataframe?

问题描述

问题

推荐答案

示例

Examples

相关文章

Python最新文章

热门教程

热门工具

登录关闭

如何旋转数据框? [英] How can I pivot a dataframe?

问题描述

问题

推荐答案

示例

Examples

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭