How to groupby and pivot a dataframe with non-numeric values


Problem description

I'm using Python, and I have a dataset with 6 columns: R, Rc, J, T, Ca and Cb. I need to "aggregate" on column R and then on column J, so that for each R, each row is a unique J. Rc is a characteristic of R. Ca and Cb are characteristics of T. The table below should make this clearer.

I need to go from the table on the left to the one on the right:

#______________________            ________________________________________________________________
#| R  Rc  J  T  Ca  Cb|           |# R  Rc  J  Ca(T=1)  Ca(T=2)  Ca(T=3)  Cb(T=1)  Cb(T=2)  Cb(T=3)|
#| a   p  1  1  x    d|           |# a  p   1    x         y        z        d        e        f   |
#| a   p  1  2  y    e|           |# b  o   1    w                           g                     |  
#| a   p  1  3  z    f|  ----->   |# b  o   2    v                           h                     | 
#| b   o  1  1  w    g|           |# b  o   3    s                           i                     |
#| b   o  2  1  v    h|           |# c  n   1    t         r                 j        k            |
#| b   o  3  1  s    i|           |# c  n   2    u                           l                     |
#| c   n  1  1  t    j|           |________________________________________________________________|
#| c   n  1  2  r    k|           
#| c   n  2  1  u    l|
#|____________________|

import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2], 
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1], 
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}

df = pd.DataFrame(data=data)

I don't want to lose the data in Rc, Ca, or Cb.

Rc (or each column that ends in 'c') is the same for each R, so it can simply be grouped together with R.

But Ca and Cb (or each column that starts with 'C') are unique for each T, so aggregating would lose them. They need to be saved in new columns instead: Ca(T=1) for T=1, Ca(T=2) for T=2, and Ca(T=3) for T=3. The same goes for Cb.

So, using T, I need to create one new column per value of T for each of Ca and Cb, and write the data from Ca and Cb into those new columns.

PS. If it helps, columns J and T each have an extra column with unique IDs.

J_ID = [1,1,1,2,3,4,5,5,6]
T_ID = [1,2,3,4,5,6,7,8,9]

What I tried so far:

(
    df.groupby(['R','J'])
    .apply(lambda x: x.Ca.tolist()).apply(pd.Series)
    .rename(columns=lambda x: f'Ca{x+1}')
    .reset_index()
)

Problem: this only works for one of the C columns at a time, and I lose Rc.

Any help would be greatly appreciated!

Recommended answer

You can use pivot_table with a lambda function as the aggfunc argument:

table = pd.pivot_table(
    df,
    index=['R', 'Rc', 'J'],
    values=['Ca', 'Cb'],
    columns=['T'],
    fill_value='',
    # Join the (string) values of each cell; works for non-numeric data
    aggfunc=lambda x: ''.join(str(v) for v in x),
).reset_index()


   R Rc  J Ca       Cb      
T           1  2  3  1  2  3
0  a  p  1  x  y  z  d  e  f
1  b  o  1  w        g      
2  b  o  2  v        h      
3  b  o  3  s        i      
4  c  n  1  t  r     j  k   
5  c  n  2  u        l      
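As a point of comparison, the same wide layout can probably be reached without an aggregation function at all. The sketch below swaps pivot_table for set_index followed by unstack; it is not the answer's method, and it assumes that every (R, Rc, J, T) combination appears at most once, as it does in the question's sample data. The name wide is only illustrative.

```python
import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2],
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1],
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}
df = pd.DataFrame(data)

# Assumption: each (R, Rc, J, T) combination is unique; with duplicates,
# unstack raises a ValueError instead of silently combining values.
wide = (
    df.set_index(['R', 'Rc', 'J', 'T'])[['Ca', 'Cb']]
      .unstack('T')   # move the T level from the row index into the columns
      .fillna('')     # blank out the (R, J) rows that have no value for a given T
)

print(wide)
```

Because nothing is aggregated, a duplicate key fails loudly here, whereas the lambda passed to pivot_table would concatenate the duplicated values into one string.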

Then you can flatten the MultiIndex columns and rename them as follows (taken from this great answer):

table.columns = ['%s%s' % (a, ' (T = %s)' % b if b else '') for a, b in table.columns]

   R Rc  J Ca (T = 1) Ca (T = 2) Ca (T = 3) Cb (T = 1) Cb (T = 2) Cb (T = 3)
0  a  p  1          x          y          z          d          e          f
1  b  o  1          w                                g                      
2  b  o  2          v                                h                      
3  b  o  3          s                                i                      
4  c  n  1          t          r                     j          k           
5  c  n  2          u                                l                      
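The two steps can be combined into one self-contained script. This is a minimal sketch rather than a definitive version: the helper flatten_t_columns is a hypothetical name introduced here, and it produces the Ca(T=1) naming from the question. It tests the second column level against the empty string explicitly instead of relying on truthiness, which is an assumption on my part that a value such as T = 0 should still get a suffix.

```python
import pandas as pd

data = {'R' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'Rc': ['p', 'p', 'p', 'o', 'o', 'o', 'n', 'n', 'n'],
        'J' : [1, 1, 1, 1, 2, 3, 1, 1, 2],
        'T' : [1, 2, 3, 1, 1, 1, 1, 2, 1],
        'Ca': ['x', 'y', 'z', 'w', 'v', 's', 't', 'r', 'u'],
        'Cb': ['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']}
df = pd.DataFrame(data)

# Pivot as in the answer: one column per (value column, T) pair.
table = pd.pivot_table(
    df,
    index=['R', 'Rc', 'J'],
    values=['Ca', 'Cb'],
    columns=['T'],
    fill_value='',
    aggfunc=lambda x: ''.join(str(v) for v in x),
).reset_index()


def flatten_t_columns(columns):
    """Hypothetical helper: turn (name, t) column tuples into flat labels.

    Index columns come back from reset_index with an empty second level,
    so they keep their plain name; all others become 'name(T=t)'.
    """
    return [f'{name}(T={t})' if t != '' else name for name, t in columns]


table.columns = flatten_t_columns(table.columns)
print(table)
```

The resulting frame has ordinary single-level column labels, so a column can be selected directly, for example table['Ca(T=2)'].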
