Multiple Indexes for Dataframe Grouping

Question

I'll start with an example and then break down what is happening.

Here is a sample input:

DataFrame:

Name    No.        Test       Grade
Bob     2123320    Math       NaN
Joe     2832883    English    90
John    2139300    Science    85
Bob     2123320    History    93
John    2234903    Math       99

Desired output:

Name    2139300                     2234903
        Math   English   Science    Math   English   Science
John    0      0         85         99     0         0

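As the title suggests, I am trying to apply multiple indexes. It starts by looking at each name, and for each name it counts how many distinct No. values it has. Here the threshold is at least 2 distinct No. values, which is why only John is output and Joe and Bob are not.

Within each of these distinct No. values, I want to look only at a specific subset of tests, in this case {Math, English, Science}. For each of these tests, if the person took it under that No., there should be a grade, and I would like that grade to be output for that test. For the tests the person did not take under that No., I would like a simple marker instead (e.g. if the person only took Math on that day, output 0 for English and Science).

So in effect, it first filters people by their number of distinct No. values and groups them accordingly. It then indexes them by type of test (of which I only need a subset). Finally, it assigns each person the grade for each test they took and outputs 0 for the ones they did not.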

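This is similar to another question I asked earlier: Grouped Feature Matrix in Python #2- Follow Up

Except that now, instead of 1s and 0s, I have another column with actual values that I would like to output.

Thank you.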

EDIT: More sample input/output

Name    No.        Test       Grade
Bob     2123320    Math       NaN
Joe     2832883    English    90
John    2139300    Science    85
Bob     2123320    History    93
John    2234903    Math       99
Bob     2932848    English    99


Name    2139300           2234903           2123320           2932848
        M     E     S     M     E     S     M     E     S     M     E     S
John    0     0     85    99    0     0     NaN   NaN   NaN   NaN   NaN   NaN
Bob     NaN   NaN   NaN   NaN   NaN   NaN   86    0     0     0     99    0

Solution

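First, filter the dataframe down to only the records you are concerned with, i.e. names that have more than one distinct No.: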

df_out = df[df.groupby('Name')['No.'].transform('nunique') > 1]
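
To see what the filter keeps, here is a minimal sketch that rebuilds the sample data from the question's edit. The construction of `df` and the helper names `distinct_nos` and `kept_names` are assumptions made for illustration; the original post does not show how the dataframe was created.

```python
import numpy as np
import pandas as pd

# Sample data from the question's edit (Bob's Math grade is missing).
df = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John", "Bob"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903, 2932848],
    "Test": ["Math", "English", "Science", "History", "Math", "English"],
    "Grade": [np.nan, 90, 85, 93, 99, 99],
})

# Number of distinct No. values per Name, broadcast back onto every row.
distinct_nos = df.groupby("Name")["No."].transform("nunique")

# Keep only the rows whose Name has at least two distinct No. values.
df_out = df[distinct_nos > 1]

kept_names = sorted(df_out["Name"].unique())
print(kept_names)
```

With this data Joe is dropped because he has a single No., while Bob and John each have two and are kept.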

Now reshape the dataframe with set_index, unstack, and reindex:

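# Sum grades per (Name, No., Test), then move Test and No. into the columns.
(df_out.set_index(['Name','No.','Test'])['Grade']
       .groupby(level=[0,1,2]).sum()   # sum(level=...) was removed in pandas 2.0
       .unstack(-1, fill_value=0)
       .reindex(['Math','English','Science'], axis=1, fill_value=0)
       .unstack(-1, fill_value=0).swaplevel(0, 1, axis=1)
       .sort_index(axis=1))            # axis must be passed by keyword in pandas 2.0+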

Output:

No.  2123320              2139300              2234903              2932848             
Test English Math Science English Math Science English Math Science English Math Science
Name                                                                                    
Bob        0    0       0       0    0       0       0    0       0      99    0       0
John       0    0       0       0    0      85       0   99       0       0    0       0
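
Note that sorting the columns puts the tests in alphabetical order (English, Math, Science) rather than the Math, English, Science order shown in the question. If that order matters, one possible variation is to build the full set of (No., Test) column pairs explicitly and reindex onto it. The sketch below is an assumption-laden illustration, not the answer's own method: the sample `df`, the `tests` list, and the names `grades`, `columns`, and `wide` are all introduced here.

```python
import numpy as np
import pandas as pd

# Sample data from the question's edit (Bob's Math grade is missing).
df = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John", "Bob"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903, 2932848],
    "Test": ["Math", "English", "Science", "History", "Math", "English"],
    "Grade": [np.nan, 90, 85, 93, 99, 99],
})

tests = ["Math", "English", "Science"]  # subset of tests to report, in display order

# Step 1: keep names with at least two distinct No. values.
df_out = df[df.groupby("Name")["No."].transform("nunique") > 1]

# Step 2: sum grades per (Name, No., Test). NaN grades are skipped,
# so a group that contains only NaN sums to 0.
grades = df_out.groupby(["Name", "No.", "Test"])["Grade"].sum()

# Step 3: move No. and Test into the columns; absent combinations become NaN.
wide = grades.unstack(["No.", "Test"])

# Step 4: reindex onto every (No., Test) pair in the chosen order.
# This drops tests outside the subset (History) and adds the missing pairs.
columns = pd.MultiIndex.from_product(
    [sorted(df_out["No."].unique()), tests], names=["No.", "Test"]
)
wide = wide.reindex(columns=columns).fillna(0).astype(int)

print(wide)
```

As in the output above, a missing grade (Bob's Math under 2123320) and a test that was never taken both end up as 0, so this approach cannot tell the two cases apart.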
