Is grouping in a dataframe based on specific parameters possible using Python?

Question

When you have a large data set in Excel (xlsx, csv, or xls) with certain repeating values that you need to select, how do you do it? That is a very vague and broad way of putting it, so...

Take this example:

DataFrame1:

Name    No.       Comment
Bob     2123320   Doesn't Matter
Joe     2832883   Whatever
John    2139300   Irrelevant
Bob     2123320   Something
John    2234903   Regardless

DataFrame2:

Name    No.       Report
Bob     2123320   Great
Joe     2832883   Solid
John    2139300   Awesome
Bob     2123320   Good
John    2234903   Perfect

I am basically looking for a way to select only the names that appear with more than one distinct No., and then list those No.'s out like this:

Name    2139300      2139300   2234903      2234903
John    Irrelevant   Awesome   Regardless   Perfect

So basically, for each name it checks how many distinct No.'s that name has, and for each distinct No. it looks up the "Comment" and "Report" and then outputs an Excel sheet like the one above. Although Bob appears twice, he has the same No. both times, so he doesn't count; John is the only relevant person.

Is there a way to do this with Pandas once the data has been imported into dataframes, perhaps by using a dictionary that counts the No.'s for each name and then merging the dataframes?

Thanks a lot.

Recommended answer

I would do it like this:

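import pandas as pd

# df2 keeps its text in 'Report'; rename it so both frames share the 'Comment' column.
df_out = pd.concat([df1, df2.rename(columns={'Report': 'Comment'})])
# Keep only names with more than one distinct No., then turn each remaining row into a column.
df_out = (df_out[df_out.groupby(['Name'])['No.'].transform('nunique') > 1]
              .reset_index(drop=True)
              .set_index(['Name','No.'], append=True)['Comment']
              .unstack([0,2]))
# Drop the helper row-number level so only 'No.' remains as the column header.
df_out.columns = df_out.columns.droplevel(0)
df_out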

Output:

No.      2139300     2234903  2139300  2234903
Name                                          
John  Irrelevant  Regardless  Awesome  Perfect

Use reset_index to get a unique index per row, then append 'Name' and 'No.' to that index, unstack the new row-number level together with 'No.' to create a MultiIndex column header, and finally drop the top level of the column header.
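The steps described above can be traced one at a time. The following is a minimal sketch, assuming the two sample frames from the question are rebuilt by hand and that the 'Report' column of the second frame is renamed to 'Comment' so both frames share one text column. The intermediate names (stacked, kept, indexed, wide) are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Comment": ["Doesn't Matter", "Whatever", "Irrelevant", "Something", "Regardless"],
})
df2 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Report": ["Great", "Solid", "Awesome", "Good", "Perfect"],
})

# Assumption: put both text columns under one name so the frames stack cleanly.
stacked = pd.concat(
    [df1, df2.rename(columns={"Report": "Comment"})], ignore_index=True
)

# Step 1: keep only names that have more than one distinct No.
distinct_per_name = stacked.groupby("Name")["No."].transform("nunique")
kept = stacked[distinct_per_name > 1].reset_index(drop=True)

# Step 2: build a three-level index (row number, Name, No.) over the text column.
indexed = kept.set_index(["Name", "No."], append=True)["Comment"]

# Step 3: move the row number and No. into the columns, then drop the row number.
wide = indexed.unstack([0, 2])
wide.columns = wide.columns.droplevel(0)

print(wide)
```

With the sample data, only John survives step 1, because Bob's two rows share a single No. and Joe has only one row.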

To get rid of the index names and create a cleaner-looking dataframe, you can use:

df_out.rename_axis(None, axis=1).rename_axis(None)

Output:

         2139300     2234903  2139300  2234903
John  Irrelevant  Regardless  Awesome  Perfect
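The table requested in the question pairs each No.'s comment with its report, whereas the output above lists all comments first and all reports second. One possible way to get the paired layout is to merge the two frames, as the question suggests, and then pivot. This is only a sketch under stated assumptions: each (Name, No.) pair is unique once the single-number names are filtered out (true for the sample data, otherwise pivot raises on duplicates), and keep_multi_number_names, Field and Text are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Comment": ["Doesn't Matter", "Whatever", "Irrelevant", "Something", "Regardless"],
})
df2 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Report": ["Great", "Solid", "Awesome", "Good", "Perfect"],
})


def keep_multi_number_names(df):
    """Hypothetical helper: keep rows whose Name has more than one distinct No."""
    mask = df.groupby("Name")["No."].transform("nunique") > 1
    return df[mask]


# Filter first so repeated (Name, No.) pairs such as Bob's do not multiply in the merge.
merged = keep_multi_number_names(df1).merge(
    keep_multi_number_names(df2), on=["Name", "No."]
)

# Reshape to long form: one row per (Name, No., Field) with its text.
long_form = merged.melt(id_vars=["Name", "No."], var_name="Field", value_name="Text")

# Pivot so each No. carries its Comment and Report side by side.
wide = long_form.pivot(
    index="Name", columns=["No.", "Field"], values="Text"
).sort_index(axis=1)

print(wide)
```

The resulting columns form a two-level header (No., then Comment or Report), so each text stays labeled with the column it came from.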
