在groupby()。最大(x)操作后切片原始df [英] slice original df after groupby().nlargest(x) operation

查看:230
本文介绍了在groupby()。最大(x)操作后切片原始df的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

groupby()和 nlargest()存在问题此处此处。我试图解决这些问题。注意:为简单起见,我使用了 nlargest(1),但是,它可以是任意数量的选择。

  {'city1':{0:'Chicago',
1:'Chicago',
2:'芝加哥',
3:'芝加哥',
4:'迈阿密',
5:'休斯敦',
6:'奥斯汀'},
'city2':{0:'多伦多',
1:'底特律',
2:'St.Louis',
3:'迈阿密',
4: 'Dallas',
5:'Dallas',
6:'Dallas'},
'p234_r_c':{0:5.0,1:4.0,2:2.0,3:0.5, 4:1.0,5:4.0,6:3.0},
'plant1_type':{0:'COMBCYCL',
1:'COMBCYCL',
2:'NUKE',
3:'COAL',
4:'NUKE',
5:'COMBCYCL',
6:'COAL'},
'plant2_type':{0: '煤',
1:'煤',
2:'COMBCYCL',
3:'COMBCYCL',
4:'煤',
5: 'NUKE',
6:'NUKE',}}



groupby city1 并返回从原始df中选择的行

  cols2 = ['city1',  plant1_type','plant2_type'] 
df.loc [df2.groupby(cols2)['p234_r_c']。nlargest(1).reset_index()。level_3]

city1 city2 p234_r_c plant1_type plant2_type
6 Austin Dallas 3.0 COAL NUKE
3芝加哥迈阿密0.5 COAL COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭
2芝加哥圣路易斯2.0 NUKE COMBCYCL
5休斯顿达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL

以上看起来不错



b)groupby city2 并返回从原始df中选择的行

由于#A中使用的代码在尝试groupby city2 时会生成伪造结果,因此建议采取以下解决方法:

  cols = ['city2','plant1_type','plant2_type'] 
df.set_index(cols).groupby(level = cols)['p234_r_c '] .nlargest(1)


city2 plant1_type plant2_type
多伦多COMBCYCL COAL 5.0
底特律COMBCYCL COAL 4.0
St.Louis NUKE COMBCYCL 2.0
迈阿密煤炭公司0.5
达拉斯NUKE COAL 1.0
COMBCYCL NUKE 4.0
COAL NUKE 3.0

现在我该如何使用这个结果来返回从原来的df中选择行,就像我在#A中所做的那样?



注意:原来的df有一个额外的行, groupby.nlargest()对于 city2 有至少一个组的大小大于1的组, code> #A 可用于 #B

解决方案

除非我错过了一些东西(我同意这里有潜伏在熊猫代码中的错误),我们可以相对简单地绕过任何困难。



Met的hod#1:使用 loc idxmax

在[21]中:df.loc [df.groupby(cols2)[p234_r_c]。idxmax()] 
Out [21]:
city1 city2 p234_r_c plant1_type plant2_type
6奥斯汀达拉斯3.0 COAL NUKE
3芝加哥迈阿密0.5 COAL COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭
2芝加哥圣路易斯2.0 NUKE COMBCYCL
5休斯顿达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL

在[22]:df.loc [df.groupby(cols)[p234_r_c]。idxmax()]
出[22]:
city1 city2 p234_r_c plant1_type plant2_type
6奥斯汀达拉斯3.0 COAL NUKE
5休斯敦达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL
1芝加哥底特律4.0 COMBCYCL煤炭
3芝加哥迈阿密0.5煤炭COMBCYCL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭



<方法2:按 p234_r_c 进行排序,并使用上一个

  In [17]:df.sort_values(p234_r_c)。groupby(cols2,as_index = False).last()
Out [17]:
city1 plant1_type plant2_type city2 p234_r_c
0 Austin COAL NUKE达拉斯3.0
1芝加哥COAL COMBCYCL迈阿密0.5
2芝加哥COMBCYCL COAL多伦多5.0
3芝加哥NUKE COMBCYCL St.Louis 2.0
4休斯敦COMBCYCL NUKE达拉斯4.0
迈阿密NUKE COAL达拉斯1.0

在[18]:df.sort_values(p234_r_c)。groupby(cols,as_index = False)。 last()
Out [18]:
city2 plant1_type plant2_type city1 p234_r_c
0 Dallas Coal NUKE Austin 3.0
1达拉斯COMBCYCL NUKE休斯敦4.0
2达拉斯NUKE COAL迈阿密1.0
3底特律COMBCYCL COAL芝加哥4.0
4迈阿密煤炭COMBCYCL芝加哥0.5
5 St.Louis NUKE COMBCYCL芝加哥2.0
6 Toronto COMBCYCL COAL芝加哥5.0






如果你想要为了能够获得多种回应,而最小和最小的数据都被破坏了,我认为最简单的方法是排序然后使用头部或尾部。例如:

 在[27]中:df.sort_values(p234_r_c)。groupby(cols,as_index = False)。尾巴(2)
出库[27]:
city1 city2 p234_r_c plant1_type plant2_type
3芝加哥迈阿密0.5煤矿开采
4迈阿密达拉斯1.0 NUKE煤矿
2芝加哥街路易斯2.0 NUKE COMBCYCL
6奥斯汀达拉斯3.0 COAL NUKE
1芝加哥底特律4.0 COMBCYCL煤炭
5休斯顿达拉斯4.0 COMBCYCL NUKE
0芝加哥多伦多5.0 COMBCYCL煤炭


Given the problems with groupby() and nlargest() as described here and here. I am trying to work around the problems.

Note: for simplicity I used nlargest(1), however, it could be any number of selections.

{'city1': {0: 'Chicago',
  1: 'Chicago',
  2: 'Chicago',
  3: 'Chicago',
  4: 'Miami',
  5: 'Houston',
  6: 'Austin'},
 'city2': {0: 'Toronto',
  1: 'Detroit',
  2: 'St.Louis',
  3: 'Miami',
  4: 'Dallas',
  5: 'Dallas',
  6: 'Dallas'},
 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
 'plant1_type': {0: 'COMBCYCL',
  1: 'COMBCYCL',
  2: 'NUKE',
  3: 'COAL',
  4: 'NUKE',
  5: 'COMBCYCL',
  6: 'COAL'},
 'plant2_type': {0: 'COAL',
  1: 'COAL',
  2: 'COMBCYCL',
  3: 'COMBCYCL',
  4: 'COAL',
  5: 'NUKE',
  6: 'NUKE',}}

A) groupby city1 and return rows selected from original df

cols2 = ['city1','plant1_type','plant2_type']
df.loc[df2.groupby(cols2)['p234_r_c'].nlargest(1).reset_index().level_3]

    city1   city2   p234_r_c    plant1_type plant2_type
6   Austin  Dallas       3.0    COAL        NUKE
3   Chicago Miami        0.5    COAL        COMBCYCL
0   Chicago Toronto      5.0    COMBCYCL    COAL
2   Chicago St.Louis     2.0    NUKE        COMBCYCL
5   Houston Dallas       4.0    COMBCYCL    NUKE
4   Miami   Dallas       1.0    NUKE        COAL

The above looks good

B) groupby city2 and return rows selected from original df

Since the same code used in #A generates a bogus result if attempted for groupby of city2, a workaround was suggested to do the following:

cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)


city2     plant1_type  plant2_type
Toronto   COMBCYCL     COAL           5.0
Detroit   COMBCYCL     COAL           4.0
St.Louis  NUKE         COMBCYCL       2.0
Miami     COAL         COMBCYCL       0.5
Dallas    NUKE         COAL           1.0
          COMBCYCL     NUKE           4.0
          COAL         NUKE           3.0

Now how do I use this result to return the rows selected from the original df as I did in #A?

Note: had the original df had an additional row that made the result of groupby.nlargest() for city2 have groups where at least one group had size greater than 1 then the code in #A can be used for #B.

解决方案

Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can bypass any difficulties relatively simply.

Method #1: use loc and idxmax:

In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()]
Out[21]: 
     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
3  Chicago     Miami       0.5        COAL    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL

In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()]
Out[22]: 
     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL
1  Chicago   Detroit       4.0    COMBCYCL        COAL
3  Chicago     Miami       0.5        COAL    COMBCYCL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL

Method #2: sort by p234_r_c and use last:

In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last()
Out[17]: 
     city1 plant1_type plant2_type     city2  p234_r_c
0   Austin        COAL        NUKE    Dallas       3.0
1  Chicago        COAL    COMBCYCL     Miami       0.5
2  Chicago    COMBCYCL        COAL   Toronto       5.0
3  Chicago        NUKE    COMBCYCL  St.Louis       2.0
4  Houston    COMBCYCL        NUKE    Dallas       4.0
5    Miami        NUKE        COAL    Dallas       1.0

In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last()
Out[18]: 
      city2 plant1_type plant2_type    city1  p234_r_c
0    Dallas        COAL        NUKE   Austin       3.0
1    Dallas    COMBCYCL        NUKE  Houston       4.0
2    Dallas        NUKE        COAL    Miami       1.0
3   Detroit    COMBCYCL        COAL  Chicago       4.0
4     Miami        COAL    COMBCYCL  Chicago       0.5
5  St.Louis        NUKE    COMBCYCL  Chicago       2.0
6   Toronto    COMBCYCL        COAL  Chicago       5.0


If you want to be able to get multiple responses as well, while nlargest and nsmallest are broken, I think it's simplest to sort and then use head or tail. For example:

In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2)
Out[27]: 
     city1     city2  p234_r_c plant1_type plant2_type
3  Chicago     Miami       0.5        COAL    COMBCYCL
4    Miami    Dallas       1.0        NUKE        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
6   Austin    Dallas       3.0        COAL        NUKE
1  Chicago   Detroit       4.0    COMBCYCL        COAL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
0  Chicago   Toronto       5.0    COMBCYCL        COAL

这篇关于在groupby()。最大(x)操作后切片原始df的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆