使用 pandas 选择每个组的列中最大的N个 [英] select largest N of a column of each groupby group using pandas

查看:102
本文介绍了使用 pandas 选择每个组的列中最大的N个的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的df:

  {'city1':{0:'Chicago',
1:'Chicago ',
2:'芝加哥',
3:'芝加哥',
4:'迈阿密',
5:'休斯敦',
6:'奥斯汀'',
'city2':{0:'多伦多',
1:'底特律',
2:'圣路易斯',$ b $ 3:'迈阿密',
4:'Dallas',
5:'Dallas',
6:'Dallas'},
'p234_r_c':{0:5.0,1:4.0,2: 2.0,3:0.5,4:1.0,5:4.0,6:3.0},
'plant1_type':{0:'COMBCYCL',
1:'COMBCYCL',
2: 'NUKE',
3:'COAL',
4:'NUKE',
5:'COMBCYCL',
6:'COAL'},
' plant2_type':{0:'COAL',
1:'COAL',
2:'COMBCYCL',
3:'COMBCYCL',
4:'COAL',
5:'NUKE',
6:'NUKE'}}

我想要做2个groupby操作,并使用列 p234_r_c 来获取每个组中最大的1个。



1st groupby = ['plant1_type','plant2_type','city1']

2nd groupby = ['plant1_type','plant2_type','city2']



以下内容:

  df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c']。 \ 
nlargest(1).reset_index()


plant1_type plant2_type city1 level_3 p234_r_c
0 Coal COMBCYCL Chicago 3 0.5
1 COAL NUKE Austin 6 3.0
2 COMBCYCL COAL芝加哥0 5.0
3 COMBCYCL NUKE休斯敦5 4.0
4 NUKE COAL迈阿密4 1.0
5 NUKE COMBCYCL芝加哥2 2.0

第一组的结果是有意义的。不过,我对第二组的结果感到困惑:

  df.groupby(['plant1_type','plant2_type', 'city2'])['p234_r_c']。\ 
nlargest(1).reset_index()

index p234_r_c
0 0 5.0
1 1 4.0
2 2 2.0
3 3 0.5
4 4 1.0
5 5 4.0
6 6 3.0

plant1_type plant2_type 和<$ c发生了什么$ c> city2 在结果中?不应该出现在结果中,就像 plant1_type plant2_type city1 出现在第一组的结果中?

解决方案

我在这里添加了一个问题

理论:


pd.Series 上的 groupby c>返回相同的 pd.Series 值,然后返回原始索引。




下面的例子

  df = pd.DataFrame(dict(A = [0,1,2,3]) )

#返回与df.A
相同的结果print(df.groupby(df.A // 2).A.nsmallest(2))

#返回结果乱序
print(df.groupby(df.A // 2).A.nlargest(2))

0 0
1 1
2 2
3 3
名称:A,dtype:int64
A
0 1 1
0 0
1 3 3
2 2
名称:A,dtype:int64

我会说你想要这些返回相同的一致索引。



这是最令人震惊的结果:

 #最令人震惊的
#这将是随机不同的
print(df.groupby(df.A // 2).A.apply(pd.Series.sample,n = 2) )






在一次执行中返回此值

  A 
0 1 1
0 0
1 2 2
3 3
名称:A,dtype:int64

另外还有一个

  0 0 
1 1
2 2
3 3
名称:A,dtype:int64

当然,这永远不会有问题,因为不可能像原始的那样返回相同的值。 b

  print(df.groupby(df.A // 2).A.apply(pd.Series.sample,n = 1))

A
0 0 0
1 2 2
名称:A,dtype:int6 4






/ em>

set_index

  cols = ['plant1_type','plant2_type','city2'] 
df.set_index(cols).groupby(level = cols)['p234_r_c']。\
nlargest(1) .reset_index()

plant1_type plant2_type city2 p234_r_c
0 COMBCYCL COAL多伦多5.0
1 COMBCYCL COAL底特律4.0
2 NUKE COMBCYCL St.Louis 2.0
3 Coal COMBCYCL Miami 0.5
4 NUKE COAL达拉斯1.0
5 COMBCYCL NUKE达拉斯4.0
6 NUKE达拉斯3.0


My df:

{'city1': {0: 'Chicago',
  1: 'Chicago',
  2: 'Chicago',
  3: 'Chicago',
  4: 'Miami',
  5: 'Houston',
  6: 'Austin'},
 'city2': {0: 'Toronto',
  1: 'Detroit',
  2: 'St.Louis',
  3: 'Miami',
  4: 'Dallas',
  5: 'Dallas',
  6: 'Dallas'},
 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
 'plant1_type': {0: 'COMBCYCL',
  1: 'COMBCYCL',
  2: 'NUKE',
  3: 'COAL',
  4: 'NUKE',
  5: 'COMBCYCL',
  6: 'COAL'},
 'plant2_type': {0: 'COAL',
  1: 'COAL',
  2: 'COMBCYCL',
  3: 'COMBCYCL',
  4: 'COAL',
  5: 'NUKE',
  6: 'NUKE'}}

I want to do 2 groupby operations and take the largest 1 of each group using column p234_r_c.

1st groupby = ['plant1_type', 'plant2_type', 'city1']

2nd groupby = ['plant1_type', 'plant2_type', 'city2']

As such I do the following:

df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c'].\
    nlargest(1).reset_index()


plant1_type plant2_type city1   level_3 p234_r_c
0   COAL    COMBCYCL    Chicago 3   0.5
1   COAL    NUKE        Austin  6   3.0
2   COMBCYCL    COAL    Chicago 0   5.0
3   COMBCYCL    NUKE    Houston 5   4.0
4   NUKE    COAL        Miami   4   1.0
5   NUKE    COMBCYCL    Chicago 2   2.0

The result of the 1st groupby makes sense. However, I am confused by the result of the 2nd groupby:

df.groupby(['plant1_type','plant2_type','city2'])['p234_r_c'].\
    nlargest(1).reset_index()

index   p234_r_c
0   0   5.0
1   1   4.0
2   2   2.0
3   3   0.5
4   4   1.0
5   5   4.0
6   6   3.0

What happened to columns plant1_type, plant2_type and city2 in the result? Shouldnt they appear in the result just like how plant1_type, plant2_type and city1 appeared in the result of the 1st groupby?

解决方案

I added an issue here

Theory:

When the results of a groupby on a pd.Series returns the same pd.Series values, then the original index is returned.

Boiled down example

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

# returns results identical to df.A
print(df.groupby(df.A // 2).A.nsmallest(2))

# returns results out of order
print(df.groupby(df.A // 2).A.nlargest(2))

0    0
1    1
2    2
3    3
Name: A, dtype: int64
A   
0  1    1
   0    0
1  3    3
   2    2
Name: A, dtype: int64

I'd argue that you want these to return the same consistent index.

This is the most egregious consequence of this:

# most egregious
# this will be randomly different
print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2))


returns this on one execution

A   
0  1    1
   0    0
1  2    2
   3    3
Name: A, dtype: int64

And this on another

0    0
1    1
2    2
3    3
Name: A, dtype: int64

Of course this never has an issue because it's impossible to return the same values as the original

print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=1))

A   
0  0    0
1  2    2
Name: A, dtype: int64


Work around
set_index

cols = ['plant1_type','plant2_type','city2']
df.set_index(cols).groupby(level=cols)['p234_r_c'].\
    nlargest(1).reset_index()

  plant1_type plant2_type     city2  p234_r_c
0    COMBCYCL        COAL   Toronto       5.0
1    COMBCYCL        COAL   Detroit       4.0
2        NUKE    COMBCYCL  St.Louis       2.0
3        COAL    COMBCYCL     Miami       0.5
4        NUKE        COAL    Dallas       1.0
5    COMBCYCL        NUKE    Dallas       4.0
6        COAL        NUKE    Dallas       3.0

这篇关于使用 pandas 选择每个组的列中最大的N个的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆