在groupby()。最大(x)操作后切片原始df [英] slice original df after groupby().nlargest(x) operation
问题描述
groupby()和 nlargest()
存在问题此处和此处。我试图解决这些问题。注意:为简单起见,我使用了 nlargest(1)
,但是,它可以是任意数量的选择。
{'city1':{0:'Chicago',
1:'Chicago',
2:'芝加哥',
3:'芝加哥',
4:'迈阿密',
5:'休斯敦',
6:'奥斯汀'},
'city2':{0:'多伦多',
1:'底特律',
2:'St.Louis',
3:'迈阿密',
4: 'Dallas',
5:'Dallas',
6:'Dallas'},
'p234_r_c':{0:5.0,1:4.0,2:2.0,3:0.5, 4:1.0,5:4.0,6:3.0},
'plant1_type':{0:'COMBCYCL',
1:'COMBCYCL',
2:'NUKE',
3:'COAL',
4:'NUKE',
5:'COMBCYCL',
6:'COAL'},
'plant2_type':{0: '煤',
1:'煤',
2:'COMBCYCL',
3:'COMBCYCL',
4:'煤',
5: 'NUKE',
6:'NUKE',}}
cols2 = ['city1', plant1_type','plant2_type']
df.loc [df2.groupby(cols2)['p234_r_c']。nlargest(1).reset_index()。level_3]
city1 city2 p234_r_c plant1_type plant2_type
6 Austin Dallas 3.0 COAL NUKE
3芝加哥迈阿密0.5 COAL COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭
2芝加哥圣路易斯2.0 NUKE COMBCYCL
5休斯顿达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL
以上看起来不错
b)groupby
city2
并返回从原始df中选择的行由于#A中使用的代码在尝试groupby city2
时会生成伪造结果,因此建议采取以下解决方法:
cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level = cols)['p234_r_c '] .nlargest(1)
city2 plant1_type plant2_type
多伦多COMBCYCL COAL 5.0
底特律COMBCYCL COAL 4.0
St.Louis NUKE COMBCYCL 2.0
迈阿密煤炭公司0.5
达拉斯NUKE COAL 1.0
COMBCYCL NUKE 4.0
COAL NUKE 3.0
现在我该如何使用这个结果来返回从原来的df中选择行,就像我在#A中所做的那样?
注意:原来的df有一个额外的行, groupby.nlargest()
对于 city2
有至少一个组的大小大于1的组, code> #A 可用于 #B
。
除非我错过了一些东西(我同意这里有潜伏在熊猫代码中的错误),我们可以相对简单地绕过任何困难。
Met的hod#1:使用 loc
和 idxmax
:
在[21]中:df.loc [df.groupby(cols2)[p234_r_c]。idxmax()]
Out [21]:
city1 city2 p234_r_c plant1_type plant2_type
6奥斯汀达拉斯3.0 COAL NUKE
3芝加哥迈阿密0.5 COAL COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭
2芝加哥圣路易斯2.0 NUKE COMBCYCL
5休斯顿达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL
在[22]:df.loc [df.groupby(cols)[p234_r_c]。idxmax()]
出[22]:
city1 city2 p234_r_c plant1_type plant2_type
6奥斯汀达拉斯3.0 COAL NUKE
5休斯敦达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL
1芝加哥底特律4.0 COMBCYCL煤炭
3芝加哥迈阿密0.5煤炭COMBCYCL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭
<方法2:按 p234_r_c
进行排序,并使用上一个
:
In [17]:df.sort_values(p234_r_c)。groupby(cols2,as_index = False).last()
Out [17]:
city1 plant1_type plant2_type city2 p234_r_c
0 Austin COAL NUKE达拉斯3.0
1芝加哥COAL COMBCYCL迈阿密0.5
2芝加哥COMBCYCL COAL多伦多5.0
3芝加哥NUKE COMBCYCL St.Louis 2.0
4休斯敦COMBCYCL NUKE达拉斯4.0
迈阿密NUKE COAL达拉斯1.0
在[18]:df.sort_values(p234_r_c)。groupby(cols,as_index = False)。 last()
Out [18]:
city2 plant1_type plant2_type city1 p234_r_c
0 Dallas Coal NUKE Austin 3.0
1达拉斯COMBCYCL NUKE休斯敦4.0
2达拉斯NUKE COAL迈阿密1.0
3底特律COMBCYCL COAL芝加哥4.0
4迈阿密煤炭COMBCYCL芝加哥0.5
5 St.Louis NUKE COMBCYCL芝加哥2.0
6 Toronto COMBCYCL COAL芝加哥5.0
如果你想要为了能够获得多种回应,而最小和最小的数据都被破坏了,我认为最简单的方法是排序然后使用头部或尾部。例如:
在[27]中:df.sort_values(p234_r_c)。groupby(cols,as_index = False)。尾巴(2)
出库[27]:
city1 city2 p234_r_c plant1_type plant2_type
3芝加哥迈阿密0.5煤矿开采
4迈阿密达拉斯1.0 NUKE煤矿
2芝加哥街路易斯2.0 NUKE COMBCYCL
6奥斯汀达拉斯3.0 COAL NUKE
1芝加哥底特律4.0 COMBCYCL煤炭
5休斯顿达拉斯4.0 COMBCYCL NUKE
0芝加哥多伦多5.0 COMBCYCL煤炭
Given the problems with groupby()
and nlargest()
as described here and here. I am trying to work around the problems.
Note: for simplicity I used nlargest(1)
, however, it could be any number of selections.
{'city1': {0: 'Chicago',
1: 'Chicago',
2: 'Chicago',
3: 'Chicago',
4: 'Miami',
5: 'Houston',
6: 'Austin'},
'city2': {0: 'Toronto',
1: 'Detroit',
2: 'St.Louis',
3: 'Miami',
4: 'Dallas',
5: 'Dallas',
6: 'Dallas'},
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
'plant1_type': {0: 'COMBCYCL',
1: 'COMBCYCL',
2: 'NUKE',
3: 'COAL',
4: 'NUKE',
5: 'COMBCYCL',
6: 'COAL'},
'plant2_type': {0: 'COAL',
1: 'COAL',
2: 'COMBCYCL',
3: 'COMBCYCL',
4: 'COAL',
5: 'NUKE',
6: 'NUKE',}}
A) groupby city1
and return rows selected from original df
cols2 = ['city1','plant1_type','plant2_type']
df.loc[df2.groupby(cols2)['p234_r_c'].nlargest(1).reset_index().level_3]
city1 city2 p234_r_c plant1_type plant2_type
6 Austin Dallas 3.0 COAL NUKE
3 Chicago Miami 0.5 COAL COMBCYCL
0 Chicago Toronto 5.0 COMBCYCL COAL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
5 Houston Dallas 4.0 COMBCYCL NUKE
4 Miami Dallas 1.0 NUKE COAL
The above looks good
B) groupby city2
and return rows selected from original df
Since the same code used in #A generates a bogus result if attempted for groupby of city2
, a workaround was suggested to do the following:
cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)
city2 plant1_type plant2_type
Toronto COMBCYCL COAL 5.0
Detroit COMBCYCL COAL 4.0
St.Louis NUKE COMBCYCL 2.0
Miami COAL COMBCYCL 0.5
Dallas NUKE COAL 1.0
COMBCYCL NUKE 4.0
COAL NUKE 3.0
Now how do I use this result to return the rows selected from the original df as I did in #A?
Note: had the original df had an additional row that made the result of groupby.nlargest()
for city2
have groups where at least one group had size greater than 1 then the code in #A
can be used for #B
.
Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can bypass any difficulties relatively simply.
Method #1: use loc
and idxmax
:
In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()]
Out[21]:
city1 city2 p234_r_c plant1_type plant2_type
6 Austin Dallas 3.0 COAL NUKE
3 Chicago Miami 0.5 COAL COMBCYCL
0 Chicago Toronto 5.0 COMBCYCL COAL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
5 Houston Dallas 4.0 COMBCYCL NUKE
4 Miami Dallas 1.0 NUKE COAL
In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()]
Out[22]:
city1 city2 p234_r_c plant1_type plant2_type
6 Austin Dallas 3.0 COAL NUKE
5 Houston Dallas 4.0 COMBCYCL NUKE
4 Miami Dallas 1.0 NUKE COAL
1 Chicago Detroit 4.0 COMBCYCL COAL
3 Chicago Miami 0.5 COAL COMBCYCL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
0 Chicago Toronto 5.0 COMBCYCL COAL
Method #2: sort by p234_r_c
and use last
:
In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last()
Out[17]:
city1 plant1_type plant2_type city2 p234_r_c
0 Austin COAL NUKE Dallas 3.0
1 Chicago COAL COMBCYCL Miami 0.5
2 Chicago COMBCYCL COAL Toronto 5.0
3 Chicago NUKE COMBCYCL St.Louis 2.0
4 Houston COMBCYCL NUKE Dallas 4.0
5 Miami NUKE COAL Dallas 1.0
In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last()
Out[18]:
city2 plant1_type plant2_type city1 p234_r_c
0 Dallas COAL NUKE Austin 3.0
1 Dallas COMBCYCL NUKE Houston 4.0
2 Dallas NUKE COAL Miami 1.0
3 Detroit COMBCYCL COAL Chicago 4.0
4 Miami COAL COMBCYCL Chicago 0.5
5 St.Louis NUKE COMBCYCL Chicago 2.0
6 Toronto COMBCYCL COAL Chicago 5.0
If you want to be able to get multiple responses as well, while nlargest and nsmallest are broken, I think it's simplest to sort and then use head or tail. For example:
In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2)
Out[27]:
city1 city2 p234_r_c plant1_type plant2_type
3 Chicago Miami 0.5 COAL COMBCYCL
4 Miami Dallas 1.0 NUKE COAL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
6 Austin Dallas 3.0 COAL NUKE
1 Chicago Detroit 4.0 COMBCYCL COAL
5 Houston Dallas 4.0 COMBCYCL NUKE
0 Chicago Toronto 5.0 COMBCYCL COAL
这篇关于在groupby()。最大(x)操作后切片原始df的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!