在groupby（）。最大（x）操作后切片原始df [英] slice original df after groupby().nlargest(x) operation

查看：230 发布时间：2018/5/30 13:54:03 python pandas group-by

本文介绍了在groupby（）。最大（x）操作后切片原始df的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

groupby（）和 nlargest（）存在问题此处和此处。我试图解决这些问题。注意：为简单起见，我使用了 nlargest（1），但是，它可以是任意数量的选择。

  {'city1'：{0：'Chicago'，
 1：'Chicago'，
 2：'芝加哥'，
 3：'芝加哥'，
 4：'迈阿密'，
 5：'休斯敦'，
 6：'奥斯汀'}，
'city2'：{0：'多伦多'，
 1：'底特律'，
 2：'St.Louis'，
 3：'迈阿密'，
 4： 'Dallas'，
 5：'Dallas'，
 6：'Dallas'}，
'p234_r_c'：{0：5.0,1：4.0,2：2.0,3：0.5， 4：1.0,5：4.0,6：3.0}，
'plant1_type'：{0：'COMBCYCL'，
 1：'COMBCYCL'，
 2：'NUKE'，
 3：'COAL'，
 4：'NUKE'，
 5：'COMBCYCL'，
 6：'COAL'}，
'plant2_type'：{0： '煤'，
 1：'煤'，
 2：'COMBCYCL'，
 3：'COMBCYCL'，
 4：'煤'，
 5： 'NUKE'，
 6：'NUKE'，}}

groupby city1 并返回从原始df中选择的行

  cols2 = ['city1'，  plant1_type'，'plant2_type'] 
 df.loc [df2.groupby（cols2）['p234_r_c']。nlargest（1）.reset_index（）。level_3] 
 
 city1 city2 p234_r_c plant1_type plant2_type 
 6 Austin Dallas 3.0 COAL NUKE 
 3芝加哥迈阿密0.5 COAL COMBCYCL 
 0芝加哥多伦多5.0 COMBCYCL煤炭
 2芝加哥圣路易斯2.0 NUKE COMBCYCL 
 5休斯顿达拉斯4.0 COMBCYCL NUKE 
 4迈阿密达拉斯1.0 NUKE COAL

以上看起来不错

b）groupby city2 并返回从原始df中选择的行

由于#A中使用的代码在尝试groupby city2 时会生成伪造结果，因此建议采取以下解决方法：

cols = ['city2'，'plant1_type'，'plant2_type'] df.set_index（cols）.groupby（level = cols）['p234_r_c '] .nlargest（1） city2 plant1_type plant2_type 多伦多COMBCYCL COAL 5.0 底特律COMBCYCL COAL 4.0 St.Louis NUKE COMBCYCL 2.0 迈阿密煤炭公司0.5 达拉斯NUKE COAL 1.0 COMBCYCL NUKE 4.0 COAL NUKE 3.0
现在我该如何使用这个结果来返回从原来的df中选择行，就像我在#A中所做的那样？

注意：原来的df有一个额外的行， groupby.nlargest（）对于 city2 有至少一个组的大小大于1的组， code> #A 可用于 #B 。
解决方案
除非我错过了一些东西（我同意这里有潜伏在熊猫代码中的错误），我们可以相对简单地绕过任何困难。

Met的hod＃1：使用 loc 和 idxmax ：

在[21]中：df.loc [df.groupby（cols2）[p234_r_c]。idxmax（）]
Out [21]：
city1 city2 p234_r_c plant1_type plant2_type
6奥斯汀达拉斯3.0 COAL NUKE
3芝加哥迈阿密0.5 COAL COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭
2芝加哥圣路易斯2.0 NUKE COMBCYCL
5休斯顿达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL

在[22]：df.loc [df.groupby（cols）[p234_r_c]。idxmax（）]
出[22]：
city1 city2 p234_r_c plant1_type plant2_type
6奥斯汀达拉斯3.0 COAL NUKE
5休斯敦达拉斯4.0 COMBCYCL NUKE
4迈阿密达拉斯1.0 NUKE COAL
1芝加哥底特律4.0 COMBCYCL煤炭
3芝加哥迈阿密0.5煤炭COMBCYCL
2 Chicago St.Louis 2.0 NUKE COMBCYCL
0芝加哥多伦多5.0 COMBCYCL煤炭

<方法2：按 p234_r_c 进行排序，并使用上一个：
In [17]：df.sort_values（p234_r_c）。groupby（cols2，as_index = False）.last（） Out [17]： city1 plant1_type plant2_type city2 p234_r_c 0 Austin COAL NUKE达拉斯3.0 1芝加哥COAL COMBCYCL迈阿密0.5 2芝加哥COMBCYCL COAL多伦多5.0 3芝加哥NUKE COMBCYCL St.Louis 2.0 4休斯敦COMBCYCL NUKE达拉斯4.0 迈阿密NUKE COAL达拉斯1.0 在[18]：df.sort_values（p234_r_c）。groupby（cols，as_index = False）。 last（） Out [18]： city2 plant1_type plant2_type city1 p234_r_c 0 Dallas Coal NUKE Austin 3.0 1达拉斯COMBCYCL NUKE休斯敦4.0 2达拉斯NUKE COAL迈阿密1.0 3底特律COMBCYCL COAL芝加哥4.0 4迈阿密煤炭COMBCYCL芝加哥0.5 5 St.Louis NUKE COMBCYCL芝加哥2.0 6 Toronto COMBCYCL COAL芝加哥5.0

如果你想要为了能够获得多种回应，而最小和最小的数据都被破坏了，我认为最简单的方法是排序然后使用头部或尾部。例如：

在[27]中：df.sort_values（p234_r_c）。groupby（cols，as_index = False）。尾巴（2）出库[27]： city1 city2 p234_r_c plant1_type plant2_type 3芝加哥迈阿密0.5煤矿开采 4迈阿密达拉斯1.0 NUKE煤矿 2芝加哥街路易斯2.0 NUKE COMBCYCL 6奥斯汀达拉斯3.0 COAL NUKE 1芝加哥底特律4.0 COMBCYCL煤炭 5休斯顿达拉斯4.0 COMBCYCL NUKE 0芝加哥多伦多5.0 COMBCYCL煤炭
Given the problems with groupby() and nlargest() as described here and here. I am trying to work around the problems.
Note: for simplicity I used nlargest(1), however, it could be any number of selections. {'city1': {0: 'Chicago', 1: 'Chicago', 2: 'Chicago', 3: 'Chicago', 4: 'Miami', 5: 'Houston', 6: 'Austin'}, 'city2': {0: 'Toronto', 1: 'Detroit', 2: 'St.Louis', 3: 'Miami', 4: 'Dallas', 5: 'Dallas', 6: 'Dallas'}, 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0}, 'plant1_type': {0: 'COMBCYCL', 1: 'COMBCYCL', 2: 'NUKE', 3: 'COAL', 4: 'NUKE', 5: 'COMBCYCL', 6: 'COAL'}, 'plant2_type': {0: 'COAL', 1: 'COAL', 2: 'COMBCYCL', 3: 'COMBCYCL', 4: 'COAL', 5: 'NUKE', 6: 'NUKE',}} A) groupby city1 and return rows selected from original df cols2 = ['city1','plant1_type','plant2_type'] df.loc[df2.groupby(cols2)['p234_r_c'].nlargest(1).reset_index().level_3] city1 city2 p234_r_c plant1_type plant2_type 6 Austin Dallas 3.0 COAL NUKE 3 Chicago Miami 0.5 COAL COMBCYCL 0 Chicago Toronto 5.0 COMBCYCL COAL 2 Chicago St.Louis 2.0 NUKE COMBCYCL 5 Houston Dallas 4.0 COMBCYCL NUKE 4 Miami Dallas 1.0 NUKE COAL The above looks good B) groupby city2 and return rows selected from original df Since the same code used in #A generates a bogus result if attempted for groupby of city2, a workaround was suggested to do the following: cols = ['city2','plant1_type','plant2_type'] df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1) city2 plant1_type plant2_type Toronto COMBCYCL COAL 5.0 Detroit COMBCYCL COAL 4.0 St.Louis NUKE COMBCYCL 2.0 Miami COAL COMBCYCL 0.5 Dallas NUKE COAL 1.0 COMBCYCL NUKE 4.0 COAL NUKE 3.0 Now how do I use this result to return the rows selected from the original df as I did in #A? Note: had the original df had an additional row that made the result of groupby.nlargest() for city2 have groups where at least one group had size greater than 1 then the code in #A can be used for #B. 解决方案 Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can bypass any difficulties relatively simply. Method #1: use loc and idxmax: In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()] Out[21]: city1 city2 p234_r_c plant1_type plant2_type 6 Austin Dallas 3.0 COAL NUKE 3 Chicago Miami 0.5 COAL COMBCYCL 0 Chicago Toronto 5.0 COMBCYCL COAL 2 Chicago St.Louis 2.0 NUKE COMBCYCL 5 Houston Dallas 4.0 COMBCYCL NUKE 4 Miami Dallas 1.0 NUKE COAL In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()] Out[22]: city1 city2 p234_r_c plant1_type plant2_type 6 Austin Dallas 3.0 COAL NUKE 5 Houston Dallas 4.0 COMBCYCL NUKE 4 Miami Dallas 1.0 NUKE COAL 1 Chicago Detroit 4.0 COMBCYCL COAL 3 Chicago Miami 0.5 COAL COMBCYCL 2 Chicago St.Louis 2.0 NUKE COMBCYCL 0 Chicago Toronto 5.0 COMBCYCL COAL Method #2: sort by p234_r_c and use last: In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last() Out[17]: city1 plant1_type plant2_type city2 p234_r_c 0 Austin COAL NUKE Dallas 3.0 1 Chicago COAL COMBCYCL Miami 0.5 2 Chicago COMBCYCL COAL Toronto 5.0 3 Chicago NUKE COMBCYCL St.Louis 2.0 4 Houston COMBCYCL NUKE Dallas 4.0 5 Miami NUKE COAL Dallas 1.0 In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last() Out[18]: city2 plant1_type plant2_type city1 p234_r_c 0 Dallas COAL NUKE Austin 3.0 1 Dallas COMBCYCL NUKE Houston 4.0 2 Dallas NUKE COAL Miami 1.0 3 Detroit COMBCYCL COAL Chicago 4.0 4 Miami COAL COMBCYCL Chicago 0.5 5 St.Louis NUKE COMBCYCL Chicago 2.0 6 Toronto COMBCYCL COAL Chicago 5.0 If you want to be able to get multiple responses as well, while nlargest and nsmallest are broken, I think it's simplest to sort and then use head or tail. For example: In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2) Out[27]: city1 city2 p234_r_c plant1_type plant2_type 3 Chicago Miami 0.5 COAL COMBCYCL 4 Miami Dallas 1.0 NUKE COAL 2 Chicago St.Louis 2.0 NUKE COMBCYCL 6 Austin Dallas 3.0 COAL NUKE 1 Chicago Detroit 4.0 COMBCYCL COAL 5 Houston Dallas 4.0 COMBCYCL NUKE 0 Chicago Toronto 5.0 COMBCYCL COAL 这篇关于在groupby（）。最大（x）操作后切片原始df的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在groupby（）。最大（x）操作后切片原始df [英] slice original df after groupby().nlargest(x) operation

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录关闭

在groupby（）。最大（x）操作后切片原始df [英] slice original df after groupby().nlargest(x) operation

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭