使用 pandas 选择每个组的列中最大的N个 [英] select largest N of a column of each groupby group using pandas

查看：102 发布时间：2018/5/30 13:49:25 python pandas group-by

本文介绍了使用 pandas 选择每个组的列中最大的N个的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我的df：

  {'city1'：{0：'Chicago'，
 1：'Chicago '，
 2：'芝加哥'，
 3：'芝加哥'，
 4：'迈阿密'，
 5：'休斯敦'，
 6：'奥斯汀''，
'city2'：{0：'多伦多'，
 1：'底特律'，
 2：'圣路易斯'，$ b $ 3：'迈阿密'， 
 4：'Dallas'，
 5：'Dallas'，
 6：'Dallas'}，
'p234_r_c'：{0：5.0，1：4.0，2： 2.0，3：0.5,4：1.0,5：4.0,6：3.0}，
'plant1_type'：{0：'COMBCYCL'，
 1：'COMBCYCL'，
 2： 'NUKE'，
 3：'COAL'，
 4：'NUKE'，
 5：'COMBCYCL'，
 6：'COAL'}，
' plant2_type'：{0：'COAL'，
 1：'COAL'，
 2：'COMBCYCL'，
 3：'COMBCYCL'，
 4：'COAL'， 
 5：'NUKE'，
 6：'NUKE'}}

我想要做2个groupby操作，并使用列 p234_r_c 来获取每个组中最大的1个。

1st groupby = ['plant1_type'，'plant2_type'，'city1']

2nd groupby = ['plant1_type'，'plant2_type'，'city2']

以下内容：

  df.groupby（['plant1_type'，'plant2_type'，'city1']）['p234_r_c']。 \ 
 nlargest（1）.reset_index（）
 
 
 plant1_type plant2_type city1 level_3 p234_r_c 
 0 Coal COMBCYCL Chicago 3 0.5 
 1 COAL NUKE Austin 6 3.0 
 2 COMBCYCL COAL芝加哥0 5.0 
 3 COMBCYCL NUKE休斯敦5 4.0 
 4 NUKE COAL迈阿密4 1.0 
 5 NUKE COMBCYCL芝加哥2 2.0

第一组的结果是有意义的。不过，我对第二组的结果感到困惑：

  df.groupby（['plant1_type'，'plant2_type'， 'city2']）['p234_r_c']。\ 
 nlargest（1）.reset_index（）
 
 index p234_r_c 
 0 0 5.0 
 1 1 4.0 
 2 2 2.0 
 3 3 0.5 
 4 4 1.0 
 5 5 4.0 
 6 6 3.0

列 plant1_type ， plant2_type 和<$ c发生了什么$ c> city2 在结果中？不应该出现在结果中，就像 plant1_type ， plant2_type 和 city1 出现在第一组的结果中？

解决方案

我在这里添加了一个问题

理论：

当 pd.Series 上的 groupby c>返回相同的 pd.Series 值，然后返回原始索引。

下面的例子

  df = pd.DataFrame（dict（A = [0,1,2,3]） ）
 
＃返回与df.A 
相同的结果print（df.groupby（df.A // 2）.A.nsmallest（2））
 
 ＃返回结果乱序
 print（df.groupby（df.A // 2）.A.nlargest（2））
 
 0 0 
 1 1 
 2 2 
 3 3 
名称：A，dtype：int64 
 A 
 0 1 1 
 0 0 
 1 3 3 
 2 2 
名称：A，dtype：int64

我会说你想要这些返回相同的一致索引。

这是最令人震惊的结果：＃最令人震惊的＃这将是随机不同的 print（df.groupby（df.A // 2）.A.apply（pd.Series.sample，n = 2））在一次执行中返回此值 A 0 1 1 0 0 1 2 2 3 3 名称：A，dtype：int64 另外还有一个 0 0 1 1 2 2 3 3 名称：A，dtype：int64 当然，这永远不会有问题，因为不可能像原始的那样返回相同的值。 b print（df.groupby（df.A // 2）.A.apply（pd.Series.sample，n = 1）） A 0 0 0 1 2 2 名称：A，dtype：int6 4 / em> set_index cols = ['plant1_type'，'plant2_type'，'city2'] df.set_index（cols）.groupby（level = cols）['p234_r_c']。\ nlargest（1） .reset_index（） plant1_type plant2_type city2 p234_r_c 0 COMBCYCL COAL多伦多5.0 1 COMBCYCL COAL底特律4.0 2 NUKE COMBCYCL St.Louis 2.0 3 Coal COMBCYCL Miami 0.5 4 NUKE COAL达拉斯1.0 5 COMBCYCL NUKE达拉斯4.0 6 NUKE达拉斯3.0 My df: {'city1': {0: 'Chicago', 1: 'Chicago', 2: 'Chicago', 3: 'Chicago', 4: 'Miami', 5: 'Houston', 6: 'Austin'}, 'city2': {0: 'Toronto', 1: 'Detroit', 2: 'St.Louis', 3: 'Miami', 4: 'Dallas', 5: 'Dallas', 6: 'Dallas'}, 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0}, 'plant1_type': {0: 'COMBCYCL', 1: 'COMBCYCL', 2: 'NUKE', 3: 'COAL', 4: 'NUKE', 5: 'COMBCYCL', 6: 'COAL'}, 'plant2_type': {0: 'COAL', 1: 'COAL', 2: 'COMBCYCL', 3: 'COMBCYCL', 4: 'COAL', 5: 'NUKE', 6: 'NUKE'}} I want to do 2 groupby operations and take the largest 1 of each group using column p234_r_c. 1st groupby = ['plant1_type', 'plant2_type', 'city1'] 2nd groupby = ['plant1_type', 'plant2_type', 'city2'] As such I do the following: df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c'].\ nlargest(1).reset_index() plant1_type plant2_type city1 level_3 p234_r_c 0 COAL COMBCYCL Chicago 3 0.5 1 COAL NUKE Austin 6 3.0 2 COMBCYCL COAL Chicago 0 5.0 3 COMBCYCL NUKE Houston 5 4.0 4 NUKE COAL Miami 4 1.0 5 NUKE COMBCYCL Chicago 2 2.0 The result of the 1st groupby makes sense. However, I am confused by the result of the 2nd groupby: df.groupby(['plant1_type','plant2_type','city2'])['p234_r_c'].\ nlargest(1).reset_index() index p234_r_c 0 0 5.0 1 1 4.0 2 2 2.0 3 3 0.5 4 4 1.0 5 5 4.0 6 6 3.0 What happened to columns plant1_type, plant2_type and city2 in the result? Shouldnt they appear in the result just like how plant1_type, plant2_type and city1 appeared in the result of the 1st groupby? 解决方案 I added an issue here Theory: When the results of a groupby on a pd.Series returns the same pd.Series values, then the original index is returned. Boiled down example df = pd.DataFrame(dict(A=[0, 1, 2, 3])) # returns results identical to df.A print(df.groupby(df.A // 2).A.nsmallest(2)) # returns results out of order print(df.groupby(df.A // 2).A.nlargest(2)) 0 0 1 1 2 2 3 3 Name: A, dtype: int64 A 0 1 1 0 0 1 3 3 2 2 Name: A, dtype: int64 I'd argue that you want these to return the same consistent index. This is the most egregious consequence of this: # most egregious # this will be randomly different print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2)) returns this on one execution A 0 1 1 0 0 1 2 2 3 3 Name: A, dtype: int64 And this on another 0 0 1 1 2 2 3 3 Name: A, dtype: int64 Of course this never has an issue because it's impossible to return the same values as the original print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=1)) A 0 0 0 1 2 2 Name: A, dtype: int64 Work around set_index cols = ['plant1_type','plant2_type','city2'] df.set_index(cols).groupby(level=cols)['p234_r_c'].\ nlargest(1).reset_index() plant1_type plant2_type city2 p234_r_c 0 COMBCYCL COAL Toronto 5.0 1 COMBCYCL COAL Detroit 4.0 2 NUKE COMBCYCL St.Louis 2.0 3 COAL COMBCYCL Miami 0.5 4 NUKE COAL Dallas 1.0 5 COMBCYCL NUKE Dallas 4.0 6 COAL NUKE Dallas 3.0 这篇关于使用 pandas 选择每个组的列中最大的N个的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

使用 pandas 选择每个组的列中最大的N个 [英] select largest N of a column of each groupby group using pandas

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录关闭

使用 pandas 选择每个组的列中最大的N个 [英] select largest N of a column of each groupby group using pandas

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭