More bizarre results using groupby and nlargest() in pandas


Problem description

This question is an extension of the following post: select largest N of a column of each groupby group using pandas

Let's use the same df and the workaround proposed in the selected answer. Basically, I am trying to do two groupby operations and select the largest N values of each group. However, as you can see below, one of the operations raises an error.

Given that the original post uncovered a bug in pandas (see here), I am wondering whether this is another bug or another manifestation of the same bug.

Unfortunately, I am at a standstill in my work until these issues are resolved. Can we kindly get some attention on this matter? I can't offer a bounty until tomorrow.

df:

{'city1': {0: 'Chicago',
  1: 'Chicago',
  2: 'Chicago',
  3: 'Chicago',
  4: 'Miami',
  5: 'Houston',
  6: 'Austin'},
 'city2': {0: 'Toronto',
  1: 'Detroit',
  2: 'St.Louis',
  3: 'Miami',
  4: 'Dallas',
  5: 'Dallas',
  6: 'Dallas'},
 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
 'plant1_type': {0: 'COMBCYCL',
  1: 'COMBCYCL',
  2: 'NUKE',
  3: 'COAL',
  4: 'NUKE',
  5: 'COMBCYCL',
  6: 'COAL'},
 'plant2_type': {0: 'COAL',
  1: 'COAL',
  2: 'COMBCYCL',
  3: 'COMBCYCL',
  4: 'COAL',
  5: 'NUKE',
  6: 'NUKE'}}

You can generate the df from the dict above (assigned to dct): pd.DataFrame(dct)
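For convenience, here is a minimal, self-contained sketch of that construction. The variable name `dct` is the one used in the sentence above; everything else is copied from the dict shown in the question.

```python
import pandas as pd

# The dict from the question, keyed by column name and then by row label.
dct = {
    "city1": {0: "Chicago", 1: "Chicago", 2: "Chicago", 3: "Chicago",
              4: "Miami", 5: "Houston", 6: "Austin"},
    "city2": {0: "Toronto", 1: "Detroit", 2: "St.Louis", 3: "Miami",
              4: "Dallas", 5: "Dallas", 6: "Dallas"},
    "p234_r_c": {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
    "plant1_type": {0: "COMBCYCL", 1: "COMBCYCL", 2: "NUKE", 3: "COAL",
                    4: "NUKE", 5: "COMBCYCL", 6: "COAL"},
    "plant2_type": {0: "COAL", 1: "COAL", 2: "COMBCYCL", 3: "COMBCYCL",
                    4: "COAL", 5: "NUKE", 6: "NUKE"},
}

# Outer keys become columns, inner keys become the row index.
df = pd.DataFrame(dct)
print(df)
```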

First groupby: Seems to generate results that make sense

cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()

      city2 plant1_type plant2_type  p234_r_c
0   Toronto    COMBCYCL        COAL       5.0
1   Detroit    COMBCYCL        COAL       4.0
2  St.Louis        NUKE    COMBCYCL       2.0
3     Miami        COAL    COMBCYCL       0.5
4    Dallas        NUKE        COAL       1.0
5    Dallas    COMBCYCL        NUKE       4.0
6    Dallas        COAL        NUKE       3.0

Second groupby: Produces an error. The only difference is city1 is used rather than city2.

cols = ['city1','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()

Error result:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-443-6426182b55e1> in <module>()
----> 1 test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\series.py in reset_index(self, level, drop, name, inplace)
    967         else:
    968             df = self.to_frame(name)
--> 969             return df.reset_index(level=level, drop=drop)
    970 
    971     def __unicode__(self):

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   2944                     level_values = _maybe_casted_values(lev, lab)
   2945                     if level is None or i in level:
-> 2946                         new_obj.insert(0, col_name, level_values)
   2947 
   2948         elif not drop:

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
   2447         value = self._sanitize_column(column, value)
   2448         self._data.insert(loc, column, value,
-> 2449                           allow_duplicates=allow_duplicates)
   2450 
   2451     def assign(self, **kwargs):

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\internals.py in insert(self, loc, item, value, allow_duplicates)
   3508         if not allow_duplicates and item in self.items:
   3509             # Should this be a different kind of error??
-> 3510             raise ValueError('cannot insert %s, already exists' % item)
   3511 
   3512         if not isinstance(loc, int):

ValueError: cannot insert plant2_type, already exists

Lastly:

How can I get the city1 column in the result of groupby using ['city2','plant1_type','plant2_type'] and city2 column in the result of groupby using ['city1','plant1_type','plant2_type']?

I want to know the corresponding city1 value for groupby using ['city2','plant1_type','plant2_type'] and corresponding city2 value for groupby using ['city1','plant1_type','plant2_type'].
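One possible way to keep the other city column, sketched here as an assumption rather than as the approach from the linked post: select whole rows instead of a single column, so every column of the winning row comes along. The names `top_rows`, `top_n` and `n` are hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

cols = ["city1", "plant1_type", "plant2_type"]

# N = 1: idxmax returns the row label of the largest value in each group,
# and df.loc pulls the full rows, including city2.
top_rows = df.loc[df.groupby(cols)["p234_r_c"].idxmax()].reset_index(drop=True)

# General N: sort descending, then take the first n rows of each group.
n = 2
top_n = df.sort_values("p234_r_c", ascending=False).groupby(cols).head(n)

print(top_rows)
print(top_n)
```

Swapping `city1` for `city2` in `cols` gives the other grouping, with `city1` carried along in the same way.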

UPDATE:

Why do the following two results have completely different structures? The only difference is that city2 is used in #A while city1 is used in #B.

A)

cols = ['city2','plant1_type','plant2_type']
test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)


city2     plant1_type  plant2_type
Toronto   COMBCYCL     COAL           5.0
Detroit   COMBCYCL     COAL           4.0
St.Louis  NUKE         COMBCYCL       2.0
Miami     COAL         COMBCYCL       0.5
Dallas    NUKE         COAL           1.0
          COMBCYCL     NUKE           4.0
          COAL         NUKE           3.0
Name: p234_r_c, dtype: float64

B)

cols2 = ['city1','plant1_type','plant2_type']
test1.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)

city1    plant1_type  plant2_type  city1    plant1_type  plant2_type
Austin   COAL         NUKE         Austin   COAL         NUKE           3.0
Chicago  COAL         COMBCYCL     Chicago  COAL         COMBCYCL       0.5
         COMBCYCL     COAL         Chicago  COMBCYCL     COAL           5.0
         NUKE         COMBCYCL     Chicago  NUKE         COMBCYCL       2.0
Houston  COMBCYCL     NUKE         Houston  COMBCYCL     NUKE           4.0
Miami    NUKE         COAL         Miami    NUKE         COAL           1.0
Name: p234_r_c, dtype: float64

Solution

Try this:

In [76]: df.groupby(cols2)['p234_r_c'].nlargest(1).reset_index(level=3, drop=True).reset_index()
Out[76]:
     city1 plant1_type plant2_type  p234_r_c
0   Austin        COAL        NUKE       3.0
1  Chicago        COAL    COMBCYCL       0.5
2  Chicago    COMBCYCL        COAL       5.0
3  Chicago        NUKE    COMBCYCL       2.0
4  Houston    COMBCYCL        NUKE       4.0
5    Miami        NUKE        COAL       1.0
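To see why `level=3` is the level being dropped, it may help to inspect the intermediate index. The sketch below assumes the same df as in the question and a reasonably recent pandas; when grouping by columns, the result of nlargest carries the three group keys plus the original row label as a fourth, unnamed level.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]

s = df.groupby(cols2)["p234_r_c"].nlargest(1)

# Three group-key levels followed by the original row label.
print(s.index.names)
print(s.index.nlevels)

# Drop the row-label level (position 3), then turn the keys into columns.
out = s.reset_index(level=3, drop=True).reset_index()
print(out)
```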

Frankly speaking, I don't understand the following behavior:

In [77]: df.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)
Out[77]:
city1    plant1_type  plant2_type  city1    plant1_type  plant2_type
Austin   COAL         NUKE         Austin   COAL         NUKE           3.0
Chicago  COAL         COMBCYCL     Chicago  COAL         COMBCYCL       0.5
         COMBCYCL     COAL         Chicago  COMBCYCL     COAL           5.0
         NUKE         COMBCYCL     Chicago  NUKE         COMBCYCL       2.0
Houston  COMBCYCL     NUKE         Houston  COMBCYCL     NUKE           4.0
Miami    NUKE         COAL         Miami    NUKE         COAL           1.0
Name: p234_r_c, dtype: float64

where:

In [78]: cols2
Out[78]: ['city1', 'plant1_type', 'plant2_type']
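The output above at least suggests where the ValueError in the question comes from: the result carries the three key levels twice, so reset_index() tries to insert the same column name a second time. If the set_index form has to be kept, one hedged workaround is to drop the duplicated leading levels by position before resetting the index. This is only a sketch; whether the levels are duplicated may depend on the pandas version, hence the conditional.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]

s = df.set_index(cols2).groupby(level=cols2)["p234_r_c"].nlargest(1)

# If the group keys were prepended to the existing index, the key levels
# appear twice. Drop the extra leading levels by integer position, because
# the duplicated names cannot be addressed by name.
extra = s.index.nlevels - len(cols2)
if extra > 0:
    s = s.droplevel(list(range(extra)))

out = s.reset_index()
print(out)
```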
