More bizarre results using groupby and nlargest() in pandas
Problem description
This question is an extension of the following post: select largest N of a column of each groupby group using pandas
Let's use the same df and the workaround proposed in the selected answer. Basically, I am trying to do two groupby operations and select the N largest values of each group. However, as you can see below, I get an error for one of the operations.
Given that the original post discovered a bug in the code (see here), I am wondering whether there is another bug or another manifestation of the same bug?
Unfortunately, I am at a standstill in my work until these issues are fixed and worked out. Can we kindly get some attention on this matter? I can't offer a bounty until tomorrow.
df:
{'city1': {0: 'Chicago',
1: 'Chicago',
2: 'Chicago',
3: 'Chicago',
4: 'Miami',
5: 'Houston',
6: 'Austin'},
'city2': {0: 'Toronto',
1: 'Detroit',
2: 'St.Louis',
3: 'Miami',
4: 'Dallas',
5: 'Dallas',
6: 'Dallas'},
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
'plant1_type': {0: 'COMBCYCL',
1: 'COMBCYCL',
2: 'NUKE',
3: 'COAL',
4: 'NUKE',
5: 'COMBCYCL',
6: 'COAL'},
'plant2_type': {0: 'COAL',
1: 'COAL',
2: 'COMBCYCL',
3: 'COMBCYCL',
4: 'COAL',
5: 'NUKE',
6: 'NUKE'}}
You can use the above dict to generate the df: pd.DataFrame(dct)
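For anyone who wants to reproduce this quickly, here is a minimal sketch that builds the same frame. The dict above is shown without an assignment, so binding it to `dct` is my assumption, and I wrote the columns as plain lists rather than index-keyed dicts (the default integer index 0..6 should come out the same).

```python
import pandas as pd

# Same values as the dict above, written column-wise.
# `dct` is the name used in the post; the list form is an equivalent rewrite.
dct = {
    'city1': ['Chicago', 'Chicago', 'Chicago', 'Chicago', 'Miami', 'Houston', 'Austin'],
    'city2': ['Toronto', 'Detroit', 'St.Louis', 'Miami', 'Dallas', 'Dallas', 'Dallas'],
    'p234_r_c': [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    'plant1_type': ['COMBCYCL', 'COMBCYCL', 'NUKE', 'COAL', 'NUKE', 'COMBCYCL', 'COAL'],
    'plant2_type': ['COAL', 'COAL', 'COMBCYCL', 'COMBCYCL', 'COAL', 'NUKE', 'NUKE'],
}

# Default RangeIndex 0..6 corresponds to the integer keys in the original dict.
df = pd.DataFrame(dct)
```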
First groupby: Seems to generate results that make sense
cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()
city2 plant1_type plant2_type p234_r_c
0 Toronto COMBCYCL COAL 5.0
1 Detroit COMBCYCL COAL 4.0
2 St.Louis NUKE COMBCYCL 2.0
3 Miami COAL COMBCYCL 0.5
4 Dallas NUKE COAL 1.0
5 Dallas COMBCYCL NUKE 4.0
6 Dallas COAL NUKE 3.0
Second groupby: Produces an error. The only difference is that city1 is used rather than city2.
cols = ['city1','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()
Error result:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-443-6426182b55e1> in <module>()
----> 1 test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\series.py in reset_index(self, level, drop, name, inplace)
967 else:
968 df = self.to_frame(name)
--> 969 return df.reset_index(level=level, drop=drop)
970
971 def __unicode__(self):
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
2944 level_values = _maybe_casted_values(lev, lab)
2945 if level is None or i in level:
-> 2946 new_obj.insert(0, col_name, level_values)
2947
2948 elif not drop:
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
2447 value = self._sanitize_column(column, value)
2448 self._data.insert(loc, column, value,
-> 2449 allow_duplicates=allow_duplicates)
2450
2451 def assign(self, **kwargs):
C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\internals.py in insert(self, loc, item, value, allow_duplicates)
3508 if not allow_duplicates and item in self.items:
3509 # Should this be a different kind of error??
-> 3510 raise ValueError('cannot insert %s, already exists' % item)
3511
3512 if not isinstance(loc, int):
ValueError: cannot insert plant2_type, already exists
Lastly:
How can I get the city1 column in the result of groupby using ['city2','plant1_type','plant2_type'] and the city2 column in the result of groupby using ['city1','plant1_type','plant2_type']?
I want to know the corresponding city1 value for groupby using ['city2','plant1_type','plant2_type'] and the corresponding city2 value for groupby using ['city1','plant1_type','plant2_type'].
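One way this last part might be approached (a sketch, not something taken from the answers here): instead of calling nlargest on a single column, sort the whole frame by the value column and keep the first n rows of each group, so every other column, including the other city, comes along. The helper name `top_n_rows` is hypothetical; `sort_values` and `groupby(...).head(n)` are standard pandas calls.

```python
import pandas as pd

df = pd.DataFrame({
    'city1': ['Chicago', 'Chicago', 'Chicago', 'Chicago', 'Miami', 'Houston', 'Austin'],
    'city2': ['Toronto', 'Detroit', 'St.Louis', 'Miami', 'Dallas', 'Dallas', 'Dallas'],
    'p234_r_c': [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    'plant1_type': ['COMBCYCL', 'COMBCYCL', 'NUKE', 'COAL', 'NUKE', 'COMBCYCL', 'COAL'],
    'plant2_type': ['COAL', 'COAL', 'COMBCYCL', 'COMBCYCL', 'COAL', 'NUKE', 'NUKE'],
})


def top_n_rows(frame, group_cols, value_col, n=1):
    """Hypothetical helper: whole rows holding the n largest value_col per group."""
    # Sort once so the largest values come first within every group.
    ordered = frame.sort_values(value_col, ascending=False)
    # head(n) keeps the first n rows of each group with all columns intact;
    # sort_index() restores the original row order.
    return ordered.groupby(group_cols).head(n).sort_index()


# Grouping on city1 still leaves city2 available in the result, and vice versa.
by_city1 = top_n_rows(df, ['city1', 'plant1_type', 'plant2_type'], 'p234_r_c')
by_city2 = top_n_rows(df, ['city2', 'plant1_type', 'plant2_type'], 'p234_r_c')
print(by_city1[['city1', 'city2', 'plant1_type', 'plant2_type', 'p234_r_c']])
```

Because the rows keep their original index, this avoids the reset_index step (and its duplicate-name error) altogether.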
UPDATE:
Why do the results of the following have completely different structures? The only difference is that city2 is used in #A while city1 is used in #B.
A)
cols = ['city2','plant1_type','plant2_type']
test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)
city2 plant1_type plant2_type
Toronto COMBCYCL COAL 5.0
Detroit COMBCYCL COAL 4.0
St.Louis NUKE COMBCYCL 2.0
Miami COAL COMBCYCL 0.5
Dallas NUKE COAL 1.0
COMBCYCL NUKE 4.0
COAL NUKE 3.0
Name: p234_r_c, dtype: float64
B)
cols2 = ['city1','plant1_type','plant2_type']
test1.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)
city1 plant1_type plant2_type city1 plant1_type plant2_type
Austin COAL NUKE Austin COAL NUKE 3.0
Chicago COAL COMBCYCL Chicago COAL COMBCYCL 0.5
COMBCYCL COAL Chicago COMBCYCL COAL 5.0
NUKE COMBCYCL Chicago NUKE COMBCYCL 2.0
Houston COMBCYCL NUKE Houston COMBCYCL NUKE 4.0
Miami NUKE COAL Miami NUKE COAL 1.0
Name: p234_r_c, dtype: float64
Solution
Try this:
In [76]: df.groupby(cols2)['p234_r_c'].nlargest(1).reset_index(level=3, drop=True).reset_index()
Out[76]:
city1 plant1_type plant2_type p234_r_c
0 Austin COAL NUKE 3.0
1 Chicago COAL COMBCYCL 0.5
2 Chicago COMBCYCL COAL 5.0
3 Chicago NUKE COMBCYCL 2.0
4 Houston COMBCYCL NUKE 4.0
5 Miami NUKE COAL 1.0
Frankly speaking, I don't understand the following behavior:
In [77]: df.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)
Out[77]:
city1 plant1_type plant2_type city1 plant1_type plant2_type
Austin COAL NUKE Austin COAL NUKE 3.0
Chicago COAL COMBCYCL Chicago COAL COMBCYCL 0.5
COMBCYCL COAL Chicago COMBCYCL COAL 5.0
NUKE COMBCYCL Chicago NUKE COMBCYCL 2.0
Houston COMBCYCL NUKE Houston COMBCYCL NUKE 4.0
Miami NUKE COAL Miami NUKE COAL 1.0
Name: p234_r_c, dtype: float64
where:
In [78]: cols2
Out[78]: ['city1', 'plant1_type', 'plant2_type']
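One possible reading of the output above, offered as a guess rather than a confirmed diagnosis: with cols2, the group (Chicago, COMBCYCL, COAL) holds two rows, so nlargest(1) really drops a row, and pandas prepends the group keys to the existing index. After set_index(cols2) that existing index already carries the same three names, so every name shows up twice and reset_index() cannot insert them as columns. With city2 every group has exactly one row, which may be why the older pandas in the question left the index alone; newer releases may prepend the keys in both cases. A minimal sketch to inspect this and drop the duplicated levels by position:

```python
import pandas as pd

df = pd.DataFrame({
    'city1': ['Chicago', 'Chicago', 'Chicago', 'Chicago', 'Miami', 'Houston', 'Austin'],
    'city2': ['Toronto', 'Detroit', 'St.Louis', 'Miami', 'Dallas', 'Dallas', 'Dallas'],
    'p234_r_c': [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    'plant1_type': ['COMBCYCL', 'COMBCYCL', 'NUKE', 'COAL', 'NUKE', 'COMBCYCL', 'COAL'],
    'plant2_type': ['COAL', 'COAL', 'COMBCYCL', 'COMBCYCL', 'COAL', 'NUKE', 'NUKE'],
})
cols2 = ['city1', 'plant1_type', 'plant2_type']

res = df.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)

# Group keys come first, followed by the levels of the original index.
print(res.index.names)

# Positions are used instead of names because the names are ambiguous here:
# levels 3, 4 and 5 are the original index levels repeated after the keys.
flat = res.reset_index(level=[3, 4, 5], drop=True).reset_index()
print(flat)
```

Grouping on the columns directly, as in In [76], sidesteps the problem because the original index is then a plain integer index with no name to clash with.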