Pandas groupby category, rating, get top value from each category?
Question
First question on SO, very new to pandas and still a little shaky on the terminology: I'm trying to figure out the proper syntax/sequence of operations on a dataframe to be able to group by column B, find the max (or min) corresponding value for each group in column C, and retrieve the corresponding value for that in column A.
Let's say this is my dataframe:
name type votes
bob dog 10
pete cat 8
fluffy dog 5
max cat 9
Using df.groupby('type').votes.agg('max') returns:
dog 10
cat 9
So far, so good. However, I'd like to figure out how to return this:
dog 10 bob
cat 9 max
I've gotten as far as df.groupby(['type', 'votes']).name.agg('max'), though that returns
dog 5 fluffy
10 bob
cat 8 pete
9 max
... which is fine for this pretend dataframe, but doesn't quite help when working with a much larger one.
Thanks very much!
Recommended answer
If df has an index with no duplicate values, then you can use idxmax to return the index of the maximum row for each group. Then use df.loc to select the entire row:
In [322]: df.loc[df.groupby('type').votes.agg('idxmax')]
Out[322]:
name type votes
3 max cat 9
0 bob dog 10
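As a self-contained sketch of the step above, the following rebuilds the sample data from the question (assuming the default integer index, which is unique) and reshapes the result into the type / votes / name layout the question asks for. The variable names top_idx, top and bottom are illustrative only, and the idxmin line covers the "or min" case mentioned in the question.

```python
import pandas as pd

# Sample data from the question, with the default (unique) RangeIndex.
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
                   'type': ['dog', 'cat', 'dog', 'cat'],
                   'votes': [10, 8, 5, 9]})

# Index label of the highest-votes row within each type.
top_idx = df.groupby('type')['votes'].idxmax()

# Select those rows, then arrange them as type -> (votes, name).
top = df.loc[top_idx].set_index('type')[['votes', 'name']]
print(top)

# The same pattern with idxmin gives the lowest-votes row per type.
bottom = df.loc[df.groupby('type')['votes'].idxmin()]
print(bottom)
```

Because groupby sorts the group keys by default, the rows come back in the order cat, dog rather than in the order they appear in df.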
If df.index has duplicate values, i.e. is not a unique index, then make the index unique first:
df = df.reset_index()
Then use idxmax:
result = df.loc[df.groupby('type').votes.agg('idxmax')]
If you really need to, you can return df to its original state:
df = df.set_index(['index'], drop=True)
but in general, life is much better with a unique index.
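One way to avoid the reset-and-restore round trip is to do the reset on a temporary copy only when it is needed. The sketch below wraps that in a helper; top_row_per_group is a hypothetical name, not a pandas function, and it assumes the index has no name that clashes with an existing column (otherwise reset_index would raise).

```python
import pandas as pd

def top_row_per_group(df, group_col, value_col):
    """Return the row with the largest value_col within each group_col.

    Hypothetical helper: the caller's DataFrame is never modified.
    """
    # reset_index() returns a new DataFrame, so df keeps its own index.
    work = df if df.index.is_unique else df.reset_index()
    return work.loc[work.groupby(group_col)[value_col].idxmax()]

# Same data as the question, but with duplicated index labels.
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
                   'type': ['dog', 'cat', 'dog', 'cat'],
                   'votes': [10, 8, 5, 9]},
                  index=list('AABB'))

result = top_row_per_group(df, 'type', 'votes')
print(result)
```

When the index had to be reset, the old labels show up in the result as an extra column named index, which can be used to restore them if required.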
Here is an example showing what goes wrong when df does not have a unique index. Suppose the index is AABB:
import pandas as pd
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
'type': ['dog', 'cat', 'dog', 'cat'],
'votes': [10, 8, 5, 9]},
index=list('AABB'))
print(df)
# name type votes
# A bob dog 10
# A pete cat 8
# B fluffy dog 5
# B max cat 9
idxmax returns the index values A and B:
print(df.groupby('type').votes.agg('idxmax'))
type
cat B
dog A
Name: votes, dtype: object
But A and B do not uniquely specify the desired rows. df.loc[...] returns all rows whose index value is A or B:
print(df.loc[df.groupby('type').votes.agg('idxmax')])
# name type votes
# B fluffy dog 5
# B max cat 9
# A bob dog 10
# A pete cat 8
In contrast, if we reset the index:
df = df.reset_index()
# index name type votes
# 0 A bob dog 10
# 1 A pete cat 8
# 2 B fluffy dog 5
# 3 B max cat 9
then df.loc can be used to select the desired rows:
print(df.groupby('type').votes.agg('idxmax'))
# type
# cat 3
# dog 0
# Name: votes, dtype: int64
print(df.loc[df.groupby('type').votes.agg('idxmax')])
# index name type votes
# 3 B max cat 9
# 0 A bob dog 10
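For comparison, here is a different technique that is not part of the answer above: sorting by votes and keeping the first row per type. Since it never looks rows up by index label, it appears to sidestep the duplicate-index problem without a reset. Treat it as a sketch under that assumption; when several rows tie for the maximum, the row it keeps may not be the one idxmax would pick.

```python
import pandas as pd

# Same data as the example above, including the non-unique AABB index.
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
                   'type': ['dog', 'cat', 'dog', 'cat'],
                   'votes': [10, 8, 5, 9]},
                  index=list('AABB'))

# Put the highest votes first, then keep the first row seen for each type.
top = (df.sort_values('votes', ascending=False)
         .drop_duplicates('type'))
print(top)
```

The result keeps the original index labels, and its rows are ordered by votes rather than by type.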