Find row where values for column is maximal in a pandas DataFrame
Question
How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, but I don't know how to get the corresponding row.
Recommended answer
You just need the argmax() (now called idxmax) function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
          A         B         C
0  1.232853 -1.979459 -0.573626
1  0.140767  0.394940  1.068890
2  0.742023  1.343977 -0.579745
3  2.125299 -0.649328 -0.211692
4 -0.187253  1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
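The calls above return the index label of the maximal row rather than the row itself. To get the row the question asks for, that label can be passed to .loc. The following is a minimal sketch, not part of the original answer: the data is made up and fixed (instead of random) so the result is reproducible, and it assumes the row labels are unique.

```python
import pandas as pd

# Hypothetical fixed data, so the result is reproducible.
df = pd.DataFrame(
    {"A": [1.2, 0.1, 0.7, 2.1, -0.2], "B": [-2.0, 0.4, 1.3, -0.6, 1.9]}
)

label = df["A"].idxmax()  # index label of the row where column A is largest
row = df.loc[label]       # the full row as a Series (labels are unique here)

print(label)     # 3
print(row["B"])  # -0.6
```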
This function was renamed to idxmax in the Pandas API, though as of Pandas 0.16, argmax still exists and performs the same function (though it appears to run more slowly than idxmax).
You can also just use numpy.argmax, such as numpy.argmax(df['A']). It provides the same thing as either of the two pandas functions, and appears at least as fast as idxmax in cursory observations.
Previously (as noted in the comments) it appeared that argmax would exist as a separate function which provided the integer position within the index of the row location of the maximum element. For example, if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd'). However, in pandas 0.16, all of the methods listed above only provide the label from the Index for the row in question, and if you want the position integer of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
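One way to get the position manually is sketched below. It is an illustration rather than the original author's code: the Series is made up, positions are counted from zero, and the Index.get_loc step only returns a single integer when the labels are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with unique string labels 'a' through 'e'.
s = pd.Series([0.3, 0.9, 0.1, 1.5, 0.2], index=list("abcde"))

label = s.idxmax()                   # the label of the maximal row
pos = int(np.argmax(s.to_numpy()))   # zero-based position, taken from the raw array

# With unique labels, the Index can also map the label back to its position.
pos_from_label = s.index.get_loc(label)

print(label)           # d
print(pos)             # 3
print(pos_from_label)  # 3
```

Going through the underlying NumPy array avoids depending on what Series.argmax returns in a given pandas version.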
In general, I think the move to idxmax-like behavior for all three of the approaches (argmax, which still exists, idxmax, and numpy.argmax) is a bad thing, since it is very common to require the positional integer location of a maximum, perhaps even more common than desiring the label of that positional location within some index, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
          A         B         C
a  0.143693  0.653810  0.586007
b  0.623582  0.312903  0.919076
c  0.165438  0.889809  0.000967
d  0.308245  0.787776  0.571195
e  0.870068  0.935626  0.606911
f  0.037602  0.855193  0.728495
g  0.605366  0.338105  0.696460
h  0.000000  0.090814  0.963927
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.loc[dfrm['A'].idxmax()] # .ix instead of .loc in older versions of pandas
Out[21]:
          A         B         C
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
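The duplicate-label case can be reproduced with a short sketch. This is an illustration under stated assumptions, not the original author's code: it rebuilds only column A of the toy frame from the values printed above, and it takes the position from the underlying NumPy array and then selects the row with .iloc.

```python
import numpy as np
import pandas as pd

# Column A of the toy frame above, with the label 'i' appearing twice.
labels = list("abcdefghi") + ["i"]
dfrm = pd.DataFrame(
    {"A": [0.143693, 0.623582, 0.165438, 0.308245, 0.870068,
           0.037602, 0.605366, 0.000000, 0.688343, 0.879000]},
    index=labels,
)

label = dfrm["A"].idxmax()   # 'i', which is ambiguous here
by_label = dfrm.loc[label]   # both rows labelled 'i', not only the max row

pos = int(np.argmax(dfrm["A"].to_numpy()))  # zero-based position of the max
max_row = dfrm.iloc[pos]                    # exactly one row, selected by position

print(len(by_label))  # 2
print(pos)            # 9
print(max_row["A"])   # 0.879
```

Selecting by position stays unambiguous regardless of how many rows share a label, which is why it suits data whose index cannot be guaranteed unique.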
This is exactly one of those nasty, bug-prone behaviors in dynamically typed languages that make this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that were not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests). Otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours' worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figure out that it's because idxmax can only report the label of the max row, and then be disappointed that no standard function automatically gets the position of the max row for you, write a buggy implementation yourself, edit the code, and pray you don't run into the problem again.