Find the row where a column's value is maximal in a pandas DataFrame
How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, but I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
          A         B         C
0  1.232853 -1.979459 -0.573626
1  0.140767  0.394940  1.068890
2  0.742023  1.343977 -0.579745
3  2.125299 -0.649328 -0.211692
4 -0.187253  1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
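To get the whole row rather than only its label, the label returned by idxmax can be passed to .loc. The following is a minimal sketch on a small hand-made frame; the values are illustrative and are not the random data from the session above.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame(
    {
        "A": [1.2, 0.1, 0.7, 2.1, -0.2],
        "B": [-2.0, 0.4, 1.3, -0.6, 1.9],
    }
)

label = df["A"].idxmax()  # index label of the row where column A is largest
row = df.loc[label]       # the full row for that label, returned as a Series

print(label)
print(row)
```

This assumes the index labels are unique; with duplicate labels, .loc returns every row that shares the label.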
Alternatively, you could use numpy.argmax, such as numpy.argmax(df['A']). With the default integer index shown above it gives the same result, and it appears at least as fast as idxmax in cursory observations.
idxmax() returns index labels, not integer positions.
- Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in the fourth row (integer position 3), not just in row 'd'.
- If you want the integer position of that label within the Index, you have to get it manually (which can be tricky now that duplicate row labels are allowed).
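The difference between a label and an integer position can be sketched as follows. The Series and its values are hypothetical. The example also assumes a unique index, so that Index.get_loc returns a single integer.

```python
import pandas as pd

# Hypothetical Series with string labels 'a' through 'e'.
s = pd.Series([0.3, 0.9, 0.1, 2.5, 1.4], index=list("abcde"))

label = s.idxmax()                     # label of the maximum
position = int(s.to_numpy().argmax())  # zero-based integer position of the maximum

# With a unique index, Index.get_loc maps a label back to its integer position.
position_from_label = s.index.get_loc(label)

print(label, position, position_from_label)
```

Calling argmax on the underlying NumPy array sidesteps the index entirely, which is why it remains well defined even when labels repeat.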
HISTORICAL NOTES:
- idxmax() used to be called argmax() prior to 0.11.
- argmax (in its label-returning form) was deprecated prior to 1.0.0, and that form was removed entirely in 1.0.0.
- Back as of pandas 0.16, argmax existed and performed the same function (though it appeared to run more slowly than idxmax).
- The argmax function returned the integer position, within the index, of the row holding the maximum element.
- pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
          A         B         C
a  0.143693  0.653810  0.586007
b  0.623582  0.312903  0.919076
c  0.165438  0.889809  0.000967
d  0.308245  0.787776  0.571195
e  0.870068  0.935626  0.606911
f  0.037602  0.855193  0.728495
g  0.605366  0.338105  0.696460
h  0.000000  0.090814  0.963927
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.loc[dfrm['A'].idxmax()]  # label-based lookup; .ix instead of .loc in older versions of pandas
Out[21]:
          A         B         C
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
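One way to recover the positional behavior with duplicate labels is to take the argmax of the column's underlying NumPy array and then select with .iloc. The following is a sketch of that workaround on a shortened, made-up frame, not a built-in pandas helper.

```python
import numpy as np
import pandas as pd

# Shortened, made-up frame with a duplicated row label 'i'.
dfrm = pd.DataFrame(
    {"A": [0.14, 0.62, 0.69, 0.88], "B": [0.65, 0.31, 0.19, 0.11]},
    index=["a", "b", "i", "i"],
)

label = dfrm["A"].idxmax()  # the label 'i', which two rows share
by_label = dfrm.loc[label]  # a DataFrame holding both 'i' rows

values = dfrm["A"].to_numpy()
pos = int(values.argmax())     # integer position of the first maximum
by_position = dfrm.iloc[pos]   # exactly one row, selected by position

# Positions of every row attaining the maximum, in case of ties.
all_pos = np.flatnonzero(values == values.max())

print(by_label)
print(by_position)
print(all_pos)
```

Selecting by position avoids the ambiguity because .iloc never consults the labels.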
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests). Otherwise, most likely, you're just left waiting to see if you happen to smack into this error at runtime. In that case you probably have to drop many hours' worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, and finally figure out that it's because idxmax can only report the label of the max row. Then you get to be disappointed that no standard function automatically gets the position of the max row for you, write a buggy implementation yourself, edit the code, and pray you don't run into the problem again.
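A cheap guard against this failure mode is to check the index for duplicates before relying on label-based lookups. The sketch below uses a hypothetical helper name, duplicated_labels, and placeholder labels in place of real CUSIP or SEDOL identifiers.

```python
import pandas as pd


def duplicated_labels(frame: pd.DataFrame) -> list:
    """Return the index labels that occur more than once (hypothetical helper)."""
    mask = frame.index.duplicated(keep=False)  # True for every repeated label
    return list(frame.index[mask].unique())


# Placeholder labels stand in for real asset identifiers.
df = pd.DataFrame({"price": [10.0, 11.5, 9.8]}, index=["AAA", "BBB", "AAA"])

dupes = duplicated_labels(df)
if not df.index.is_unique:
    print("duplicate row labels found:", dupes)
```

Whether to raise, deduplicate, or fall back to positional access after such a check depends on the data, so the sketch only reports the duplicates.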