Find row where values for column is maximal in a pandas DataFrame

Problem description

How can I find the row for which the value of a specific column is maximal?

df.max() gives me the maximal value for each column, but I don't know how to get the corresponding row.

Recommended answer

You just need the argmax() (now called idxmax) function. It's straightforward:

>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
          A         B         C
0  1.232853 -1.979459 -0.573626
1  0.140767  0.394940  1.068890
2  0.742023  1.343977 -0.579745
3  2.125299 -0.649328 -0.211692
4 -0.187253  1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
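
The calls above return the index value of the maximum, not the row. A minimal sketch of the remaining step, assuming a default integer index as in the transcript (the seeded generator and the variable names are my own, chosen only to make the example reproducible):

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random frame above (the seed is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

# idxmax returns the index label of the largest value in column A ...
label = df["A"].idxmax()

# ... and .loc uses that label to pull out the corresponding row.
row = df.loc[label]
print(row)
```

With a unique index this gives back a single row as a Series whose entries are the values of A, B and C.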

This function was updated to the name idxmax in the Pandas API, though as of Pandas 0.16, argmax still exists and performs the same function (though appears to run more slowly than idxmax).

You can also just use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing as either of the two pandas functions, and appears at least as fast as idxmax in cursory observations.
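
To sidestep any version-dependent behaviour of numpy.argmax on a Series, one option is to hand it the underlying array. This is only a sketch; converting with `to_numpy()` is my choice rather than something the comparison above relied on. On a default integer index the label from idxmax and the array position happen to coincide:

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random frame above (the seed is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

# Label of the maximum, as reported by pandas.
label = df["A"].idxmax()

# Position of the maximum in the raw values, as reported by NumPy.
pos = int(np.argmax(df["A"].to_numpy()))

# With the default 0..n-1 index, the label and the position are the same number.
print(label, pos)
```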

Previously (as noted in the comments) it appeared that argmax would exist as a separate function which provided the integer position within the index of the row location of the maximum element. For example, if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd'). However, in pandas 0.16, all of the listed methods above only provide the label from the Index for the row in question, and if you want the position integer of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
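
The label-versus-position distinction can be seen on a small Series with string labels. This is a sketch with made-up values. As far as I know, pandas 1.0 and later changed `Series.argmax` to return the integer position directly, but treat that as an assumption about your installed version; the array-based call below does not depend on it.

```python
import pandas as pd

# Made-up values; the labels 'a' through 'e' mirror the example in the text.
s = pd.Series([0.3, 1.2, -0.5, 2.1, 0.7], index=list("abcde"))

# The label of the maximum.
label = s.idxmax()

# The zero-based position of the maximum in the underlying array
# (the "row 4" in the text, counted from one).
pos = int(s.to_numpy().argmax())

# For a unique index, the position can also be recovered from the label.
pos_from_label = s.index.get_loc(label)

print(label, pos, pos_from_label)
```

`Index.get_loc` only returns a single integer when the label is unique, which is exactly why duplicate labels make the manual route tricky.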

In general, I think the move to idxmax-like behavior for all three of the approaches (argmax, which still exists, idxmax, and numpy.argmax) is a bad thing, since it is very common to require the positional integer location of a maximum, perhaps even more common than desiring the label of that positional location within some index, especially in applications where duplicate row labels are common.

For example, consider this toy DataFrame with a duplicate row label:

In [19]: dfrm
Out[19]: 
          A         B         C
a  0.143693  0.653810  0.586007
b  0.623582  0.312903  0.919076
c  0.165438  0.889809  0.000967
d  0.308245  0.787776  0.571195
e  0.870068  0.935626  0.606911
f  0.037602  0.855193  0.728495
g  0.605366  0.338105  0.696460
h  0.000000  0.090814  0.963927
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260

In [20]: dfrm['A'].idxmax()
Out[20]: 'i'

In [21]: dfrm.loc[dfrm['A'].idxmax()]  # .ix instead of .loc in older versions of pandas
Out[21]: 
          A         B         C
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260

So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
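
One way to get that position by hand is to take the argmax of the underlying values and index with `.iloc`. This is a sketch rather than a standard recipe, and the variable names are mine. It rebuilds the toy frame from the listing above:

```python
import pandas as pd

# The toy frame from above; the last two rows share the label "i".
dfrm = pd.DataFrame(
    {
        "A": [0.143693, 0.623582, 0.165438, 0.308245, 0.870068,
              0.037602, 0.605366, 0.000000, 0.688343, 0.879000],
        "B": [0.653810, 0.312903, 0.889809, 0.787776, 0.935626,
              0.855193, 0.338105, 0.090814, 0.188468, 0.105039],
        "C": [0.586007, 0.919076, 0.000967, 0.571195, 0.606911,
              0.728495, 0.696460, 0.963927, 0.352213, 0.900260],
    },
    index=list("abcdefghi") + ["i"],
)

# Label-based lookup: the label "i" is ambiguous, so two rows come back.
label = dfrm["A"].idxmax()
by_label = dfrm.loc[label]

# Position-based lookup: argmax on the raw values ignores the labels,
# so .iloc returns exactly one row.
pos = int(dfrm["A"].to_numpy().argmax())
by_pos = dfrm.iloc[pos]

print(label, pos)
print(by_label.shape)
print(by_pos)
```

If the maximum value itself is tied, argmax reports only the first position, so code that needs every tied row would have to compare against the maximum explicitly.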

This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.

So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
