Find row where values for column is maximal in a pandas DataFrame

Question

How can I find the row for which the value of a specific column is maximal?

df.max() will give me the maximal value for each column, but I don't know how to get the corresponding row.

Solution

Use the pandas idxmax function. It's straightforward:

>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
          A         B         C
0  1.232853 -1.979459 -0.573626
1  0.140767  0.394940  1.068890
2  0.742023  1.343977 -0.579745
3  2.125299 -0.649328 -0.211692
4 -0.187253  1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1

  • Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']). It returns a position rather than a label, so it gives the same answer as idxmax only when the index is the default 0..n-1 integer range. It appears at least as fast as idxmax in cursory observations.

  • idxmax() returns index labels, not integer positions.

  • Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in the 4th row (not row 'd').

  • If you want the integer position of that label within the Index, you have to get it manually (which can be tricky now that duplicate row labels are allowed).
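
As a rough sketch of the label-versus-position distinction in the notes above, the snippet below uses a small made-up frame with string labels 'a' through 'e' (the data values are illustrative assumptions). It gets the position either by running argmax on the underlying NumPy array or, when the index is unique, by asking the Index for the label's location.

```python
import pandas as pd

# Illustrative data only; the maximum of A sits in the 4th row, labelled 'd'.
df = pd.DataFrame({"A": [1.0, 0.5, 2.0, 3.5, -1.0]}, index=list("abcde"))

label = df["A"].idxmax()                    # 'd' (an index label)
position = df["A"].to_numpy().argmax()      # 3   (a zero-based position)

# With a unique index, Index.get_loc maps the label back to its position.
same_position = df.index.get_loc(label)     # 3

print(label, position, same_position)
```

Index.get_loc is only a safe shortcut here because the labels are unique; with repeated labels it does not return a single integer.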


HISTORICAL NOTES:

  • idxmax() used to be called argmax() prior to 0.11.
  • The label-returning argmax was deprecated prior to 1.0.0 and removed in 1.0.0; from 1.0.0 onward, Series.argmax returns the integer position of the maximum instead.
  • As of pandas 0.16, argmax still existed and performed the same function as idxmax (though it appeared to run more slowly than idxmax).
  • The argmax function returned the integer position, within the index, of the row holding the maximum element.
  • pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.

For example, consider this toy DataFrame with a duplicate row label:

In [19]: dfrm
Out[19]: 
          A         B         C
a  0.143693  0.653810  0.586007
b  0.623582  0.312903  0.919076
c  0.165438  0.889809  0.000967
d  0.308245  0.787776  0.571195
e  0.870068  0.935626  0.606911
f  0.037602  0.855193  0.728495
g  0.605366  0.338105  0.696460
h  0.000000  0.090814  0.963927
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260

In [20]: dfrm['A'].idxmax()
Out[20]: 'i'

In [21]: dfrm.loc[dfrm['A'].idxmax()]  # .ix instead of .loc in older versions of pandas
Out[21]: 
          A         B         C
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260

So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
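
One possible way to recover the positional location with duplicate labels, sketched under the assumption that only column A matters: take argmax on the column's NumPy array and index with .iloc. The A values are copied from the toy DataFrame above; the names position and max_row are made up for this sketch.

```python
import pandas as pd

# Column A of the toy DataFrame above, including the duplicated label 'i'.
dfrm = pd.DataFrame(
    {"A": [0.143693, 0.623582, 0.165438, 0.308245, 0.870068,
           0.037602, 0.605366, 0.000000, 0.688343, 0.879000]},
    index=list("abcdefghi") + ["i"],
)

# Label lookup is ambiguous: both rows labelled 'i' come back.
by_label = dfrm.loc[dfrm["A"].idxmax()]

# Positional lookup is not: argmax on the raw array ignores the labels.
position = int(dfrm["A"].to_numpy().argmax())
max_row = dfrm.iloc[position]

print(len(by_label), position, max_row["A"])
```

Working on the raw array sidesteps the index entirely, so the result does not depend on whether the labels are unique.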

This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.

So you're left hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests). Otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours' worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figure out that it's because idxmax can only report the label of the max row, be disappointed that no standard function automatically gets the position of the max row for you, write a buggy implementation yourself, edit the code, and pray you don't run into the problem again.
