How to get column name for second largest row value in pandas DataFrame

Question

I have what I think is a pretty simple question, but I can't wrap my head around it. I am a beginner with Python and pandas. I searched the forum but couldn't find a (recent) answer that fits my need.

I have a data frame such as this one:

import pandas as pd
df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]}, index=['a1', 'a2', 'a3'])

Which gives:

          A   B    C   D
    a1  1.1   2  3.3   4
    a2  2.7  10  5.4   7
    a3  5.3   9  1.5  15

My question is simple: I would like to add a column that gives the column name of the second-largest value in each row.

I have written a simple function that returns the second-largest value of each row:

def get_second_best(x):
    return sorted(x)[-2]

df['value'] = df.apply(get_second_best, axis=1)

Which gives:

      A   B    C   D  value
a1  1.1   2  3.3   4    3.3
a2  2.7  10  5.4   7    7.0
a3  5.3   9  1.5  15    9.0

But I can't find how to display the column name in the 'value' column instead of the value. I'm thinking about boolean indexing (comparing the 'value' column values with each row), but I haven't figured out how to do it.

To be clearer, I would like it to be:

      A   B    C   D  value
a1  1.1   2  3.3   4    C
a2  2.7  10  5.4   7    D
a3  5.3   9  1.5  15    B

Any help (and explanation) appreciated!

Recommended answer

One approach is to pick out the two largest elements in each row using Series.nlargest and then find the column corresponding to the smaller of the two using Series.idxmin:

In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())

In [46]: df
Out[46]:
      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
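
The same idea can also be written without the transpose by applying the function along axis=1. The following is only a minimal sketch on the question's sample frame; the variable name second_col is illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)

# For each row: keep the two largest values, then take the label of the smaller one.
second_col = df.apply(lambda row: row.nlargest(2).idxmin(), axis=1)

print(second_col.tolist())  # ['C', 'D', 'B']
```

Compute the result before assigning it to df['value'], so the new string column is not itself included in the row-wise comparison.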

It is worth noting that picking Series.idxmin over DataFrame.idxmin can make a difference performance-wise:

import numpy as np

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs

Adding to @jpp's answer: if performance matters, you can gain a significant speed-up by using Numba, writing the code as if it were C and compiling it:

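import numpy as np
from numba import njit, prange  # prange is only needed for the parallel variant

@njit
def arg_second_largest(arr):
    # For each row of a 2-D array, return the column position of the second-largest value.
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        # Start both running values at negative infinity.
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                # New maximum: the previous maximum becomes the runner-up.
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                # New runner-up only.
                second = x
                arg_second = i
        args[k] = arg_second
    return args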
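
To check the logic of the single-pass scan without installing Numba, the same loop can be run as plain Python on the question's sample frame. This is only a sketch for verification: the name arg_second_largest_py is hypothetical, and this uncompiled version will be far slower than the compiled one.

```python
import numpy as np
import pandas as pd

def arg_second_largest_py(arr):
    """Plain-Python copy of the single-pass scan (illustrative helper only)."""
    args = np.empty(len(arr), dtype=np.int_)
    for k, row in enumerate(arr):
        first = second = -np.inf
        arg_first = arg_second = 0
        for i, x in enumerate(row):
            if x >= first:
                # New maximum: demote the old maximum to runner-up.
                second, arg_second = first, arg_first
                first, arg_first = x, i
            elif x >= second:
                # New runner-up only.
                second, arg_second = x, i
        args[k] = arg_second
    return args

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)

positions = arg_second_largest_py(df.to_numpy())
second_cols = df.columns[positions]
print(list(second_cols))  # ['C', 'D', 'B']
```

The function returns integer column positions, which is why the result is passed through df.columns to recover the labels.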

Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:

df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)]        # 74.1 µs ± 775 ns

df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)]        # 1.11 ms ± 22.6 µs

In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
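
The two NumPy one-liners in the benchmark are given without explanation. As a rough sketch of what they do, here they are on the question's sample frame. It uses df.to_numpy() in place of df.values, which is a stylistic assumption rather than something the answer used, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)
values = df.to_numpy()

# Full sort of every row: the second-to-last position holds the index
# of the second-largest value.
by_sort = df.columns[np.argsort(values, axis=1)[:, -2]]

# Partial sort: argpartition only guarantees that position -2 holds the
# element that would be there in sorted order, which is all we need.
by_partition = df.columns[np.argpartition(values, -2, axis=1)[:, -2]]

print(list(by_sort))       # ['C', 'D', 'B']
print(list(by_partition))  # ['C', 'D', 'B']
```

Because argpartition avoids a full sort, it tends to pull ahead as the number of columns grows, which is consistent with the (1000, 1000) timings above. When a row contains tied values, the two methods may pick different columns among the ties.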
