How to get column name for second largest row value in pandas DataFrame
Question
I have a pretty simple question - I think - but it seems I can't wrap my head around this one. I am a beginner with Python and Pandas. I searched the forum but couldn't get a (recent) answer that fits my need.
I have a data frame such as this one:
import pandas as pd
df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]}, index=['a1', 'a2', 'a3'])
Which gives:
      A   B    C   D
a1  1.1   2  3.3   4
a2  2.7  10  5.4   7
a3  5.3   9  1.5  15
My question is simple: I would like to add a column that gives the column name of the second-largest value in each row.
I have written a simple function that returns the second-largest value for each row:
def get_second_best(x):
    # Sort the row's values in ascending order and take the second from the end
    return sorted(x)[-2]

df['value'] = df.apply(get_second_best, axis=1)
Which gives:
      A   B    C   D  value
a1  1.1   2  3.3   4    3.3
a2  2.7  10  5.4   7    7.0
a3  5.3   9  1.5  15    9.0
But I can't find how to display the column name in the 'value' column, instead of the value... I'm thinking about boolean indexing (comparing the 'value' column values with each row), but I haven't figured out how to do it.
To be clearer, I would like it to be:
      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
Any help (and explanation) appreciated!
Recommended answer
One approach is to pick out the two largest elements in each row using Series.nlargest and then find the column corresponding to the smaller of the two using Series.idxmin:
In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())
In [46]: df
Out[46]:
      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
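To see why this works, it can help to walk through a single row. The sketch below rebuilds the example frame from the question and uses apply with axis=1 instead of transposing, which should be equivalent for this data. The names row, top_two, second_label and result are only illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)

# Step by step for row 'a1'
row = df.loc['a1']
top_two = row.nlargest(2)        # the two largest values, labelled by column name
second_label = top_two.idxmin()  # label of the smaller of those two values

# Same idea applied to every row (computed before adding the new column)
result = df.apply(lambda r: r.nlargest(2).idxmin(), axis=1)
df['value'] = result
```

Working row by row keeps the logic readable, at the cost of one Python-level call per row.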
It is worth noting that calling Series.idxmin on each row inside the lambda, rather than DataFrame.idxmin on the result of apply, can make a difference performance-wise:
import numpy as np
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs
Adding to @jpp's answer: if performance matters, you can gain a significant speed-up by using Numba, writing the code as if it were C and compiling it:
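import numpy as np
from numba import njit, prange

@njit
def arg_second_largest(arr):
    # For each row of a 2-D array, return the column index of its second-largest value
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        # Running largest and second-largest values, with their positions
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                # New maximum: the previous maximum becomes the runner-up
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args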
Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:
df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)] # 74.1 µs ± 775 ns
df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)] # 1.11 ms ± 22.6 µs
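The two NumPy one-liners in the benchmark are not explained above, so here is a minimal sketch of what they do on the small example frame from the question. It assumes every column is numeric, so the frame converts to a single float array. The names values, by_sort and by_partition are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)
values = df.to_numpy()

# argsort orders column positions by ascending value within each row,
# so position -2 holds the column index of the second-largest value
by_sort = df.columns[values.argsort(axis=1)[:, -2]]

# argpartition only guarantees that position -2 holds the index a full sort
# would place there, without ordering the remaining positions
by_partition = df.columns[np.argpartition(values, -2, axis=1)[:, -2]]

df['value'] = by_sort
```

Because argpartition skips the full sort, it is plausible that this is why it pulls ahead of argsort on the wide (1000, 1000) frame in the timings above.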
In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
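For reference, a parallel variant along those lines might look like the sketch below. The function name arg_second_largest_parallel is hypothetical, and the try/except fallback is an addition of this sketch so it still runs (slowly, in plain Python) where Numba is not installed. Only the decorator and the outer loop differ from arg_second_largest.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption of this sketch: fall back to plain Python if Numba is missing
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap
    prange = range

@njit(parallel=True)
def arg_second_largest_parallel(arr):
    args = np.empty(len(arr), dtype=np.int64)
    # prange lets Numba distribute the rows across threads
    for k in prange(len(arr)):
        a = arr[k]
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args

# The rows of the example frame from the question, as a plain float array
arr = np.array([[1.1, 2.0, 3.3, 4.0],
                [2.7, 10.0, 5.4, 7.0],
                [5.3, 9.0, 1.5, 15.0]])
idx = arg_second_largest_parallel(arr)
```

Each row is processed independently and writes only to its own slot in args, which is what makes the outer loop safe to parallelise.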