How to find column-index of top-n values within each row of huge dataframe
Question
I have a dataframe in the following format (example data):
    Metric1  Metric2  Metric3  Metric4  Metric5
ID
1       0.5      0.3      0.2      0.8      0.7
2       0.1      0.8      0.5      0.2      0.4
3       0.3      0.1      0.7      0.4      0.2
4       0.9      0.4      0.8      0.5      0.2
where the scores range over [0, 1]. I want to write a function that, for each ID (row), finds the top n metrics, where n is an input to the function along with the original dataframe.
My ideal output would be (e.g. for n = 3):
      Top_1    Top_2    Top_3
ID
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4
I have written a function that does work:
import numpy as np
import pandas as pd

def top_n_partners(scores, top_n=3):
    metrics = np.array(scores.columns)
    records = []
    # Loop over the rows one at a time; the first field of each record is the index (ID)
    for rec in scores.to_records():
        rec = list(rec)
        ID = rec[0]
        score_vals = rec[1:]
        # argsort is ascending, so reverse to get the highest scores first
        inds = np.argsort(score_vals)
        top_metrics = metrics[inds][::-1]
        dic = {
            'top_score_%s' % (i + 1): top_metrics[i]
            for i in range(top_n)
        }
        dic['ID'] = ID
        records.append(dic)
    top_n_df = pd.DataFrame(records)
    top_n_df.set_index('ID', inplace=True)
    return top_n_df
However, it seems rather inefficient and slow, especially for the volume of data I'd be running it over (a dataframe with millions of rows). Is there a smarter way to go about this?
Recommended answer
You can use numpy.argsort:
print(np.argsort(-df.values, axis=1)[:, :3])
[[3 4 0]
 [1 2 4]
 [2 3 0]
 [0 2 3]]
# Index the column names as a NumPy array: recent pandas versions
# do not allow indexing an Index with a 2-D array.
print(df.columns.to_numpy()[np.argsort(-df.values, axis=1)[:, :3]])
[['Metric4' 'Metric5' 'Metric1']
 ['Metric2' 'Metric3' 'Metric5']
 ['Metric3' 'Metric4' 'Metric1']
 ['Metric1' 'Metric3' 'Metric4']]
# Store the result under a new name so the original df is not overwritten
top_df = pd.DataFrame(df.columns.to_numpy()[np.argsort(-df.values, axis=1)[:, :3]],
                      index=df.index)
top_df = top_df.rename(columns=lambda x: 'Top_{}'.format(x + 1))
print(top_df)
      Top_1    Top_2    Top_3
ID
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4
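The steps above hard-code 3, whereas the question asks for a function that takes n as an input. A minimal sketch of such a wrapper follows; the name top_n_columns and its signature are illustrative assumptions, not part of the original answer, and the sample data is the example from the question.

```python
import numpy as np
import pandas as pd


def top_n_columns(scores: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return the names of the n highest-valued columns for every row.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    values = scores.to_numpy()
    # Negating the values turns the ascending argsort into a descending one.
    order = np.argsort(-values, axis=1)[:, :n]
    # Index a plain NumPy array of column names with the 2-D position array.
    names = scores.columns.to_numpy()[order]
    return pd.DataFrame(
        names,
        index=scores.index,
        columns=["Top_{}".format(i + 1) for i in range(n)],
    )


# Example data from the question.
scores = pd.DataFrame(
    {
        "Metric1": [0.5, 0.1, 0.3, 0.9],
        "Metric2": [0.3, 0.8, 0.1, 0.4],
        "Metric3": [0.2, 0.5, 0.7, 0.8],
        "Metric4": [0.8, 0.2, 0.4, 0.5],
        "Metric5": [0.7, 0.4, 0.2, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

result = top_n_columns(scores, n=3)
print(result)
```

The whole computation is a single vectorised sort per call, so no Python-level loop over the rows is needed.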
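Thanks to Divakar for the improvement, which sorts in ascending order and takes the last n positions in reverse instead of negating the whole array: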
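n = 3
# [:, :-n-1:-1] walks backwards from the last column and keeps n entries,
# i.e. the n largest values in descending order, for any n
top_df = pd.DataFrame(df.columns.to_numpy()[df.values.argsort(1)[:, :-n-1:-1]],
                      index=df.index)
top_df = top_df.rename(columns=lambda x: 'Top_{}'.format(x + 1))
print(top_df)
      Top_1    Top_2    Top_3
ID
1   Metric4  Metric5  Metric1
2   Metric2  Metric3  Metric5
3   Metric3  Metric4  Metric1
4   Metric1  Metric3  Metric4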
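When the frame has many columns and n is small, a full sort of every row does more work than necessary. The sketch below swaps in a different technique, numpy.argpartition, to select the n largest entries per row and then sorts only those n candidates. This is not from the original answer; the function name top_n_columns_partition is a hypothetical one, and whether it is actually faster depends on the shape of the data, so it is worth timing on a real sample.

```python
import numpy as np
import pandas as pd


def top_n_columns_partition(scores: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Top-n column names per row via partial sorting (hypothetical helper)."""
    values = scores.to_numpy()
    n_cols = values.shape[1]
    if not 0 < n <= n_cols:
        raise ValueError("n must be between 1 and the number of columns")
    # After partitioning, the last n positions of each row hold the indices
    # of the n largest values, in no particular order.
    part = np.argpartition(values, n_cols - n, axis=1)[:, -n:]
    # Sort only those n candidates, largest first.
    top_vals = np.take_along_axis(values, part, axis=1)
    inner = np.argsort(-top_vals, axis=1)
    order = np.take_along_axis(part, inner, axis=1)
    return pd.DataFrame(
        scores.columns.to_numpy()[order],
        index=scores.index,
        columns=["Top_{}".format(i + 1) for i in range(n)],
    )


# Example data from the question.
example = pd.DataFrame(
    {
        "Metric1": [0.5, 0.1, 0.3, 0.9],
        "Metric2": [0.3, 0.8, 0.1, 0.4],
        "Metric3": [0.2, 0.5, 0.7, 0.8],
        "Metric4": [0.8, 0.2, 0.4, 0.5],
        "Metric5": [0.7, 0.4, 0.2, 0.2],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

partition_result = top_n_columns_partition(example, n=3)
print(partition_result)
```

If a row contains tied scores, the order among the tied columns may differ from the full-sort version.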