Find top 3 in columns of a dataframe using pandas


Question

I have a time series dataset that looks like this:

Date        Newspaper   City1    City2   Region1Total   City3   City4  Region2Total
2017-12-01  NewsPaper1  231563   8696    240259         21072   8998   30070
2017-12-01  NewsPaper2  173009   12180   185189         28910   5550   34460
2017-12-01  NewsPaper3  40511    4600    45111          5040    3330   8370
2017-12-01  NewsPaper4  37770    2980    40750          6520    1880   8400
2017-12-01  NewsPaper5  5176     900     6076           1790    5000   6790
2017-12-01  NewsPaper6  137650   8025    145675         25300  11000   36300
2017-12-01  Total       637547   38201   675748         91032  36558   127590

2018-01-01  NewsPaper1  231295   8391    239686         8790   21176   29966
2018-01-01  NewsPaper2  169937   12130   182067         7890   28850   36740
2018-01-01  NewsPaper3  40453    4570    45023          4750   5055    9800
2018-01-01  NewsPaper4  37766    2970    40736          2500   6540    9040
2018-01-01  NewsPaper5  5136     900     6036           5600   1795    7365
2018-01-01  NewsPaper6  137990   8010    146000         14500  25330   39830
2018-01-01  Total       633919   37786   671705         44980  91141   136121 

I am trying to find the n largest values in each column of this dataframe. I tried the following:

import pandas as pd

somelist = []
data = pd.read_csv('newspaper.csv')  # the file name suggests a CSV, so read_csv rather than read_excel
data.index = pd.to_datetime(data['Date'], errors='coerce')
# Consider only the latest month in the dataframe
last_month = data.loc[data.index[-1]].set_index('Newspaper')
for city in last_month.iloc[:, 2:]:
    # The largest value is the Total row, so take the top 4 and skip the first
    top_3 = last_month[city].nlargest(4)[1:]
    somelist.append(top_3)
print(somelist)

This produces a list of pandas Series, with the name of each column printed after its values:

    [Newspaper
    Newspaper1    231295
    Newspaper2    169937
    Newspaper6    137990
    Name: City1, dtype: float64, Newspaper
    Newspaper2    12130.0
    Newspaper1     8391.0
    Newspaper6     8010.0
    Name: City2, dtype: float64, Newspaper
    Newspaper1    240259
    Newspaper2    185189
    Newspaper6    145675
    Name: Region1Total, dtype: float64, Newspaper
    Newspaper6    14500.0
    Newspaper1     8790.0
    Newspaper2     7890.0
    Name: City3, dtype: float64, Newspaper
    Newspaper2    28850.0
    Newspaper6    25330.0
    Newspaper1    21176.0
    Name: City4, dtype: float64, Newspaper
    Newspaper6    36300
    Newspaper2    34460
    Newspaper1    34460
    Name: Region2Total, dtype: float64, Newspaper]

What I want is the top 3 selling newspapers in each city and region, along with the sales numbers, arranged in descending order. I also want the name of the city/region to be printed before the top 3 result.

The expected output is a list or a series like the one below:

Newspaper     City1
Newspaper1    231295
Newspaper2    169937
Newspaper6    137990

Newspaper     City2
Newspaper2    12130.0
Newspaper1     8391.0
Newspaper6     8010.0

Newspaper     Region1Total
Newspaper1    240259
Newspaper2    185189
Newspaper6    145675

Newspaper     City3
Newspaper6    14500.0
Newspaper1     8790.0
Newspaper2     7890.0

Newspaper     City4
Newspaper2    28850.0
Newspaper6    25330.0
Newspaper1    21176.0

Newspaper     Region2Total
Newspaper6    36300
Newspaper2    34460
Newspaper1    34460

Also, if I want to skip the regions and consider only the cities, how can I do that? Any help would be appreciated. Thank you in advance.

Recommended answer

First of all, you need a dataframe that lists only the newspapers, without the Total rows.

dff = df.loc[df['Newspaper']!='Total']
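
Note that this filter keeps the rows of every month. Since the question only looks at the most recent month, the two conditions can be combined. The following is a minimal sketch, assuming `df` has the `Date` and `Newspaper` columns shown in the question; the small frame here is a sample built from a few of the question's rows, not the full dataset.

```python
import pandas as pd

# Sample rows copied from the question (assumption: the real data is loaded elsewhere)
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2017-12-01", "2017-12-01", "2018-01-01", "2018-01-01", "2018-01-01"]
    ),
    "Newspaper": ["NewsPaper1", "Total", "NewsPaper1", "NewsPaper2", "Total"],
    "City1": [231563, 637547, 231295, 169937, 633919],
})

# Keep only rows from the latest date and drop the 'Total' row
latest = df["Date"].max()
dff = df.loc[(df["Date"] == latest) & (df["Newspaper"] != "Total")]
print(dff)
```

Only the January 2018 newspaper rows remain, so any ranking computed from `dff` is for that month alone.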

Then, for City1, you can do:

dff[['Newspaper', 'City1']].sort_values(['City1'], ascending=False).head(3)

Output:

     Newspaper  City1
0   NewsPaper1  231563
1   NewsPaper2  173009
5   NewsPaper6  137650
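
As an alternative sketch, `DataFrame.nlargest` should return the same three rows; the pandas documentation describes it as equivalent to `sort_values(columns, ascending=False).head(n)` but more performant. The sample frame below reuses the December `City1` figures from the question.

```python
import pandas as pd

# December 2017 City1 figures from the question, without the 'Total' row
dff = pd.DataFrame({
    "Newspaper": ["NewsPaper1", "NewsPaper2", "NewsPaper3",
                  "NewsPaper4", "NewsPaper5", "NewsPaper6"],
    "City1": [231563, 173009, 40511, 37770, 5176, 137650],
})

# Three rows with the largest City1 values, in descending order
top3 = dff.nlargest(3, "City1")[["Newspaper", "City1"]]
print(top3)
```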

You can get the results for all the other columns of interest in the same way.
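
One way to repeat this for every column, printing the city or region name before each top 3, is a simple loop. This is only a sketch under some assumptions: region columns are identified by their names ending in `Total` (true for the columns in the question, but a naming convention rather than anything pandas knows about), and `include_regions` is a hypothetical switch added here for illustration. The sample frame uses the January 2018 figures from the question for three of the columns.

```python
import pandas as pd

# January 2018 rows from the question, without the 'Total' row
dff = pd.DataFrame({
    "Newspaper": ["NewsPaper1", "NewsPaper2", "NewsPaper3",
                  "NewsPaper4", "NewsPaper5", "NewsPaper6"],
    "City1": [231295, 169937, 40453, 37766, 5136, 137990],
    "City2": [8391, 12130, 4570, 2970, 900, 8010],
    "Region1Total": [239686, 182067, 45023, 40736, 6036, 146000],
})

# Hypothetical switch: set to True to also report the '...Total' region columns
include_regions = False

value_cols = [c for c in dff.columns if c != "Newspaper"]
if not include_regions:
    # Assumption: region columns are the ones whose names end in 'Total'
    value_cols = [c for c in value_cols if not c.endswith("Total")]

by_paper = dff.set_index("Newspaper")
top3 = {}
for col in value_cols:
    top3[col] = by_paper[col].nlargest(3)  # already sorted in descending order
    print(col)                             # city/region name first
    print(top3[col].to_string())           # then newspaper and sales figures
    print()
```

Keeping the results in a dict keyed by column name makes it easy to look up a single city later, for example `top3["City2"]`.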
