Sort/select the unique and the latest data

Problem description

I'm trying to take the most relevant value from my data. I figured out how to get the oldest and the most recent dates using the max and min functions in pandas, but I couldn't work out the rest. I want one row per unique company and product, with the remaining columns chosen according to the rules below. If anyone could point me to the tools Python offers for this kind of problem, or give guidance on how it is usually handled, that would be very helpful.

  • for security_level, supersevere > severe > moderate > material > minor
  • for rating, take true if the same company and product has both true and false
  • for rating_level, critical > high > medium > low
  • for first_release, the oldest date, and for last_release, the most recent date
  • for score, the highest score among rows with the same product and company

company  product  security_level  rating  rating_level  first_release  last_release  score
google   mobile   minor           TRUE    critical      04/11/2020     03/17/2020    0.5
google   os       moderate        FALSE   medium        09/05/2019     03/20/2021    0.009
google   os       minor           FALSE   low           09/04/2019     05/11/2020    19
google   tv       severe          TRUE    high          08/11/2020     03/04/2021
google   mobile   supersevere     FALSE   medium        04/06/2015     08/19/2020    2.4
google   mobile   minor           FALSE   high          08/08/2019     08/19/2020    1.3
apple    iphone   minor           TRUE    low           02/03/2020     10/13/2020    3
apple    iphone   material        TRUE    medium        01/21/2018     03/04/2021    6
apple    iwatch   material        FALSE   low           04/11/2015     08/13/2020    8
apple    iphone   material        TRUE    medium        10/20/2020     03/19/2021    5
dell     laptop   minor           FALSE   low           01/05/2021     03/20/2021    1
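
For anyone who wants to reproduce the problem, the sample rows above can be loaded into a DataFrame roughly as follows. This is only a sketch: the comma-separated layout and the variable name `raw` are assumptions made for illustration, since the question shows the data as a whitespace-separated table.

```python
import io

import pandas as pd

# Sample rows copied from the question, rewritten as CSV (an assumption);
# the "google tv" row has no score, so its last field is left empty.
raw = """company,product,security_level,rating,rating_level,first_release,last_release,score
google,mobile,minor,TRUE,critical,04/11/2020,03/17/2020,0.5
google,os,moderate,FALSE,medium,09/05/2019,03/20/2021,0.009
google,os,minor,FALSE,low,09/04/2019,05/11/2020,19
google,tv,severe,TRUE,high,08/11/2020,03/04/2021,
google,mobile,supersevere,FALSE,medium,04/06/2015,08/19/2020,2.4
google,mobile,minor,FALSE,high,08/08/2019,08/19/2020,1.3
apple,iphone,minor,TRUE,low,02/03/2020,10/13/2020,3
apple,iphone,material,TRUE,medium,01/21/2018,03/04/2021,6
apple,iwatch,material,FALSE,low,04/11/2015,08/13/2020,8
apple,iphone,material,TRUE,medium,10/20/2020,03/19/2021,5
dell,laptop,minor,FALSE,low,01/05/2021,03/20/2021,1
"""

# Read the text as if it were a file on disk.
df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)
```

With default settings, `read_csv` should parse TRUE/FALSE into booleans and the empty score into NaN, while the two date columns remain plain strings until they are converted.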

Output:

company  product  security_level  rating  rating_level  first_release  last_release  score
google   mobile   supersevere     TRUE    critical      04/06/2015     08/19/2020    2.4
google   os       moderate        FALSE   medium        09/04/2019     03/20/2021    19
google   tv       severe          TRUE    high          08/11/2020     03/04/2021
apple    iphone   material        TRUE    medium        01/21/2018     03/19/2021    6
apple    iwatch   material        FALSE   low           04/11/2015     08/13/2020    8
dell     laptop   minor           FALSE   low           01/05/2021     03/20/2021    1

Solution

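Change the dtype of the first_release and last_release columns to datetime, so that min and max compare the dates chronologically rather than as strings:

import pandas as pd
# Parse the month/day/year strings into datetime values
df['last_release']  = pd.to_datetime(df['last_release'])
df['first_release'] = pd.to_datetime(df['first_release'])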
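
If there is any doubt about how the date strings will be interpreted, the format can be spelled out. The following is a small sketch using two dates from the question; the explicit `format` argument is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

# Two first_release values from the question, written month/day/year.
dates = pd.Series(['04/06/2015', '08/19/2020'])

# An explicit format removes any ambiguity between month-first and day-first parsing.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

print(parsed.min())  # earliest date
print(parsed.max())  # most recent date
```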

Convert the security_level and rating_level columns to an ordered categorical type, so that max follows the ranking given in the question instead of alphabetical order:

# Categories are listed from lowest to highest
df['rating_level'] = pd.Categorical(df['rating_level'], ['low', 'medium', 'high', 'critical'], ordered=True)
df['security_level'] = pd.Categorical(df['security_level'], ['minor', 'material', 'moderate', 'severe', 'supersevere'], ordered=True)
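
To see why the ordering matters, here is a minimal sketch using the security_level values of the apple/iphone rows. The variable names (`levels`, `plain_max`, `ordered_max`, `typo`) are illustrative only.

```python
import pandas as pd

# Ranking from the question, lowest to highest.
levels = ['minor', 'material', 'moderate', 'severe', 'supersevere']

# security_level values of the three apple/iphone rows.
s = pd.Series(['minor', 'material', 'material'])

# On plain strings, max() is alphabetical, which picks 'minor' here.
plain_max = s.max()

# On an ordered categorical, max() follows the declared ranking instead.
cat = pd.Series(pd.Categorical(s, categories=levels, ordered=True))
ordered_max = cat.max()

print(plain_max, ordered_max)

# A value that is not in the category list (for example a misspelling)
# is silently turned into NaN, so the labels need to match exactly.
typo = pd.Categorical(['superservere'], categories=levels, ordered=True)
print(typo.isna())
```

Because unknown labels become NaN rather than raising an error, it may be worth checking the column for missing values after the conversion.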

Group the dataframe on the company and product columns, and aggregate the remaining columns with the functions specified in agg_dict:

# One aggregation per column: highest level, any True, earliest and latest date, top score
agg_dict = {'security_level': 'max', 'rating': 'max', 'rating_level': 'max',
            'first_release': 'min', 'last_release': 'max', 'score': 'max'}

# as_index=False keeps company/product as columns; sort=False keeps first-seen group order
out = df.groupby(['company', 'product'], as_index=False, sort=False).agg(agg_dict)

Result

>>> out

  company product security_level  rating rating_level first_release last_release  score
0  google  mobile    supersevere    True     critical    2015-04-06   2020-08-19    2.4
1  google      os       moderate   False       medium    2019-09-04   2021-03-20   19.0
2  google      tv         severe    True         high    2020-08-11   2021-03-04    NaN
3   apple  iphone       material    True       medium    2018-01-21   2021-03-19    6.0
4   apple  iwatch       material   False          low    2015-04-11   2020-08-13    8.0
5    dell  laptop          minor   False          low    2021-01-05   2021-03-20    1.0
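
To check the steps above end to end, the following self-contained sketch runs them on a four-row subset of the question's data (the three google/mobile rows and the dell/laptop row). The subset and the explicit date format are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

# Four rows taken from the question's table.
df = pd.DataFrame({
    'company': ['google', 'google', 'google', 'dell'],
    'product': ['mobile', 'mobile', 'mobile', 'laptop'],
    'security_level': ['minor', 'supersevere', 'minor', 'minor'],
    'rating': [True, False, False, False],
    'rating_level': ['critical', 'medium', 'high', 'low'],
    'first_release': ['04/11/2020', '04/06/2015', '08/08/2019', '01/05/2021'],
    'last_release': ['03/17/2020', '08/19/2020', '08/19/2020', '03/20/2021'],
    'score': [0.5, 2.4, 1.3, 1.0],
})

# Step 1: parse the date strings so min/max are chronological.
for col in ['first_release', 'last_release']:
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

# Step 2: declare the rankings, lowest to highest.
df['rating_level'] = pd.Categorical(
    df['rating_level'], ['low', 'medium', 'high', 'critical'], ordered=True)
df['security_level'] = pd.Categorical(
    df['security_level'],
    ['minor', 'material', 'moderate', 'severe', 'supersevere'], ordered=True)

# Step 3: one aggregation per column, grouped by company and product.
agg_dict = {'security_level': 'max', 'rating': 'max', 'rating_level': 'max',
            'first_release': 'min', 'last_release': 'max', 'score': 'max'}
out = df.groupby(['company', 'product'], as_index=False, sort=False).agg(agg_dict)

print(out)
```

Since True compares greater than False, taking the max of the boolean rating column yields True whenever any row in the group is True, which matches the rule stated in the question.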
