Pandas - Replace Duplicates with NaN and Keep Row


Problem description

How do I replace the duplicates within each group with NaN while keeping the rows?

I need to keep every row rather than drop it, ideally leaving the original value in place at its first occurrence.

import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00','2019-01-01 01:00:00','2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00','2019-09-01 03:00:00','2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10,10,10,10,12,12,12,12],
    'ID': ['Jackie','Jackie','Jackie','Jackie','Zoop','Zoop','Zoop','Zoop',]
})

# infer_datetime_format is deprecated since pandas 2.0; the default parser handles this format
df['date'] = pd.to_datetime(df['date'])


                 date  value      ID
0 2019-01-01 00:00:00     10  Jackie
1 2019-01-01 01:00:00     10  Jackie
2 2019-01-01 02:00:00     10  Jackie
3 2019-01-01 03:00:00     10  Jackie
4 2019-09-01 02:00:00     12    Zoop
5 2019-09-01 03:00:00     12    Zoop
6 2019-09-01 04:00:00     12    Zoop
7 2019-09-01 05:00:00     12    Zoop

Desired dataframe:

                 date  value      ID
0 2019-01-01 00:00:00     10  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00     12    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop

Duplicated values should only be replaced within the same date, regardless of how often they occur. So if the value 10 shows up twice on Jan-1 and three times on Jan-2, it should remain only once on Jan-1 and once on Jan-2.

Recommended answer

I assume you check for duplicates on the columns value and ID, and additionally on the calendar date taken from the date column.

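import numpy as np

# Flag rows that repeat an earlier (value, ID, calendar date) combination and blank their value
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = np.nan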

Out[269]:
                 date  value      ID
0 2019-01-01 00:00:00   10.0  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00   12.0    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop
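The same idea can be sketched on data that spans two days, which is the case the question describes (value 10 twice on Jan-1 and three times on Jan-2). This is only an illustrative variant, not the answer's exact code: the frame name demo and its contents are made up for the example, and Series.mask is swapped in for the .loc assignment because it returns a new float column, so no NaN is written into an integer column in place.

```python
import pandas as pd

# Hypothetical sample: value 10 appears twice on Jan-1 and three times on Jan-2.
demo = pd.DataFrame({
    'date': pd.to_datetime([
        '2019-01-01 00:00:00', '2019-01-01 01:00:00',
        '2019-01-02 00:00:00', '2019-01-02 01:00:00', '2019-01-02 02:00:00',
    ]),
    'value': [10, 10, 10, 10, 10],
    'ID': ['Jackie'] * 5,
})

# True for every row that repeats an earlier (value, ID, calendar day) combination.
# dt.normalize() resets the time to midnight, so rows are compared by day only.
is_dup = demo.assign(d=demo['date'].dt.normalize()).duplicated(['value', 'ID', 'd'])

# Series.mask replaces the flagged entries with NaN and returns a new column.
demo['value'] = demo['value'].mask(is_dup)

print(demo)
```

All five rows are kept; only the first row of each day retains its value.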

As @Trenton suggests, you can use pd.NA instead to avoid importing numpy.

(Note: as @rafaelc suggests, here is a link explaining the differences between pd.NA and np.nan in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)

df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA

Out[273]:
                 date value      ID
0 2019-01-01 00:00:00    10  Jackie
1 2019-01-01 01:00:00  <NA>  Jackie
2 2019-01-01 02:00:00  <NA>  Jackie
3 2019-01-01 03:00:00  <NA>  Jackie
4 2019-09-01 02:00:00    12    Zoop
5 2019-09-01 03:00:00  <NA>    Zoop
6 2019-09-01 04:00:00  <NA>    Zoop
7 2019-09-01 05:00:00  <NA>    Zoop
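If the column should stay an integer column while holding missing values, one option is to convert it to the nullable Int64 dtype before assigning pd.NA. The following is a minimal sketch under that assumption; the frame df2 is a shortened, made-up copy of the question's data, and the explicit astype('Int64') step is an addition that is not part of the answer above.

```python
import pandas as pd

# Shortened, hypothetical copy of the question's data.
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
                            '2019-09-01 02:00:00', '2019-09-01 03:00:00']),
    'value': [10, 10, 12, 12],
    'ID': ['Jackie', 'Jackie', 'Zoop', 'Zoop'],
})

# Rows repeating an earlier (value, ID, calendar date) combination.
is_dup = df2.assign(d=df2['date'].dt.date).duplicated(['value', 'ID', 'd'])

# Convert first so the column can hold integers and missing values together.
df2['value'] = df2['value'].astype('Int64')
df2.loc[is_dup, 'value'] = pd.NA

print(df2)
```

With this dtype the kept values print as integers (10, 12) rather than floats, and the missing entries print as <NA>.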
