Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical

Question

Imagine that I've got the following DataFrame:

            A        | B | C | D
 -------------------------------
 2000-01-01 00:00:00 | 1 | 1 | 1
 2000-01-01 00:04:30 | 1 | 2 | 2
 2000-01-01 00:04:30 | 2 | 3 | 3
 2000-01-02 00:00:00 | 1 | 4 | 4
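
To reproduce the table, something like the following should work. It is only a sketch: it assumes A is a regular datetime column (not the index) and that the frame uses the default integer index.

```python
import pandas as pd

# Rebuild the example frame; A is parsed as datetime64 so that
# differences between rows come out as timedeltas.
df = pd.DataFrame({
    "A": pd.to_datetime([
        "2000-01-01 00:00:00",
        "2000-01-01 00:04:30",
        "2000-01-01 00:04:30",
        "2000-01-02 00:00:00",
    ]),
    "B": [1, 1, 2, 1],
    "C": [1, 2, 3, 4],
    "D": [1, 2, 3, 4],
})
print(df)
```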

I want to drop rows where the values in B are equal and the values in A are "close", say, within five minutes of each other. So in this case the first two rows should be dropped and the last two kept.

So, instead of doing df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False), I'd like something more like a hypothetical df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False, func={'A': some_func}), with

def some_func(ts1, ts2):
    delta = ts1 - ts2
    return abs(delta.total_seconds()) >= 5 * 60

Is there a way to do this in pandas?

Recommended answer

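# True where a row is less than 5 minutes after the previous row with the same B
# (assumes the rows are already sorted by A within each B)
m = df.groupby('B').A.diff().dt.total_seconds() < 300
# Also flag the row just before each flagged row, so both halves of a close pair go
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1, fill_value=False))
df[~m2]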
                    A  B  C  D
2 2000-01-01 00:04:30  2  3  3
3 2000-01-02 00:00:00  1  4  4


Details

m is a mask of the rows that come less than 5 minutes after the previous row with the same B.

m

0    False
1     True
2    False
3    False
Name: A, dtype: bool
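
One detail worth checking when building a mask like m from time differences: `.dt.seconds` returns only the seconds component of each timedelta and ignores the days part, while `.dt.total_seconds()` returns the full duration. A small sketch with made-up gaps (the values are illustrative, not taken from the question):

```python
import pandas as pd

# Two illustrative gaps: 4.5 minutes, and one day plus ten seconds
gaps = pd.Series(pd.to_timedelta(["0 days 00:04:30", "1 days 00:00:10"]))

# .dt.seconds drops the days part, so the second gap looks like 10 seconds
print(gaps.dt.seconds.tolist())          # [270, 10]

# .dt.total_seconds() measures the whole duration
print(gaps.dt.total_seconds().tolist())  # [270.0, 86410.0]
```

With a 300-second threshold, the second gap would wrongly count as "close" under `.dt.seconds`, so `.dt.total_seconds()` is the safer choice when gaps can exceed a day.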

m2 is the final mask of all rows that must be dropped.

m2

0     True
1     True
2    False
3    False
dtype: bool
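
The same idea can be wrapped in a small helper. This is a sketch, not part of the original answer: the function name `drop_close_duplicates` and its parameters are made up for illustration, and it assumes the frame has a unique index. It differs from the masks above in that it sorts by timestamp first and looks at the previous and next row within each group, so it does not depend on the row order of the input.

```python
import pandas as pd

def drop_close_duplicates(df, key="B", ts="A", window="5min"):
    """Hypothetical helper: drop every row that shares `key` with a
    time-adjacent row whose `ts` is less than `window` away."""
    limit = pd.Timedelta(window)
    grouped = df.sort_values(ts).groupby(key)[ts]
    # Gap to the previous and to the next row within the same key;
    # the first/last row of a group gets NaT, which compares as False.
    to_prev = grouped.diff()
    to_next = grouped.diff(-1).abs()
    close = (to_prev < limit) | (to_next < limit)
    # Put the mask back in the caller's row order before filtering
    return df[~close.reindex(df.index)]

# Example frame from the question
df = pd.DataFrame({
    "A": pd.to_datetime([
        "2000-01-01 00:00:00",
        "2000-01-01 00:04:30",
        "2000-01-01 00:04:30",
        "2000-01-02 00:00:00",
    ]),
    "B": [1, 1, 2, 1],
    "C": [1, 2, 3, 4],
    "D": [1, 2, 3, 4],
})
result = drop_close_duplicates(df)
print(result)
```

On the example data this keeps the rows with index 2 and 3, matching the output above. Note that a chain of rows each within the window of its neighbour is dropped as a whole, which mirrors keep=False.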
