Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical
Question
Imagine that I've got the following DataFrame:
A | B | C | D
-------------------------------
2000-01-01 00:00:00 | 1 | 1 | 1
2000-01-01 00:04:30 | 1 | 2 | 2
2000-01-01 00:04:30 | 2 | 3 | 3
2000-01-02 00:00:00 | 1 | 4 | 4
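For anyone who wants to reproduce the example, the table above can be built roughly as follows. This is a minimal sketch that assumes column A holds real `datetime64` values rather than strings, and that the frame has the default integer index.

```python
import pandas as pd

# Rebuild the example table; A is parsed into datetime64 values.
df = pd.DataFrame({
    "A": pd.to_datetime([
        "2000-01-01 00:00:00",
        "2000-01-01 00:04:30",
        "2000-01-01 00:04:30",
        "2000-01-02 00:00:00",
    ]),
    "B": [1, 1, 2, 1],
    "C": [1, 2, 3, 4],
    "D": [1, 2, 3, 4],
})
print(df)
```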
I want to drop rows where the values in B are equal and the values in A are "close", say within five minutes of each other. In this case that means dropping the first two rows and keeping the last two.
So, instead of doing df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False), I'd like something more like df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False, func={'A': some_func}), with
def some_func(ts1, ts2):
    delta = ts1 - ts2
    return abs(delta.total_seconds()) >= 5 * 60
Is there a way to do this in pandas?
Recommended answer
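# True where a row is less than five minutes after the previous row with the same B
m = df.groupby('B').A.diff() < pd.Timedelta(minutes=5)
# Drop each flagged row and the row just before it, provided B occurs more than once
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1, fill_value=False))
df[~m2]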
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Details
m is a mask of all rows that fall within five minutes of the previous row in the same B group.
m
0 False
1 True
2 False
3 False
Name: A, dtype: bool
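One detail worth checking when building a mask like m is how the gap is measured. `Series.dt.seconds` returns only the seconds component of each timedelta and ignores whole days, so a gap of one day and one minute would look like 60 seconds. Comparing against a `pd.Timedelta`, or using `dt.total_seconds()`, avoids that. The values below are made up purely for illustration.

```python
import pandas as pd

# Two illustrative gaps: 4.5 minutes, and one day plus one minute.
gaps = pd.Series(pd.to_timedelta(["0 days 00:04:30", "1 days 00:01:00"]))

component = gaps.dt.seconds        # seconds component only, days are ignored
total = gaps.dt.total_seconds()    # full duration in seconds
within = gaps < pd.Timedelta(minutes=5)  # direct comparison on the full duration

print(list(component))  # [270, 60]
print(list(total))      # [270.0, 86460.0]
print(list(within))     # [True, False]
```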
m2 is the final mask of all rows that must be dropped.
m2
0 True
1 True
2 False
3 False
dtype: bool
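The masks above rely on rows with the same B sitting next to each other and in time order. If that is not guaranteed, one possible variation is to sort first and compare each row with both its previous and next neighbour inside its own group. The sketch below assumes a unique index; `drop_close_duplicates` and its parameter names are hypothetical, not part of pandas.

```python
import pandas as pd

def drop_close_duplicates(df, key="B", ts="A", tol=pd.Timedelta(minutes=5)):
    """Drop every row that has a same-`key` neighbour closer than `tol`.

    Hypothetical helper for illustration; assumes df.index is unique.
    """
    # Sort so that neighbours in time are adjacent within each group.
    ordered = df.sort_values([key, ts])
    grouped = ordered.groupby(key)[ts]
    # Gap to the previous and to the next row of the same group (NaT at the edges).
    gap_prev = grouped.diff()
    gap_next = grouped.diff(-1).abs()
    # NaT compares as False, so rows without a neighbour are never flagged.
    close = (gap_prev < tol) | (gap_next < tol)
    # Realign the mask to the original row order before filtering.
    return df[~close.reindex(df.index)]

df = pd.DataFrame({
    "A": pd.to_datetime([
        "2000-01-01 00:00:00",
        "2000-01-01 00:04:30",
        "2000-01-01 00:04:30",
        "2000-01-02 00:00:00",
    ]),
    "B": [1, 1, 2, 1],
    "C": [1, 2, 3, 4],
    "D": [1, 2, 3, 4],
})
result = drop_close_duplicates(df)
print(result)
```

On the example data this keeps the rows with index 2 and 3, matching the result shown in the answer. Sorting costs a little extra, but it makes the outcome independent of the order in which the rows arrive.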