pandas dataframe: duplicates based on column and time range


Problem description

I have a pandas dataframe (heavily simplified here) that looks like this:

df

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

What I would like to do now is get all duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:

   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

The third row (index 2) is left out: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.

I tried passing the columns datetime and msg as the subset for the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []

Is there a way to define a range for my "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)

Any help here would as always be very much appreciated.

Recommended answer

This piece of code gives the expected output:

df[df.groupby("msg")["datetime"].diff().fillna(pd.Timedelta(0)).dt.total_seconds() <= 3]

I grouped the dataframe on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between consecutive values within each group. The NaT values (the first row of each group, which has no predecessor) are filled with zero, and only the rows whose difference is at most 3 seconds are selected.
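To make the intermediate step concrete, the sketch below rebuilds the sample frame from the question and prints the per-group gaps before they are compared with the threshold. The construction code is an assumption, since the question only shows the printed frame.

```python
import pandas as pd

# Sample data reconstructed from the question's printed frame (assumed construction).
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world"] * 3 + ["hello you"] * 2,
})

# Difference to the previous row that carries the same message.
# The first row of each group has no predecessor, so its value is NaT.
gaps = df.groupby("msg")["datetime"].diff()

# Replace NaT with a zero-length Timedelta and express each gap in seconds.
gap_seconds = gaps.fillna(pd.Timedelta(0)).dt.total_seconds()
print(gap_seconds)
```

This sketch measures the gap with total_seconds(), which includes whole days. The .dt.seconds accessor returns only the seconds component of each Timedelta, so a gap of ten days and two seconds would read as 2.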

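Before using the code above, make sure that your dataframe is sorted by datetime in ascending order and that the "datetime" column holds actual datetime values (for example, converted with pd.to_datetime) rather than strings.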
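One limitation worth noting: comparing each row only with its predecessor always keeps the first row of every message group, even when that message has no near-duplicate at all. A possible variant, sketched below, checks both directions. It assumes a row should count as a duplicate when either its previous or its next occurrence of the same message is within 3 seconds. The helper name near_duplicates, its window parameter, and the sixth sample row are made up for illustration and do not come from the question or the answer.

```python
import pandas as pd


def near_duplicates(df, window="3s"):
    """Return rows whose message reappears within `window` (hypothetical helper)."""
    limit = pd.Timedelta(window)
    # Sorting guarantees that diff() compares chronological neighbours.
    ordered = df.sort_values("datetime")
    by_msg = ordered.groupby("msg")["datetime"]
    # Gap to the previous and to the next row with the same message.
    gap_prev = by_msg.diff()
    gap_next = by_msg.diff(-1).abs()
    # Comparisons against NaT are False, so rows without a neighbour drop out.
    mask = (gap_prev <= limit) | (gap_next <= limit)
    return ordered[mask].sort_index()


# Sample data from the question, plus one extra row (not in the question)
# holding a message that occurs only once.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
        "2012-11-23 09:00:00",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "type": ["txt"] * 6,
    "msg": ["hello world"] * 3 + ["hello you"] * 2 + ["unique message"],
})

result = near_duplicates(df)
print(result)
```

With this data the result contains the rows with index 0, 1, 3 and 4. Row 2 is too far from its neighbours, and the extra row 5 has no neighbour at all.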
