Pandas:查找重复编辑(连续) - 所有连续编辑都发生在小时间窗口内 [英] Pandas: Find repeat edits (consecutive) - all consecutive edits have happened in small time window

查看:70
本文介绍了Pandas:查找重复编辑(连续) - 所有连续编辑都发生在小时间窗口内的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

处理用户上传文档和编辑文档的数据集.目标是识别相同的重复/连续编辑.

要符合连续重复编辑的条件,编辑必须由同一用户、同一文档在短时间内完成,编辑类型应为 editwordsAdded 应该相同.

editTime userId docId editType wordsAdded2000-01-01-13:23:12 101 p101 编辑 052000-01-01-13:24:00 101 p101 编辑 052000-01-01-13:24:10 101 p101 编辑 052000-01-01-13:24:11 101 p101 编辑 052000-01-01-13:25:00 101 p101 编辑 052000-01-01-13:25:13 101 p101 编辑 102000-01-01-13:24:14 101 p101 编辑 12000-01-01-15:28:12 101 d101 上传 002000-01-01-15:30:00 101 d101 编辑 22000-01-01-15:30:01 101 d101 编辑 22000-01-01-15:30:04 101 p101 编辑 122000-01-01-15:30:10 101 p101 编辑 122000-01-01-15:30:11 101 p101 编辑 122000-01-03-11:45:01 101 p102 编辑 442000-01-04-11:45:03 101 c101 编辑 442000-01-03-09:32:04 300 c201 编辑 052000-01-03-13:33:05 300 c301 编辑 042000-01-04-15:12:06 300 c401 编辑 192000-01-04-15:12:10 300 c401 编辑 192000-01-05-16:32:08 300 c401 编辑 32

在上述数据集中,有效的连续编辑是:

2000-01-01-13:23:12 101 p101 编辑 052000-01-01-13:24:00 101 p101 编辑 052000-01-01-13:24:10 101 p101 编辑 052000-01-01-13:24:11 101 p101 编辑 052000-01-01-13:25:00 101 p101 编辑 052000-01-01-15:30:00 101 d101 编辑 22000-01-01-15:30:01 101 d101 编辑 22000-01-01-15:30:04 101 p101 编辑 122000-01-01-15:30:10 101 p101 编辑 122000-01-04-15:12:06 300 c401 编辑 192000-01-04-15:12:10 300 c401 编辑 19

我的方法是:

将时间转换为纪元:

df['editTime'] = df['editTime'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d-%H:%M:%S");).timestamp())multi_edits = df[df.duplicated(['用户身份','文档编号','添加的话','编辑类型'],keep=False)].where(df.editType == 'edit').sort_values(['userId','docId','editType','editTime'])

为了获取时间跨度添加了一个方法:

def time_span(df: pd.DataFrame):df['time_diff'] = df['editTime'].diff(1)返回 df

然后按multi_edits分组得到总的连续编辑和字数

multi_edit_sum_count = multi_edits.groupby(['用户身份','文档编号','添加的话','editType']).应用(时间跨度)multi_edit_30s = multi_edit_sum_count[multi_edit_sum_count.time_diff <30].agg({'wordsAdded':'sum', 'userId':'count'})

我认为这不是准确的方法.在熊猫中也可能有更好的方法来做到这一点.

解决方案

我们可以编写一些逻辑来检查时差是否最大 60 秒(您可以根据需要进行编辑)以及组是否超过 1 行:

def check_time_difference(s):fill_value = pd.Timedelta(0)m1 = s.diff().fillna(fill_value).dt.seconds.le(60).all()m2 = s.shape[0] >1返回 m1 &平方米grp_cols = [userId"、docId"、editType"、wordsAdded"]df[df.groupby(grp_cols)[editTime"].transform(check_time_difference)]

输出

<预><代码>editTime userId docId editType wordsAdded0 2000-01-01 13:23:12 101 p101 编辑 51 2000-01-01 13:24:00 101 p101 编辑 52 2000-01-01 13:24:10 101 p101 编辑 53 2000-01-01 13:24:11 101 p101 编辑 54 2000-01-01 13:25:00 101 p101 编辑 58 2000-01-01 15:30:00 101 d101 编辑 29 2000-01-01 15:30:01 101 d101 编辑 210 2000-01-01 15:30:04 101 p101 编辑 1211 2000-01-01 15:30:10 101 p101 编辑 1212 2000-01-01 15:30:11 101 p101 编辑 1217 2000-01-04 15:12:06 300 c401 编辑 1918 2000-01-04 15:12:10 300 c401 编辑 19

Working with a data set where users are uploading documents and editing the documents. Goal is to identify repeating / consecutive edits that are identical.

To qualify as a consecutive repeating edit, the edit must be done by the same user, same document within a small time window, type of the edit should be edit and wordsAdded should be same.

editTime               userId   docId     editType  wordsAdded
2000-01-01-13:23:12        101     p101       edit        05
2000-01-01-13:24:00        101     p101       edit        05
2000-01-01-13:24:10        101     p101       edit        05
2000-01-01-13:24:11        101     p101       edit        05
2000-01-01-13:25:00        101     p101       edit        05
2000-01-01-13:25:13        101     p101       edit        10
2000-01-01-13:24:14        101     p101       edit        1
2000-01-01-15:28:12        101     d101       upload      00
2000-01-01-15:30:00        101     d101       edit         2
2000-01-01-15:30:01        101     d101       edit         2
2000-01-01-15:30:04        101     p101       edit        12
2000-01-01-15:30:10        101     p101       edit        12
2000-01-01-15:30:11        101     p101       edit        12
2000-01-03-11:45:01        101     p102       edit        44
2000-01-04-11:45:03        101     c101       edit        44
2000-01-03-09:32:04        300     c201       edit        05
2000-01-03-13:33:05        300     c301       edit        04
2000-01-04-15:12:06        300     c401       edit        19
2000-01-04-15:12:10        300     c401       edit        19
2000-01-05-16:32:08        300     c401       edit        32

In the above data set the valid consecutive edits are:

2000-01-01-13:23:12        101     p101       edit        05
2000-01-01-13:24:00        101     p101       edit        05
2000-01-01-13:24:10        101     p101       edit        05
2000-01-01-13:24:11        101     p101       edit        05
2000-01-01-13:25:00        101     p101       edit        05

2000-01-01-15:30:00        101     d101       edit         2
2000-01-01-15:30:01        101     d101       edit         2
2000-01-01-15:30:04        101     p101       edit        12
2000-01-01-15:30:10        101     p101       edit        12

2000-01-04-15:12:06        300     c401       edit        19
2000-01-04-15:12:10        300     c401       edit        19

My approach for this was:

Convert the time to epoch:

df['editTime'] = df['editTime'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d-%H:%M:%S").timestamp())

multi_edits = df[df.duplicated(
    ['userId', 
     'docId', 
     'wordsAdded', 
     'editType'], 
    keep=False)].where(
    df.editType == 'edit').sort_values(
    ['userId','docId', 'editType', 'editTime']
)

In order to get the time span added a method:

def time_span(df: pd.DataFrame):
    df['time_diff'] = df['editTime'].diff(1)
    return df

Then Group by the multi_edits to get the total consecutive edits and word count

multi_edit_sum_count = multi_edits.groupby(
    ['userId', 
     'docId', 
     'wordsAdded', 
     'editType']).apply(time_span)

multi_edit_30s = multi_edit_sum_count[
    multi_edit_sum_count.time_diff < 30].agg(
    {'wordsAdded':'sum', 'userId':'count'})

I don't think this is the accurate approach. Also there might be a better approach in pandas to do this.

解决方案

We can write some logic which checks if the time difference is max 60 seconds (you can edit this to your needs) and if the group is more than 1 row:

def check_time_difference(s):
    fill_value = pd.Timedelta(0)
    m1 = s.diff().fillna(fill_value).dt.seconds.le(60).all()
    m2 = s.shape[0] > 1
    
    return m1 & m2


grp_cols = ["userId", "docId", "editType", "wordsAdded"]
df[df.groupby(grp_cols)["editTime"].transform(check_time_difference)]

Output


              editTime  userId docId editType  wordsAdded
0  2000-01-01 13:23:12     101  p101     edit           5
1  2000-01-01 13:24:00     101  p101     edit           5
2  2000-01-01 13:24:10     101  p101     edit           5
3  2000-01-01 13:24:11     101  p101     edit           5
4  2000-01-01 13:25:00     101  p101     edit           5
8  2000-01-01 15:30:00     101  d101     edit           2
9  2000-01-01 15:30:01     101  d101     edit           2
10 2000-01-01 15:30:04     101  p101     edit          12
11 2000-01-01 15:30:10     101  p101     edit          12
12 2000-01-01 15:30:11     101  p101     edit          12
17 2000-01-04 15:12:06     300  c401     edit          19
18 2000-01-04 15:12:10     300  c401     edit          19

这篇关于Pandas:查找重复编辑(连续) - 所有连续编辑都发生在小时间窗口内的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆