Pandas:查找重复编辑(连续) - 所有连续编辑都发生在小时间窗口内 [英] Pandas: Find repeat edits (consecutive) - all consecutive edits have happened in small time window
问题描述
处理用户上传文档和编辑文档的数据集.目标是识别相同的重复/连续编辑.
要符合连续重复编辑的条件,编辑必须由同一用户、同一文档在短时间内完成,编辑类型应为 edit
和 wordsAdded代码> 应该相同.
editTime userId docId editType wordsAdded2000-01-01-13:23:12 101 p101 编辑 052000-01-01-13:24:00 101 p101 编辑 052000-01-01-13:24:10 101 p101 编辑 052000-01-01-13:24:11 101 p101 编辑 052000-01-01-13:25:00 101 p101 编辑 052000-01-01-13:25:13 101 p101 编辑 102000-01-01-13:24:14 101 p101 编辑 12000-01-01-15:28:12 101 d101 上传 002000-01-01-15:30:00 101 d101 编辑 22000-01-01-15:30:01 101 d101 编辑 22000-01-01-15:30:04 101 p101 编辑 122000-01-01-15:30:10 101 p101 编辑 122000-01-01-15:30:11 101 p101 编辑 122000-01-03-11:45:01 101 p102 编辑 442000-01-04-11:45:03 101 c101 编辑 442000-01-03-09:32:04 300 c201 编辑 052000-01-03-13:33:05 300 c301 编辑 042000-01-04-15:12:06 300 c401 编辑 192000-01-04-15:12:10 300 c401 编辑 192000-01-05-16:32:08 300 c401 编辑 32
在上述数据集中,有效的连续编辑是:
2000-01-01-13:23:12 101 p101 编辑 052000-01-01-13:24:00 101 p101 编辑 052000-01-01-13:24:10 101 p101 编辑 052000-01-01-13:24:11 101 p101 编辑 052000-01-01-13:25:00 101 p101 编辑 052000-01-01-15:30:00 101 d101 编辑 22000-01-01-15:30:01 101 d101 编辑 22000-01-01-15:30:04 101 p101 编辑 122000-01-01-15:30:10 101 p101 编辑 122000-01-04-15:12:06 300 c401 编辑 192000-01-04-15:12:10 300 c401 编辑 19
我的方法是:
将时间转换为纪元:
df['editTime'] = df['editTime'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d-%H:%M:%S");).timestamp())multi_edits = df[df.duplicated(['用户身份','文档编号','添加的话','编辑类型'],keep=False)].where(df.editType == 'edit').sort_values(['userId','docId','editType','editTime'])
为了获取时间跨度添加了一个方法:
def time_span(df: pd.DataFrame):df['time_diff'] = df['editTime'].diff(1)返回 df
然后按multi_edits
分组得到总的连续编辑和字数
multi_edit_sum_count = multi_edits.groupby(['用户身份','文档编号','添加的话','editType']).应用(时间跨度)multi_edit_30s = multi_edit_sum_count[multi_edit_sum_count.time_diff <30].agg({'wordsAdded':'sum', 'userId':'count'})
我认为这不是准确的方法.在熊猫中也可能有更好的方法来做到这一点.
我们可以编写一些逻辑来检查时差是否最大 60 秒(您可以根据需要进行编辑)以及组是否超过 1 行:
def check_time_difference(s):fill_value = pd.Timedelta(0)m1 = s.diff().fillna(fill_value).dt.seconds.le(60).all()m2 = s.shape[0] >1返回 m1 &平方米grp_cols = [userId"、docId"、editType"、wordsAdded"]df[df.groupby(grp_cols)[editTime"].transform(check_time_difference)]
输出
<预><代码>editTime userId docId editType wordsAdded0 2000-01-01 13:23:12 101 p101 编辑 51 2000-01-01 13:24:00 101 p101 编辑 52 2000-01-01 13:24:10 101 p101 编辑 53 2000-01-01 13:24:11 101 p101 编辑 54 2000-01-01 13:25:00 101 p101 编辑 58 2000-01-01 15:30:00 101 d101 编辑 29 2000-01-01 15:30:01 101 d101 编辑 210 2000-01-01 15:30:04 101 p101 编辑 1211 2000-01-01 15:30:10 101 p101 编辑 1212 2000-01-01 15:30:11 101 p101 编辑 1217 2000-01-04 15:12:06 300 c401 编辑 1918 2000-01-04 15:12:10 300 c401 编辑 19Working with a data set where users are uploading documents and editing the documents. Goal is to identify repeating / consecutive edits that are identical.
To qualify as a consecutive repeating edit, the edit must be done by the same user, same document within a small time window, type of the edit should be edit
and wordsAdded
should be same.
editTime userId docId editType wordsAdded
2000-01-01-13:23:12 101 p101 edit 05
2000-01-01-13:24:00 101 p101 edit 05
2000-01-01-13:24:10 101 p101 edit 05
2000-01-01-13:24:11 101 p101 edit 05
2000-01-01-13:25:00 101 p101 edit 05
2000-01-01-13:25:13 101 p101 edit 10
2000-01-01-13:24:14 101 p101 edit 1
2000-01-01-15:28:12 101 d101 upload 00
2000-01-01-15:30:00 101 d101 edit 2
2000-01-01-15:30:01 101 d101 edit 2
2000-01-01-15:30:04 101 p101 edit 12
2000-01-01-15:30:10 101 p101 edit 12
2000-01-01-15:30:11 101 p101 edit 12
2000-01-03-11:45:01 101 p102 edit 44
2000-01-04-11:45:03 101 c101 edit 44
2000-01-03-09:32:04 300 c201 edit 05
2000-01-03-13:33:05 300 c301 edit 04
2000-01-04-15:12:06 300 c401 edit 19
2000-01-04-15:12:10 300 c401 edit 19
2000-01-05-16:32:08 300 c401 edit 32
In the above data set the valid consecutive edits are:
2000-01-01-13:23:12 101 p101 edit 05
2000-01-01-13:24:00 101 p101 edit 05
2000-01-01-13:24:10 101 p101 edit 05
2000-01-01-13:24:11 101 p101 edit 05
2000-01-01-13:25:00 101 p101 edit 05
2000-01-01-15:30:00 101 d101 edit 2
2000-01-01-15:30:01 101 d101 edit 2
2000-01-01-15:30:04 101 p101 edit 12
2000-01-01-15:30:10 101 p101 edit 12
2000-01-04-15:12:06 300 c401 edit 19
2000-01-04-15:12:10 300 c401 edit 19
My approach for this was:
Convert the time to epoch:
df['editTime'] = df['editTime'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d-%H:%M:%S").timestamp())
multi_edits = df[df.duplicated(
['userId',
'docId',
'wordsAdded',
'editType'],
keep=False)].where(
df.editType == 'edit').sort_values(
['userId','docId', 'editType', 'editTime']
)
In order to get the time span added a method:
def time_span(df: pd.DataFrame):
df['time_diff'] = df['editTime'].diff(1)
return df
Then Group by the multi_edits
to get the total consecutive edits and word count
multi_edit_sum_count = multi_edits.groupby(
['userId',
'docId',
'wordsAdded',
'editType']).apply(time_span)
multi_edit_30s = multi_edit_sum_count[
multi_edit_sum_count.time_diff < 30].agg(
{'wordsAdded':'sum', 'userId':'count'})
I don't think this is the accurate approach. Also there might be a better approach in pandas to do this.
We can write some logic which checks if the time difference is max 60 seconds (you can edit this to your needs) and if the group is more than 1 row:
def check_time_difference(s):
fill_value = pd.Timedelta(0)
m1 = s.diff().fillna(fill_value).dt.seconds.le(60).all()
m2 = s.shape[0] > 1
return m1 & m2
grp_cols = ["userId", "docId", "editType", "wordsAdded"]
df[df.groupby(grp_cols)["editTime"].transform(check_time_difference)]
Output
editTime userId docId editType wordsAdded
0 2000-01-01 13:23:12 101 p101 edit 5
1 2000-01-01 13:24:00 101 p101 edit 5
2 2000-01-01 13:24:10 101 p101 edit 5
3 2000-01-01 13:24:11 101 p101 edit 5
4 2000-01-01 13:25:00 101 p101 edit 5
8 2000-01-01 15:30:00 101 d101 edit 2
9 2000-01-01 15:30:01 101 d101 edit 2
10 2000-01-01 15:30:04 101 p101 edit 12
11 2000-01-01 15:30:10 101 p101 edit 12
12 2000-01-01 15:30:11 101 p101 edit 12
17 2000-01-04 15:12:06 300 c401 edit 19
18 2000-01-04 15:12:10 300 c401 edit 19
这篇关于Pandas:查找重复编辑(连续) - 所有连续编辑都发生在小时间窗口内的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!