Create a Session ID from User ID and time differences
Question
I have a similar question to this (Create a "sessionID" based on "userID" and differences in "timeStamp") on creating a 'Session ID'; though my specifications are slightly different. Perhaps the solution is still apparent in this post but I could not apply it to my needs -- pointing out how the original solution satisfies my question would be equivalent.
My data.table looks like this (dput available below):
unique_visitor_id datetime
100 2016-07-25 15:43:02
100 2016-08-15 15:35:16
101 2016-08-01 21:24:46
101 2016-08-13 05:32:27
101 2016-08-13 05:33:01
101 2016-08-13 05:33:37
101 2016-08-13 05:34:04
101 2016-08-13 05:37:42
101 2016-08-13 05:38:20
102 2016-09-15 17:28:00
102 2016-09-15 17:31:04
103 2016-07-18 21:19:07
NB: datetime was converted with ymd_hms(datetime).
What I'd like is a new variable identifying the session, which is a simple integer sequence (it does not need to incorporate the visitorID, like the original question). A session is defined per visitor: consecutive records belong to the same session as long as they are <= 30m apart AND within the same day. So for example, the first two rows would be two different sessions: though it's the same visitor, the difference in time is >30m.
The desired output from the above data would be:
unique_visitor_id datetime session_id
100 2016-07-25 15:43:02 1
100 2016-08-15 15:35:16 2
101 2016-08-01 21:24:46 3
101 2016-08-13 05:32:27 4
101 2016-08-13 05:33:01 4
101 2016-08-13 05:33:37 4
101 2016-08-13 05:34:04 4
101 2016-08-13 05:37:42 4
101 2016-08-13 05:38:20 4
102 2016-09-15 17:28:00 5
102 2016-09-15 17:31:04 5
103 2016-07-18 21:19:07 6
If this can be done in a data.table way, that would be desirable. Again, apologies if I am missing something from the original question's solution!
Here is the dput of the example data.table:
myDT <- structure(list(unique_visitor_id = c(100L, 100L, 101L,
101L, 101L, 101L, 101L, 101L, 101L, 102L, 102L, 103L),
datetime = structure(c(1469475782, 1471289716, 1470101086, 1471080747, 1471080781,
1471080817, 1471080844, 1471081062, 1471081100, 1473974880,
1473975064, 1468891147),
tzone = "EST5EDT", class = c("POSIXct", "POSIXt"))),
.Names = c("unique_visitor_id", "datetime"),
sorted = c("unique_visitor_id", "datetime"),
class = c("data.table", "data.frame"),
row.names = c(NA, -12L))
Recommended answer
Assuming your data frame is originally sorted by visitor id and datetime, you can use cumsum() on the condition vector which is TRUE where a new session_id should appear:
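# New session when the visitor changes or the gap exceeds 30 minutes; as.numeric() pins the gap to seconds
myDT[, session_id := cumsum(c(TRUE, diff(unique_visitor_id) != 0 | diff(as.numeric(datetime)) > 30 * 60))][]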
# unique_visitor_id datetime session_id
# 1: 100 2016-07-25 15:43:02 1
# 2: 100 2016-08-15 15:35:16 2
# 3: 101 2016-08-01 21:24:46 3
# 4: 101 2016-08-13 05:32:27 4
# 5: 101 2016-08-13 05:33:01 4
# 6: 101 2016-08-13 05:33:37 4
# 7: 101 2016-08-13 05:34:04 4
# 8: 101 2016-08-13 05:37:42 4
# 9: 101 2016-08-13 05:38:20 4
#10: 102 2016-09-15 17:28:00 5
#11: 102 2016-09-15 17:31:04 5
#12: 103 2016-07-18 21:19:07 6
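The condition above covers the visitor change and the 30-minute gap, but not the "within the same day" part of the question. Below is a minimal sketch of one way to add it, not part of the original answer. The sample rows are made up for illustration, and the helper names new_visitor, long_gap, day and new_day are hypothetical. It assumes the calendar day should be taken in the column's own time zone. Because diff() on a POSIXct column picks its units automatically, the sketch compares raw seconds instead.

```r
library(data.table)

# Made-up sample (not the asker's data): visitor 1 has two hits only
# 2 minutes apart that fall on either side of midnight.
dt <- data.table(
  unique_visitor_id = c(1L, 1L, 1L, 2L),
  datetime = as.POSIXct(
    c("2016-08-13 23:50:00", "2016-08-13 23:59:00",
      "2016-08-14 00:01:00", "2016-08-14 00:02:00"),
    tz = "EST5EDT"
  )
)
setkey(dt, unique_visitor_id, datetime)  # the cumsum() approach needs sorted rows

dt[, session_id := {
  new_visitor <- diff(unique_visitor_id) != 0
  # as.numeric() yields seconds since the epoch, so the gap is always in seconds
  long_gap <- diff(as.numeric(datetime)) > 30 * 60
  # format() renders the calendar day in the column's own time zone
  day <- format(datetime, "%Y-%m-%d")
  new_day <- day[-1L] != day[-length(day)]
  # TRUE marks the first row of each session; cumsum() turns the marks into ids
  cumsum(c(TRUE, new_visitor | long_gap | new_day))
}]

dt$session_id
# [1] 1 1 2 3
```

The third row starts session 2 even though it is only 2 minutes after the second row, because it falls on a different calendar day. If sessions should be allowed to span midnight, dropping new_day from the condition gives back the two-condition version.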