Create a "sessionID" based on "userID" and differences in "timeStamp"
Problem description
Sorry, another newbie question. I am trying to take parts of a data frame based on an existing ID or index, and then create a new ID or index column based on the difference in values in a second column.
For example, in the example data below, userID 1 appears to have 2 sessions: one starting at timeStamp 1 and ending at timeStamp 6, and another starting at timeStamp 40 and ending at timeStamp 47. If the difference between two consecutive timeStamps is <= 30 (say, minutes), the two timeStamps are considered to be in the same session. But when the same userID jumps from 6 to 40, the difference is > 30, so that's considered a new session. User 2 only has 1 session; User 3 has 3.
Ideally, I'd like to retain the userID information in the sessionIDs; the last 2 columns are examples of desired formats. If it's easier to just make them integers, I can concatenate the userID and sessID later. var1, var2, varN are there just to show that there is other data in the data frame.
I am trying to avoid traditional looping and get R-esque. I took the userID and timeStamp information and created a list by userID, with each user's timeStamps as one vector in the list:
byUser <- with(myDF, split(timeStamp, userID))
Some of the real data look like this:
structure(list(`1` = c(50108, 50108, 50171, 50175, 121316, 121316,
127228), `2` = c(55145, 745210, 1407020, 2283255),...
Then I used diff to get the difference between the timeStamps in each vector:
myDiff2 <- lapply(byUser, diff)
Some of the real data look like this:
structure(list(`1` = c(0, 63, 4, 71141, 0, 5912), `2` = c(690065,
661810, 876235), `3` = c(109, 80, 98, 948417, 0),
...
Now I feel as if I should loop through each list, initialize the sessID, and then increment sessID whenever the value in myDiff2 is > 1800 seconds (30 mins).
This seemed really long; please tell me how I could have shortened it! Thanks in advance!
userID timeStamp var1 var2 varN sessID1 sessID2
1 1 1 x y N 1.0 1.1
2 1 3 x y N 1.0 1.1
3 1 6 x y N 1.0 1.1
4 1 40 x y N 1.1 1.2
5 1 42 x y N 1.1 1.2
6 1 43 x y N 1.1 1.2
7 1 47 x y N 1.1 1.2
8 2 5 x y N 2.0 2.1
9 2 8 x y N 2.0 2.1
10 3 2 x y N 3.0 3.1
11 3 5 x y N 3.0 3.1
12 3 38 x y N 3.1 3.2
13 3 39 x y N 3.1 3.2
14 3 39 x y N 3.1 3.2
15 3 82 x y N 3.2 3.3
16 3 83 x y N 3.2 3.3
17 3 90 x y N 3.2 3.3
18 3 91 x y N 3.2 3.3
19 3 102 x y N 3.2 3.3
The dput() for the data example is here:
myDF <- structure(list(userID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), timeStamp = c(1L, 3L,
6L, 40L, 42L, 43L, 47L, 5L, 8L, 2L, 5L, 38L, 39L, 39L, 82L, 83L,
90L, 91L, 102L), var1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
var2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "y", class = "factor"),
varN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "N", class = "factor"),
sessID1 = c(1, 1, 1, 1.1, 1.1, 1.1, 1.1, 2, 2, 3, 3, 3.1,
3.1, 3.1, 3.2, 3.2, 3.2, 3.2, 3.2), sessID2 = c(1.1, 1.1,
1.1, 1.2, 1.2, 1.2, 1.2, 2.1, 2.1, 3.1, 3.1, 3.2, 3.2, 3.2,
3.3, 3.3, 3.3, 3.3, 3.3)), .Names = c("userID", "timeStamp",
"var1", "var2", "varN", "sessID1", "sessID2"), class = "data.frame", row.names = c(NA,
-19L))
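The split/diff plan described above can be finished without an explicit loop. The following is only a sketch under some assumptions: it rebuilds just the userID and timeStamp columns of the sample data, uses a threshold of 30 (stored in a variable named gap, which is not in the original), joins the pieces with a hyphen, and assumes rows are already sorted by timeStamp within each user.

```r
# Stand-in for the userID and timeStamp columns of the sample myDF.
myDF <- data.frame(
  userID    = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
  timeStamp = c(1, 3, 6, 40, 42, 43, 47, 5, 8, 2, 5, 38, 39, 39, 82, 83, 90, 91, 102)
)
gap <- 30  # assumed threshold, in the same units as timeStamp

byUser  <- with(myDF, split(timeStamp, userID))
myDiff2 <- lapply(byUser, diff)

# A difference above the threshold starts a new session, so the running
# count of such differences is the session counter. The leading 0 covers
# each user's first row, which has no previous timeStamp.
sessCount <- lapply(myDiff2, function(d) c(0, cumsum(d > gap)))

# unsplit() puts the per-user vectors back in the original row order.
myDF$sessID <- paste(myDF$userID, unsplit(sessCount, myDF$userID), sep = "-")
```

For the sample data this gives "1-0" for the first three rows of user 1 and "1-1" for the remaining four, matching the pattern in the sessID1 column.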
=== An addendum to the answers below:
For the next newbie:
Picking a '.' / decimal separator was probably not brilliant on my part: it led to some weirdness and non-unique sessIDs as the sessID counter rolled from 9 to 0.
Change the separator to some other character -- like a hyphen -- and all is well.
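One plausible way the separator problem shows up (an assumption, since the real data are not shown) is that dotted IDs stop being unique once they are treated as numbers and a user reaches a tenth session. A small illustration with a made-up user who has 11 sessions:

```r
user    <- rep(1, 11)
counter <- 0:10  # hypothetical session counters for one user

dotID    <- paste(user, counter, sep = ".")
hyphenID <- paste(user, counter, sep = "-")

# As strings both sets are unique, but "1.1" and "1.10" become the same
# number as soon as the dotted IDs are read as numeric.
dotAsNumber <- as.numeric(dotID)
anyDuplicated(dotAsNumber) > 0  # TRUE: 1.1 and 1.10 collide
anyDuplicated(hyphenID) > 0     # FALSE: "1-1" and "1-10" stay distinct
```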
@rawr and @jlhoward - Thank you both for your quick, correct, and extremely helpful responses: both approaches worked very well. @jlhoward - special thanks for the addt'l, above-the-call-of-duty explanation. (@rawr was first, so I credited him for the answer.)
There was a small difference in performance between the 2 solutions: data.table is faster but requires some addt'l upfront transformations of the data.frame to a data.table.
Thanks again, all.
library(plyr)
# Within each userID, a gap of more than 30 between consecutive timeStamps
# starts a new session; cumsum() turns those gaps into a running counter.
# diff() also copes with a user that has only one row, where
# 1:(length(userID) - 1) would count backwards as 1:0.
ddply(myDF, .(userID), transform,
      sessID3 = paste(userID, c(0, cumsum(diff(timeStamp) > 30)), sep = '.'),
      sessID4 = paste(userID, c(0, cumsum(diff(timeStamp) > 30)) + 1, sep = '.'))
Gives me:
# userID timeStamp var1 var2 varN sessID1 sessID2 sessID3 sessID4
# 1 1 1 x y N 1.0 1.1 1.0 1.1
# 2 1 3 x y N 1.0 1.1 1.0 1.1
# 3 1 6 x y N 1.0 1.1 1.0 1.1
# 4 1 40 x y N 1.1 1.2 1.1 1.2
# 5 1 42 x y N 1.1 1.2 1.1 1.2
# 6 1 43 x y N 1.1 1.2 1.1 1.2
# 7 1 47 x y N 1.1 1.2 1.1 1.2
# 8 2 5 x y N 2.0 2.1 2.0 2.1
# 9 2 8 x y N 2.0 2.1 2.0 2.1
# 10 3 2 x y N 3.0 3.1 3.0 3.1
# 11 3 5 x y N 3.0 3.1 3.0 3.1
# 12 3 38 x y N 3.1 3.2 3.1 3.2
# 13 3 39 x y N 3.1 3.2 3.1 3.2
# 14 3 39 x y N 3.1 3.2 3.1 3.2
# 15 3 82 x y N 3.2 3.3 3.2 3.3
# 16 3 83 x y N 3.2 3.3 3.2 3.3
# 17 3 90 x y N 3.2 3.3 3.2 3.3
# 18 3 91 x y N 3.2 3.3 3.2 3.3
# 19 3 102 x y N 3.2 3.3 3.2 3.3
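The approach above compares neighbouring rows, so it relies on the rows already being sorted by timeStamp within each userID. If that cannot be taken for granted, a base R variant along the following lines might help. It is a sketch only: the shuffled data are made up, gap is an assumed name for the threshold, and a hyphen is used as the separator.

```r
# Deliberately shuffled stand-in data (hypothetical values).
df <- data.frame(
  userID    = c(2, 1, 1, 2, 1, 1),
  timeStamp = c(8, 40, 1, 5, 6, 47)
)
gap <- 30  # assumed threshold, in the same units as timeStamp

# Consecutive differences only make sense on sorted rows,
# so order by user and then by time first.
df <- df[order(df$userID, df$timeStamp), ]

# ave() applies the function within each userID and returns a vector
# aligned with the rows of df.
sessCount <- ave(df$timeStamp, df$userID,
                 FUN = function(t) c(0, cumsum(diff(t) > gap)))
df$sessID <- paste(df$userID, sessCount, sep = "-")
```

Here user 1 ends up with sessions "1-0" (timeStamps 1 and 6) and "1-1" (40 and 47), and user 2 with the single session "2-0".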