Create a "sessionID" based on "userID" and differences in "timeStamp"

Views: 167 | Published: 2018/8/2 13:42:37 | Tags: r, loops, indexing


Problem description


Sorry, another newbie question. I am trying to take parts of a data frame based on an existing ID or index, and then create a new ID or index column based on the difference in values in a second column.

For example, in the example data below, userID 1 appears to have 2 sessions: one starting at timeStamp 1 and ending at timeStamp 6, and another starting at timeStamp 40 and ending at timeStamp 47. If the difference between two timeStamps is <= 30 (say, minutes), then the two timeStamps are considered to be in the same session. But when the same userID jumps from 6 to 40, that's considered a new session (the difference is > 30). User 2 has only 1 session; user 3 has 3.

Ideally, I'd like to retain the userID information in the sessionIDs; the last 2 columns are examples of desired formats. If it's easier to just make them integers, I can concatenate the userID and sessID later. var1, var2, varN are there just to show that there is other data in the data frame.

I am trying to avoid traditional looping and get R-esque. I took the userID and timeStamp information and created a list split by userID, with one vector of timeStamps per user, from the first userID to the last:

byUser <- with(myDF, split(timeStamp, userID))

Some of the real data look like this:

structure(list(`1` = c(50108, 50108, 50171, 50175, 121316, 121316, 
127228), `2` = c(55145, 745210, 1407020, 2283255),...

Then I used diff to get the difference between the timeStamps in each vector:

myDiff2 <- lapply(byUser, diff)

Some of the real data look like this:

structure(list(`1` = c(0, 63, 4, 71141, 0, 5912), `2` = c(690065, 
661810, 876235), `3` = c(109, 80, 98, 948417, 0),

...now I feel as if I should loop through each list, initialize the sessID, and then increment the sessID whenever the value in myDiff2 is > 1800 seconds (30 mins).

This seemed really long; please tell me how I could have shortened it! Thanks in advance!

   userID timeStamp var1 var2 varN sessID1 sessID2
1       1         1    x    y    N     1.0     1.1
2       1         3    x    y    N     1.0     1.1
3       1         6    x    y    N     1.0     1.1
4       1        40    x    y    N     1.1     1.2
5       1        42    x    y    N     1.1     1.2
6       1        43    x    y    N     1.1     1.2
7       1        47    x    y    N     1.1     1.2
8       2         5    x    y    N     2.0     2.1
9       2         8    x    y    N     2.0     2.1
10      3         2    x    y    N     3.0     3.1
11      3         5    x    y    N     3.0     3.1
12      3        38    x    y    N     3.1     3.2
13      3        39    x    y    N     3.1     3.2
14      3        39    x    y    N     3.1     3.2
15      3        82    x    y    N     3.2     3.3
16      3        83    x    y    N     3.2     3.3
17      3        90    x    y    N     3.2     3.3
18      3        91    x    y    N     3.2     3.3
19      3       102    x    y    N     3.2     3.3

The dput() for the data example is here:

myDF <- structure(list(userID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), timeStamp = c(1L, 3L, 
6L, 40L, 42L, 43L, 47L, 5L, 8L, 2L, 5L, 38L, 39L, 39L, 82L, 83L, 
90L, 91L, 102L), var1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"), 
    var2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "y", class = "factor"), 
    varN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "N", class = "factor"), 
    sessID1 = c(1, 1, 1, 1.1, 1.1, 1.1, 1.1, 2, 2, 3, 3, 3.1, 
    3.1, 3.1, 3.2, 3.2, 3.2, 3.2, 3.2), sessID2 = c(1.1, 1.1, 
    1.1, 1.2, 1.2, 1.2, 1.2, 2.1, 2.1, 3.1, 3.1, 3.2, 3.2, 3.2, 
    3.3, 3.3, 3.3, 3.3, 3.3)), .Names = c("userID", "timeStamp", 
"var1", "var2", "varN", "sessID1", "sessID2"), class = "data.frame", row.names = c(NA, 
-19L))

=== An addendum to the answers below:

For the next newbie:

Picking a '.' (decimal) separator was probably not brilliant on my part: it led to some weirdness and non-unique sessIDs as the sessID counter rolled from 9 to 0.

Change the separator to some other character -- like a hyphen -- and all is well.
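
A small illustration of how a '.' separator can end up with colliding IDs. This is only one plausible reading of the problem, under the assumption (not stated above) that the pasted IDs were at some point converted to numbers; the user and counter values are made up for the example.

```r
# Hypothetical user with 11 sessions, numbered 0 through 10
userID  <- 1L
counter <- 0:10

dotIDs    <- paste(userID, counter, sep = ".")
hyphenIDs <- paste(userID, counter, sep = "-")

# As character strings, the dotted IDs are still all distinct
dotUniqueAsText <- !any(duplicated(dotIDs))

# Assumption: somewhere the IDs get coerced to numeric.
# Then "1.1" and "1.10" become the same number and collide.
dotCollidesAsNumber <- any(duplicated(as.numeric(dotIDs)))

# Hyphenated IDs cannot be read as decimals, so they stay distinct strings
hyphenUnique <- !any(duplicated(hyphenIDs))

print(dotIDs)
print(hyphenIDs)
```

With a hyphen the ID has no numeric interpretation, so this kind of accidental collision cannot happen.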

@rawr and @jlhoward - Thank you both for your quick, correct, and extremely helpful responses: both approaches worked very well. @jlhoward - special thanks for the additional, above-the-call-of-duty explanation. (@rawr was first, so I credited him for the answer.)

There was a small difference in performance between the 2 solutions: data.table is faster but requires some additional upfront transformations of the data.frame to a data.table.

Thanks again, all.

Solution

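library(plyr)

# Process each userID separately; transform() adds the new columns
ddply(myDF, .(userID), transform, 
      # sessID3: userID plus a session counter that starts at 0
      sessID3 = paste(userID, 
                      # flag each gap > 30 with 1, then take the running total;
                      # note: 1:(length(userID) - 1) assumes at least 2 rows per user
                      c(0, cumsum(sapply(1:(length(userID) - 1),
                                         function(x)
                                           ifelse((timeStamp[x + 1] - timeStamp[x]) > 30,
                                                  1, 0)))), sep = '.'),
      # sessID4: the same counter, shifted to start at 1
      sessID4 = paste(userID, 
                      c(0, cumsum(sapply(1:(length(userID) - 1),
                                         function(x)
                                           ifelse((timeStamp[x + 1] - timeStamp[x]) > 30,
                                                  1, 0)))) + 1, sep = '.'))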

Gives me:

#    userID timeStamp var1 var2 varN sessID1 sessID2 sessID3 sessID4
# 1       1         1    x    y    N     1.0     1.1     1.0     1.1
# 2       1         3    x    y    N     1.0     1.1     1.0     1.1
# 3       1         6    x    y    N     1.0     1.1     1.0     1.1
# 4       1        40    x    y    N     1.1     1.2     1.1     1.2
# 5       1        42    x    y    N     1.1     1.2     1.1     1.2
# 6       1        43    x    y    N     1.1     1.2     1.1     1.2
# 7       1        47    x    y    N     1.1     1.2     1.1     1.2
# 8       2         5    x    y    N     2.0     2.1     2.0     2.1
# 9       2         8    x    y    N     2.0     2.1     2.0     2.1
# 10      3         2    x    y    N     3.0     3.1     3.0     3.1
# 11      3         5    x    y    N     3.0     3.1     3.0     3.1
# 12      3        38    x    y    N     3.1     3.2     3.1     3.2
# 13      3        39    x    y    N     3.1     3.2     3.1     3.2
# 14      3        39    x    y    N     3.1     3.2     3.1     3.2
# 15      3        82    x    y    N     3.2     3.3     3.2     3.3
# 16      3        83    x    y    N     3.2     3.3     3.2     3.3
# 17      3        90    x    y    N     3.2     3.3     3.2     3.3
# 18      3        91    x    y    N     3.2     3.3     3.2     3.3
# 19      3       102    x    y    N     3.2     3.3     3.2     3.3
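
For comparison, the same idea can be sketched in base R with diff() and ave(). This is not the answer's code, just an alternative under two assumptions: the rows are already sorted by timeStamp within each user, and a hyphen is used as the separator (as the addendum to the question suggests). The data frame df below is a cut-down copy of the sample data plus a hypothetical user 4 with a single row, to show that case is handled.

```r
# Cut-down sample: users 1 and 2 from the question, plus a
# hypothetical single-row user 4
df <- data.frame(
  userID    = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 4L),
  timeStamp = c(1L, 3L, 6L, 40L, 42L, 43L, 47L, 5L, 8L, 10L)
)

# Within each user: diff() gives the gaps between consecutive timeStamps,
# and cumsum() counts how many gaps larger than 30 have been seen so far.
# The leading 0 gives every user's first row the counter value 0.
sessCount <- ave(df$timeStamp, df$userID,
                 FUN = function(ts) cumsum(c(0L, diff(ts) > 30)))

# Combine userID and the per-user counter into one ID string
df$sessID <- paste(df$userID, sessCount, sep = "-")

print(df)
```

Because diff() returns an empty vector for a single timeStamp, a user with only one row simply gets counter 0 and needs no special handling.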
