Creating user sessions with fast computation


Problem description

I have a data frame with three columns: "uuid" (class factor), "created_at" (class POSIXct) and "trainer_item_id" (class factor), and I created a fourth column named "Sessions". The column Sessions numbers the time sessions of each uuid, ordered by time, such that the time difference between any two consecutive events in a session is at most one hour (3600 seconds).

I created the column Sessions with a "for" loop. The problem is that I have more than a million observations, and it takes 8 hours to create Sessions. Is there an easier and faster way to create it than my code below? Thanks in advance for your help!

Here is a sample of the original dataset --> https://gist.github.com/einsiol/5b4e633ce69d3a8e43252f383231e4b8

Here is my code -->

    library(dplyr)

    # Order by user, then by timestamp (created_at), so that each
    # user's events are contiguous and in chronological order
    trial <- arrange(as_tibble(trial), uuid, created_at)

    n <- nrow(trial)
    time <- trial$created_at

    # Time difference in seconds between consecutive rows
    tdiff <- numeric(n - 1)

    # Session number of every row; the first row opens session 1
    trial$Sessions <- integer(n)
    count <- 1L
    trial$Sessions[1] <- count

    for (i in seq_len(n - 1)) {

        tdiff[i] <- as.numeric(difftime(time[i + 1], time[i], units = "secs"))

        if (trial$uuid[i + 1] == trial$uuid[i]) {
            # Same user ID: a gap of more than one hour starts a new session
            if (tdiff[i] > 3600) {
                count <- count + 1L
            }
        } else {
            # Different user ID: session numbering restarts at 1
            count <- 1L
        }

        trial$Sessions[i + 1] <- count
    }

UPDATE: I have found an answer to my question, a fast alternative to this code, which you can find below!

Solution

I have found a very effective and fast way to do it using vectorized computation. It took me 30 seconds to run the code (instead of 5 hours on average!)

    # Base R only; no extra packages are needed for these steps

    # Converting the timestamps first, so that ordering and differences are chronological
    LID$created_at <- as.POSIXct(as.character(LID$created_at))

    # Ordering by uuid and created_at
    LID <- LID[order(LID$uuid, LID$created_at), ]

    # Computing the time difference (sec) between the current and the previous row;
    # the first row has no previous row, so it gets a value above the threshold
    LID$diff <- c(9999, diff(as.numeric(LID$created_at)))

    # Rows that are immediately followed by a different uuid
    w <- which(LID$uuid[-1] != LID$uuid[-nrow(LID)])

    # Forcing a session change on the first row of each new uuid
    LID$diff[w + 1] <- 9999

    # Identifying session changes: gaps greater than 3600 sec (1 hour)
    LID$chg_session <- as.numeric(LID$diff > 3600)

    # Cumulating the changes gives the session id
    # (cumsum(x) returns the same values as diffinv(x)[-1])
    LID$idsession <- cumsum(LID$chg_session)
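
The steps above can be checked on a small example. The following is only a sketch under assumptions: the five rows are made up for illustration, the names gap, first_of_user and new_session are hypothetical helpers, and Inf is used as the "no previous event" marker in place of 9999. It also adds a per-user column (called Sessions here, as in the question) that restarts at 1 for every uuid, which the idsession column does not do.

```r
# Toy data, made up for illustration: two users, timestamps in UTC
LID <- data.frame(
  uuid = c("a", "a", "a", "b", "b"),
  created_at = as.POSIXct(
    c("2017-01-01 10:00:00", "2017-01-01 10:30:00", "2017-01-01 12:00:00",
      "2017-01-01 10:10:00", "2017-01-01 10:20:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Order by user, then by time
LID <- LID[order(LID$uuid, LID$created_at), ]

# Gap in seconds to the previous row; Inf marks "no previous event"
gap <- c(Inf, diff(as.numeric(LID$created_at)))

# The first row of every uuid has no previous event for that user
first_of_user <- c(TRUE, LID$uuid[-1] != LID$uuid[-nrow(LID)])
gap[first_of_user] <- Inf

# A session starts wherever the gap exceeds one hour
new_session <- gap > 3600

# Global id: running count of session starts over the whole table
LID$idsession <- cumsum(new_session)

# Per-user id: the same running count, restarted within each uuid
LID$Sessions <- ave(as.integer(new_session), LID$uuid, FUN = cumsum)

print(LID)
```

For this toy data, idsession comes out as 1, 1, 2, 3, 3 and Sessions as 1, 1, 2, 1, 1: user "a" gets a second session because of the 90-minute gap before 12:00, and user "b" starts again at 1. Working on as.numeric(created_at) keeps every gap in seconds, whereas subtracting POSIXct values directly lets R pick the units automatically.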
