R 3.4.1 - Intelligent use of while loop for RSiteCatalyst enqueued reports

Problem description

I have been using the RSiteCatalyst package for a while now. For those who do not know it, it makes the process of obtaining data from Adobe Analytics via the API much easier.

Until now, the workflow was as follows:

  1. Make a request, for example:

    key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                   metrics = c("pageviews"), date.granularity = "month",
                   max.attempts = 500, interval.seconds = 20) 

  2. Wait for the response, which will be saved as a data.frame (example structure):

> View(head(key_metrics,1)) 
    datetime      name         year   month   day    pageviews 
  1 2015-07-01    July 2015    2015   7       1      45825

  3. Do some data transformation, for example:

    key_metrics$datetime <- as.Date(key_metrics$datetime)

    The problem with this workflow is that sometimes (because of the request's complexity) we can wait a long time until the response finally comes. If the R script contains 40-50 API requests that are equally complex, that means we will be waiting 40-50 times until the data finally arrives before we can make a new request. This clearly creates a bottleneck in my ETL process.

    There is, however, a parameter enqueueOnly in most of the package's functions that tells Adobe to process the request in the background while immediately returning a report ID as the response:

    key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                   metrics = c("pageviews"), date.granularity = "month",
                   max.attempts = 500, interval.seconds = 20,
                   enqueueOnly = TRUE)
    
    > key_metrics
    [1] 1154642436 
    

    I can obtain the "real" response (the one with data) at any time by using the following function:

    key_metrics <- GetReport(key_metrics)
    

    In each request I add the parameter enqueueOnly = TRUE while building up a list of report IDs and report names (a fuller end-to-end sketch follows the snippet below):

    queueFromIds <- c(queueFromIds, key_metrics)
    queueFromNames <- c(queueFromNames, "key_metrics")
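
    Putting those pieces together, a minimal sketch (an assumption for illustration, not code from the original post; the second report name and its metric are invented) of the enqueue-only collection loop might look like this:

    # Illustrative sketch: enqueue several reports without waiting for them,
    # collecting the report IDs and names for later retrieval.
    # clientId, dateFrom4 and dateTo are assumed to exist as in the snippets above.
    queueFromIds <- c()
    queueFromNames <- c()

    reportDefinitions <- list(
      key_metrics  = c("pageviews"),
      visit_counts = c("visits")   # hypothetical second report
    )

    for (reportName in names(reportDefinitions)) {
      reportId <- QueueOvertime(clientId, dateFrom4, dateTo,
                                metrics = reportDefinitions[[reportName]],
                                date.granularity = "month",
                                enqueueOnly = TRUE)
      queueFromIds <- c(queueFromIds, reportId)
      queueFromNames <- c(queueFromNames, reportName)
    }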
    

    The most important difference with this approach is that all my requests are processed by Adobe at the same time, and therefore the waiting time is considerably decreased.

    However, I am having problems obtaining the data efficiently. I am trying a while loop that removes the report ID and report name from the previous vectors once the data is obtained:

    while (length(queueFromNames)>0)
    {
      assign(queueFromNames[1], GetReport(queueFromIds[1],
                                          max.attempts = 3,
                                          interval.seconds = 5))
      queueFromNames <- queueFromNames[-1]
      queueFromIds <- queueFromIds[-1]
    }
    

    However, this only works as long as the requests are simple enough to be processed in seconds. When a request is complex enough not to be processed in 3 attempts with an interval of 5 seconds, the loop stops with the following error:

    Error in ApiRequest(body = toJSON(request.body), func.name = "Report.Get", : ERROR: max attempts exceeded for https://api3.omniture.com/admin/1.4/rest/?method=Report.Get

    Which functions could help me ensure that all the API requests are processed correctly and, ideally, that API requests needing extra time (the ones that raise an error) are skipped until the end of the loop, when they are requested again?
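
    For illustration only (this sketch is an assumption and not part of the original question or answer), the while loop above could wrap GetReport in tryCatch and push reports that are not ready yet to the back of the queue:

    # Sketch: retry queue that skips reports which are not ready yet
    # and re-requests them at the end.
    while (length(queueFromNames) > 0) {
      report <- tryCatch(
        GetReport(queueFromIds[1], max.attempts = 3, interval.seconds = 5),
        error = function(e) NULL   # e.g. "max attempts exceeded": not ready yet
      )

      if (!is.null(report)) {
        # Report is ready: store it and drop it from the queue.
        assign(queueFromNames[1], report)
        queueFromNames <- queueFromNames[-1]
        queueFromIds <- queueFromIds[-1]
      } else {
        # Report not ready: move it to the end of the queue and retry later.
        queueFromNames <- c(queueFromNames[-1], queueFromNames[1])
        queueFromIds <- c(queueFromIds[-1], queueFromIds[1])
      }
    }

    Note that this sketch can loop forever if a report never finishes, so in practice a retry counter or an overall timeout would be needed.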

    Recommended answer

    I use a couple of functions to generate/retrieve the report IDs independently. This way, it doesn't matter how long it takes for the reports to be processed. I usually come back for them 12 hours after the report IDs were generated. I think they expire after 48 hours or so. These functions rely on RSiteCatalyst, of course. Here are the functions:

    #' Generate report IDs to be retrieved later
    #'
    #' @description This function works in tandem with other functions to programmatically extract big datasets from Adobe Analytics.
    #' @param suite Report suite ID.
    #' @param dateBegin Start date in the following format: YYYY-MM-DD.
    #' @param dateFinish End date in the following format: YYYY-MM-DD.
    #' @param metrics Vector containing up to 30 required metric IDs.
    #' @param elements Vector containing element IDs.
    #' @param classification Vector containing classification IDs.
    #' @param valueStart Integer value pointing to the row to start the report with.
    #' @return A data frame containing all the report IDs per day. They are required to obtain all trended reports during the specified time frame.
    #' @examples
    #' \dontrun{
    #' ReportsIDs <- reportsGenerator(suite, dateBegin, dateFinish, metrics, elements, classification)
    #' }
    #' @export
    reportsGenerator <- function(suite,
                                 dateBegin,
                                 dateFinish,
                                 metrics,
                                 elements,
                                 classification,
                                 valueStart) {

      #Convert dates to Date format.
      #Deduct one from dateBegin to
      #neutralize the initial +1 in the loop.
      dateBegin <- as.Date(dateBegin, "%Y-%m-%d") - 1
      dateFinish <- as.Date(dateFinish, "%Y-%m-%d")
      timeRange <- dateFinish - dateBegin

      #Create a data frame to store the dates and report IDs.
      VisitorActivityReports <-
        data.frame(matrix(NA, nrow = timeRange, ncol = 2))
      names(VisitorActivityReports) <- c("Date", "ReportID")

      #Run a loop to retrieve one ReportID for each day in the time period.
      for (i in 1:timeRange) {
        dailyDate <- as.character(dateBegin + i)
        print(i)         #Visibility to the end user
        print(dailyDate) #Visibility to the end user
        VisitorActivityReports[i, 1] <- dailyDate

        VisitorActivityReports[i, 2] <-
          RSiteCatalyst::QueueTrended(
            reportsuite.id = suite,
            date.from = dailyDate,
            date.to = dailyDate,
            metrics = metrics,
            elements = elements,
            classification = classification,
            top = 50000,
            max.attempts = 500,
            start = valueStart,
            enqueueOnly = TRUE
          )
      }
      return(VisitorActivityReports)
    }
    

    You should assign the output of the previous function to a variable and then use that variable as the input of the following function. Also assign the result of reportsRetriever to a variable; the output will be a data frame. The function will rbind all the reports together as long as they share the same structure; don't try to concatenate reports with different structures. (A short usage sketch follows the function definition below.)

    #' Retrieve all reports stored as output of the reportsGenerator function and consolidate them.
    #'
    #' @param dataFrameReports This is the output from the reportsGenerator function. It MUST contain a column titled: ReportID
    #' @details It is recommended to break the input data frame into chunks of 50 rows in order to prevent memory issues if the reports are too large. Otherwise the server or the local computer might run out of memory.
    #' @return A data frame containing all the consolidated reports defined by the reportsGenerator function.
    #' @examples
    #' \dontrun{
    #' visitorActivity <- reportsRetriever(dataFrameReports)
    #' }
    #'
    #' @export
    reportsRetriever <- function(dataFrameReports) {

      #Retrieve each report; if one fails (for example, because it has not been
      #processed yet), keep NULL for it instead of stopping the whole loop.
      visitor.activity.list <- lapply(dataFrameReports$ReportID, function(reportId) {
        tryCatch(RSiteCatalyst::GetReport(reportId), error = function(e) NULL)
      })
      visitor.activity.df <- as.data.frame(do.call(rbind, visitor.activity.list))

      #Validate report integrity
      if (identical(as.character(unique(visitor.activity.df$datetime)), dataFrameReports$Date)) {
        print("Ok. All reports available")
        return(visitor.activity.df)
      } else {
        print("Some reports may have been missed.")
        #Identify the requested dates that are absent from the retrieved reports.
        missingReportsIndex <- !(dataFrameReports$Date %in% as.character(unique(visitor.activity.df$datetime)))
        print(dataFrameReports$Date[missingReportsIndex])

        return(visitor.activity.df)
      }
    }
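
    As a usage illustration (the report suite ID, credentials, dates, metrics and elements below are placeholders, not values from the original answer), the two functions might be chained like this:

    # Placeholders only: substitute your own credentials and report definition.
    library(RSiteCatalyst)
    SCAuth("API_USERNAME", "API_SECRET")

    reportIds <- reportsGenerator(suite = "myreportsuite",
                                  dateBegin = "2017-01-01",
                                  dateFinish = "2017-01-31",
                                  metrics = c("pageviews"),
                                  elements = c("page"),
                                  classification = character(0),
                                  valueStart = 1)

    # Come back later (for example, several hours afterwards) and consolidate:
    visitorActivity <- reportsRetriever(reportIds)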
    
