R 3.4.1 - Intelligent use of while loop for RSiteCatalyst enqueued reports


Problem description

I have been using the RSiteCatalyst package for a while now. For those who do not know it, it makes obtaining data from Adobe Analytics over the API easier.

Until now, the workflow was as follows:


  1. Make a request, for example:


    key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                   metrics = c("pageviews"), date.granularity = "month",
                   max.attempts = 500, interval.seconds = 20) 



  2. Wait for the response, which is saved as a data.frame (example structure):

    > View(head(key_metrics, 1))
        datetime      name         year   month   day    pageviews
      1 2015-07-01    July 2015    2015   7       1      45825



  3. Do some data transformation, for example:

    key_metrics$datetime <- as.Date(key_metrics$datetime)

The problem with this workflow is that, depending on the complexity of the request, the response can take a long time to arrive. If the R script contains 40-50 API requests of similar complexity, we wait 40-50 times, once per request, before the next request can be made. This is clearly a bottleneck in my ETL process.

However, most functions in the package have a parameter enqueueOnly, which tells Adobe to process the request and return a report ID immediately as the response:

    key_metrics <- QueueOvertime(clientId, dateFrom4, dateTo,
                   metrics = c("pageviews"), date.granularity = "month",
                   max.attempts = 500, interval.seconds = 20,
                   enqueueOnly = TRUE)
    
    > key_metrics
    [1] 1154642436 
    

I can then obtain the "real" response (the one with data) at any time by using the following function:

    key_metrics <- GetReport(key_metrics)
    

To each request I add the parameter enqueueOnly = TRUE, and I build up a vector of report IDs and a vector of report names:

    queueFromIds <- c(queueFromIds, key_metrics)
    queueFromNames <- c(queueFromNames, "key_metrics")
    

The most important difference with this approach is that Adobe processes all my requests at the same time, so the waiting time is considerably reduced.

I am having problems, however, retrieving the data efficiently. I am trying a while loop that removes the report ID and report name from the vectors above once the data is obtained:

    while (length(queueFromNames)>0)
    {
      assign(queueFromNames[1], GetReport(queueFromIds[1],
                                          max.attempts = 3,
                                          interval.seconds = 5))
      queueFromNames <- queueFromNames[-1]
      queueFromIds <- queueFromIds[-1]
    }
    

However, this only works as long as the requests are simple enough to be processed within seconds. When a request is too complex to be processed within 3 attempts at 5-second intervals, the loop stops with the following error:


    Error in ApiRequest(body = toJSON(request.body), func.name = "Report.Get", :
      ERROR: max attempts exceeded for https://api3.omniture.com/admin/1.4/rest/?method=Report.Get

Which functions could help me make sure that all the API requests are processed correctly and, in the best case, that requests needing extra time (those that generate an error) are skipped until the end of the loop and then requested again?

Recommended answer

I use a couple of functions to generate and retrieve the report IDs independently. This way, it doesn't matter how long the reports take to be processed. I usually come back for them 12 hours after the report IDs were generated. I think they expire after 48 hours or so. These functions rely on RSiteCatalyst, of course. Here are the functions:

    #' Generate report IDs to be retrieved later
    #'
    #' @description This function works in tandem with other functions to programmatically extract big datasets from Adobe Analytics.
    #' @param suite Report suite ID.
    #' @param dateBegin Start date in the following format: YYYY-MM-DD.
    #' @param dateFinish End date in the following format: YYYY-MM-DD.
    #' @param metrics Vector containing up to 30 required metrics IDs.
    #' @param elements Vector containing element IDs.
    #' @param classification Vector containing classification IDs.
    #' @param valueStart Integer value pointing to row to start report with.
    #' @return A data frame containing all the report IDs per day. They are required to obtain all trended reports during the specified time frame.
    #' @examples
    #' \dontrun{
    #' ReportsIDs <- reportsGenerator(suite, dateBegin, dateFinish, metrics, elements, classification, valueStart)
    #' }
    #' @export
        reportsGenerator <- function(suite,
                                     dateBegin,
                                     dateFinish,
                                     metrics,
                                     elements,
                                     classification,
                                     valueStart) {
    
          #Convert dates to date format.
          #Deduct one from dateBegin to
          #neutralize the initial +1 in the loop.
    
          dateBegin <-  as.Date(dateBegin, "%Y-%m-%d") - 1
          dateFinish <-  as.Date(dateFinish, "%Y-%m-%d")
          #Number of days in the period, as a plain integer rather than a difftime
          timeRange <- as.integer(dateFinish - dateBegin)
    
          #Create data frame to store dates and report IDs
          VisitorActivityReports <-
            data.frame(matrix(NA, nrow = timeRange, ncol = 2))
          names(VisitorActivityReports) <- c("Date", "ReportID")
    
          #Run a loop to retrieve one ReportID for each day in the time period.
          for (i in seq_len(as.integer(timeRange))) {
            dailyDate <- as.character(dateBegin + i)
            print(i) #Visibility to end user
            print(dailyDate) #Visibility to end user
            VisitorActivityReports[i, 1] <- dailyDate
    
    
            VisitorActivityReports[i, 2] <-
              RSiteCatalyst::QueueTrended(
                reportsuite.id = suite,
                date.from = dailyDate,
                date.to = dailyDate,
                metrics = metrics,
                elements = elements,
                classification = classification,
                top = 50000,
                max.attempts = 500,
                start = valueStart,
                enqueueOnly = TRUE
              )
          }
          return(VisitorActivityReports)
        }
    

Assign the output of the previous function to a variable, then use that variable as the input of the following function. Also assign the result of reportsRetriever to a variable. The output will be a data frame. The function will rbind all the reports together as long as they all share the same structure. Don't try to concatenate reports with different structures.
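Before moving on to reportsRetriever: the same-structure requirement can be checked explicitly before binding. The following is a minimal base R sketch, not part of the original answer. The helper name bindSameStructure and the sample data frames are hypothetical, and it assumes "same structure" means identical column names in the same order.

```r
# Sketch only: `bindSameStructure` is a hypothetical helper, not part of RSiteCatalyst.
# It binds the reports whose column names match those of the first report.
bindSameStructure <- function(reports) {
  reports <- Filter(Negate(is.null), reports)   # drop retrievals that returned NULL
  if (length(reports) == 0) return(data.frame())
  reference <- names(reports[[1]])
  same <- vapply(reports, function(r) identical(names(r), reference), logical(1))
  if (!all(same)) {
    warning(sum(!same), " report(s) skipped: columns differ from the first report")
  }
  do.call(rbind, reports[same])
}

# Made-up sample data: the third report has a different column and is skipped.
day1 <- data.frame(datetime = "2015-07-01", pageviews = 100)
day2 <- data.frame(datetime = "2015-07-02", pageviews = 150)
odd  <- data.frame(datetime = "2015-07-03", visits = 30)
combined <- suppressWarnings(bindSameStructure(list(day1, day2, odd)))
```

Skipping mismatched reports with a warning is one possible design choice; stopping with an error would be equally reasonable if a mismatch should never happen.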

    #' Retrieve all reports stored as output of reportsGenerator function and consolidate them.
    #'
    #' @param dataFrameReports This is the output from reportsGenerator function. It MUST contain a column titled: ReportID
    #' @details It is recommended to break the input data frame in chunks of 50 rows in order to prevent memory issues if the reports are too large. Otherwise the server or local computer might run out of memory.
    #' @return A data frame containing all the consolidated reports defined by the reportsGenerator function.
    #' @examples
    #' \dontrun{
    #' visitorActivity <- reportsRetriever(dataFrameReports)
    #'}
    #'
    #' @export    
    
    reportsRetriever <- function(dataFrameReports) {
    
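          #Wrap each call so that one failed report returns NULL instead of stopping lapply
          visitor.activity.list <- lapply(dataFrameReports$ReportID, function(id) {
            tryCatch(RSiteCatalyst::GetReport(id), error = function(e) NULL)
          })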
          visitor.activity.df <- as.data.frame(do.call(rbind, visitor.activity.list))
    
          #Validate report integrity
    
          if (identical(as.character(unique(visitor.activity.df$datetime)), dataFrameReports$Date)) {
            print("Ok. All reports available")
            return(visitor.activity.df)
          } else {
            print("Some reports may have been missed.")
            #Dates that were requested but are absent from the retrieved data
            missingDates <- setdiff(dataFrameReports$Date, as.character(unique(visitor.activity.df$datetime)))
            print(missingDates)
    
            return(visitor.activity.df)
          }
    
        }
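
The skip-and-retry behaviour asked about in the question can also be sketched directly. This is a hedged illustration, not part of the original answer. The names retrieveQueued, fetch, max.rounds and wait.seconds are hypothetical, and fakeFetch is a stand-in for GetReport so that the example runs without an Adobe Analytics connection. The idea is to turn each error into NULL with tryCatch, collect the failed reports, and request them again in a later round.

```r
# Sketch only: `retrieveQueued`, `fetch`, `max.rounds` and `wait.seconds`
# are hypothetical names, not part of RSiteCatalyst.
retrieveQueued <- function(ids, report.names, fetch,
                           max.rounds = 10, wait.seconds = 0) {
  results <- list()
  pending <- seq_along(ids)   # positions of reports not yet retrieved
  rounds <- 0
  while (length(pending) > 0 && rounds < max.rounds) {
    rounds <- rounds + 1
    failed <- integer(0)
    for (i in pending) {
      # An error (e.g. report not ready) becomes NULL instead of stopping the loop
      report <- tryCatch(fetch(ids[i]), error = function(e) NULL)
      if (is.null(report)) {
        failed <- c(failed, i)               # defer to the next round
      } else {
        results[[report.names[i]]] <- report
      }
    }
    pending <- failed
    if (length(pending) > 0 && wait.seconds > 0) Sys.sleep(wait.seconds)
  }
  list(results = results, unfinished = report.names[pending])
}

# Stand-in for GetReport so the sketch runs offline:
# report 2 fails on its first two calls, then succeeds.
calls <- new.env()
fakeFetch <- function(id) {
  key <- as.character(id)
  n <- if (is.null(calls[[key]])) 1 else calls[[key]] + 1
  assign(key, n, envir = calls)
  if (id == 2 && n < 3) stop("report not ready")
  data.frame(id = id, pageviews = id * 100)
}

demo <- retrieveQueued(c(1, 2, 3), c("first", "second", "third"), fakeFetch)

# With the real package this might be called as follows (untested assumption):
# reports <- retrieveQueued(queueFromIds, queueFromNames,
#   fetch = function(id) RSiteCatalyst::GetReport(id, max.attempts = 3,
#                                                 interval.seconds = 5),
#   wait.seconds = 60)
```

Unlike the assign() call in the question, this sketch returns the reports in a named list, and anything still unavailable after max.rounds is listed in unfinished rather than raising an error.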
    
