Correct way to get the response body of XHR requests generated by a page with RStudio Chromote

Views: 30 Published: 2021/9/24 18:48:05 Tags: r web-scraping headless-browser chrome-devtools-protocol crrri

Question

I'd like to use Chromote to gather the response bodies of the XHR calls made by a website, but I find the API a bit complex to master, especially the async pipeline.

I guess I need to first enable the Network functionality and then load the page (this I can do), but then I need to:

List all the XHR calls
Filter them by recognizing a pattern in the request URL
Access the response body of the selected requests
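
The filtering step can be tried out without a browser. The following is a minimal sketch in base R, assuming each request is available as a list shaped like the `params` of a `Fetch.requestPaused` event (with `requestId`, `resourceType` and `request$url`). The sample records and the helper name `matches_filter` are hypothetical and only serve the illustration.

```r
# Hypothetical records, shaped like the params of a Fetch.requestPaused event
requests <- list(
  list(requestId = "1", resourceType = "XHR",
       request = list(url = "https://example.com/api/items")),
  list(requestId = "2", resourceType = "Image",
       request = list(url = "https://example.com/logo.png")),
  list(requestId = "3", resourceType = "XHR",
       request = list(url = "https://example.com/tracking"))
)

# TRUE when both the URL and the resource type match the given regular expressions
matches_filter <- function(params, url_filter = ".*", type_filter = ".*") {
  grepl(url_filter, params$request$url) && grepl(type_filter, params$resourceType)
}

# Keep only XHR calls whose URL contains "/api/"
selected <- Filter(
  function(p) matches_filter(p, url_filter = "/api/", type_filter = "^XHR$"),
  requests
)
selected_ids <- vapply(selected, function(p) p$requestId, character(1))
print(selected_ids)
```

Both filters are regular expressions, so a catch-all default has to be written as `.*` rather than `*`.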

Can someone provide any guidance or tutorial material on this?

UPDATE: OK, I switched to the crrri package and wrote a general function for the purpose. The only missing part is some logic to decide when to close the connection and return the results:

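# url_filter and type_filter are regular expressions; '.*' matches everything
# (a bare '*' is not a valid regex for str_detect)
get_website_resources <- function(url, url_filter = '.*', type_filter = '.*') {
  library(crrri)
  library(dplyr)
  library(stringr)
  library(jsonlite)
  library(magrittr)

  chrome <- Chrome$new()

  # Environment used as a mutable container that the async callbacks can write to
  out <- new.env()
  out$l <- list()

  client <- chrome$connect(callback = ~ NULL)

  Fetch <- client$Fetch
  Page <- client$Page

  # Pause every request at the response stage so that its body can be read
  Fetch$enable(patterns = list(list(urlPattern="*", requestStage="Response"))) %...>% {
    Fetch$requestPaused(callback = function(params) {

      # && short-circuits and always yields a single TRUE/FALSE
      if (str_detect(params$request$url, url_filter) && str_detect(params$resourceType, type_filter)) {

        Fetch$getResponseBody(requestId = params$requestId) %...>% {
          resp <- .

          if (resp$body != '') {
            # Bodies may arrive base64 encoded; decode them to text
            if (resp$base64Encoded) resp$body <- base64_dec(resp$body) %>% rawToChar()

            body <- list(list(
              url = params$request$url,
              response = resp
            )) %>% set_names(params$requestId)

            str(body)

            out$l <- append(out$l, body)
          }

        }
      }

      # Always release the paused request, whether or not it matched
      Fetch$continueRequest(requestId = params$requestId)
    })
  } %...>% {
    Page$navigate(url)
  }

  # Missing part: this returns immediately, before the async callbacks have run
  out$l
}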
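
The base64 branch in the callback can be checked in isolation. This is a small sketch, assuming a response list with the same `body` and `base64Encoded` fields that the code above reads. The helper name `decode_body` and the sample payload are hypothetical. Note that `rawToChar()` is only suitable for text bodies; binary payloads containing NUL bytes would make it fail.

```r
library(jsonlite)

# Decode the body in place when it is flagged as base64 encoded
decode_body <- function(resp) {
  if (isTRUE(resp$base64Encoded)) {
    resp$body <- rawToChar(base64_dec(resp$body))
    resp$base64Encoded <- FALSE
  }
  resp
}

# Build a fake response by encoding a known text payload
original <- '{"status":"ok"}'
resp <- list(body = base64_enc(charToRaw(original)), base64Encoded = TRUE)

decoded <- decode_body(resp)
print(decoded$body)
```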

Recommended answer

Cracked it. Here is the final function. It uses crrri::perform_with_chrome, which forces synchronous behaviour, and runs the rest of the process inside a promise object. The promise's resolve callback is stored outside the promise itself and is called either when a given number of resources has been collected or when a certain amount of time has passed:

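# url_filter and type_filter are regular expressions; '.*' matches everything
# (a bare '*' is not a valid regex for str_detect)
get_website_resources <- function(url, url_filter = '.*', type_filter = '.*', wait_for = 20, n_of_resources = NULL, interactive = FALSE) {

    library(crrri)
    library(promises)
    library(stringr)   # str_detect
    library(magrittr)  # %>% and set_names

    crrri::perform_with_chrome(function(client) {
        Fetch <- client$Fetch
        Page <- client$Page

        # Open the DevTools inspector when debugging interactively
        if (interactive) client$inspect()

        # Environment shared by the async callbacks
        out <- new.env()

        out$results <- list()
        out$resolve_function <- NULL

        out$pr <- promises::promise(function(resolve, reject) {
            # Keep a handle on resolve so the callbacks below can settle the promise
            out$resolve_function <- resolve

            # Pause every request at the response stage so that its body can be read
            Fetch$enable(patterns = list(list(urlPattern="*", requestStage="Response"))) %...>% {
                Fetch$requestPaused(callback = function(params) {

                    if (str_detect(params$request$url, url_filter) && str_detect(params$resourceType, type_filter)) {

                        Fetch$getResponseBody(requestId = params$requestId) %...>% {
                            resp <- .

                            if (resp$body != '') {
                                # Bodies may arrive base64 encoded; decode them to text
                                if (resp$base64Encoded) resp$body <- jsonlite::base64_dec(resp$body) %>% rawToChar()

                                body <- list(list(
                                    url = params$request$url,
                                    response = resp
                                )) %>% set_names(params$requestId)

                                #str(body)

                                out$results <- append(out$results, body)

                                # Resolve early once enough resources are collected.
                                # && is required: with &, a NULL n_of_resources gives a zero-length condition and an error
                                if (!is.null(n_of_resources) && length(out$results) >= n_of_resources) out$resolve_function(out$results)
                            }

                        }
                    }

                    # Always release the paused request, whether or not it matched
                    Fetch$continueRequest(requestId = params$requestId)
                })
            } %...>% {
                Page$navigate(url)
            } %>% crrri::wait(wait_for) %>%
                # Timeout path: resolve with whatever has been collected after wait_for seconds
                then(~ out$resolve_function(out$results))

        })

        out$pr$then(function(x) x)
    }, timeouts = max(wait_for + 3, 30), cleaning_timeout = max(wait_for + 3, 30))
}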
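
The core idea of the answer, a promise whose resolve callback is kept outside the promise and triggered by either a count or a timer, can be isolated from Chrome entirely. Below is a minimal sketch using only the promises and later packages. The helper `collect_until` and its arguments are hypothetical names, not part of crrri, and the timer uses `later::later()` where the answer uses `crrri::wait()`.

```r
library(promises)
library(later)

# Hypothetical helper: collect items until n have arrived or timeout seconds pass
collect_until <- function(n, timeout) {
  state <- new.env()
  state$items <- list()
  state$resolve <- NULL

  pr <- promise(function(resolve, reject) {
    # Keep a reference so that code outside this function can settle the promise
    state$resolve <- resolve
  })

  # Timeout path: resolve with whatever has been collected so far
  later::later(function() state$resolve(state$items), delay = timeout)

  # Count path: resolve as soon as n items are available
  add_item <- function(item) {
    state$items <- append(state$items, list(item))
    if (length(state$items) >= n) state$resolve(state$items)
  }

  list(promise = pr, add_item = add_item)
}

collector <- collect_until(n = 2, timeout = 2)
res <- new.env()
then(collector$promise, function(x) res$value <- x)

collector$add_item("a")
collector$add_item("b")

# Run the event loop until the promise callback has fired
while (is.null(res$value)) later::run_now(timeoutSecs = 0.1)
print(length(res$value))
```

A promise settles only once, so whichever path fires second is ignored. For the function above, a call might look like get_website_resources('https://example.com', url_filter = 'api', type_filter = 'XHR', wait_for = 10), where the URL and filter values are placeholders.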
