Scrape site that asks for cookies consent with rvest


Question

I'd like to scrape (using rvest) a website that asks users to consent to setting cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:

library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c") 
content %>% html_text()

The result seems to be the content of the popup window asking for consent.

Is there a way to ignore or accept the popup, or to set a cookie in advance, so I can access the main text of the site?

Recommended answer

As suggested, the website is dynamic, which means its content is built by JavaScript. Reconstructing how this is done from the .js files is usually very time-consuming (or outright impossible), but in this case the "network analysis" function of your browser shows that a non-hidden API serves the information you want: the request to api.karriere.nrw.

Hence you can take the UUID (the identifier in the database) from your URL and make a simple GET request to the API, going straight to the source without rendering the page through RSelenium, which costs extra time and resources.

Be friendly though, and send them some way to contact you, so they can tell you to stop.

library(tidyverse)
library(httr)
library(rvest)
library(jsonlite)
headers <- c("Email" = "johndoe@company.com")

### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"

### extract identifier of job posting
uuid <- str_split(url,"/")[[1]][5]

### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/",uuid)

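### get results (fail early on HTTP errors, then parse the JSON body)
response <- httr::GET(api_url,
                    httr::add_headers(.headers = headers))
httr::stop_for_status(response)
result <- httr::content(response, as = "text", encoding = "UTF-8") %>% jsonlite::fromJSON()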
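
If you need to do this for more than one posting, the steps above could be wrapped into small helpers. The following is only a sketch: the function names (extract_uuid, build_api_url, fetch_posting), the regular expression, and the one-second pause are assumptions made for illustration and are not part of the original answer. It also assumes that every posting URL contains a standard UUID and that the API path stays the same as above.

```r
# Hypothetical helpers; the names are illustrative, not from the original answer.

# Pattern for a standard 8-4-4-4-12 hexadecimal UUID
uuid_pattern <- "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"

# Pull the first UUID-shaped substring out of a single URL,
# instead of relying on its position after splitting on "/"
extract_uuid <- function(url) {
  hit <- regmatches(url, regexpr(uuid_pattern, url))
  if (length(hit) == 0) stop("no UUID found in URL: ", url)
  hit
}

# Build the API address used in the answer above
build_api_url <- function(uuid) {
  paste0("https://api.karriere.nrw/v1.0/stellenausschreibungen/", uuid)
}

# Fetch one posting; `contact` and `pause` are assumed defaults
fetch_posting <- function(url, contact = "johndoe@company.com", pause = 1) {
  api_url <- build_api_url(extract_uuid(url))
  response <- httr::GET(api_url, httr::add_headers(Email = contact))
  httr::stop_for_status(response)  # stop on 4xx/5xx responses
  Sys.sleep(pause)                 # wait before the caller sends the next request
  jsonlite::fromJSON(httr::content(response, as = "text", encoding = "UTF-8"))
}
```

Calling fetch_posting(url) with the url defined earlier should then return the same parsed list as result, and lapply(urls, fetch_posting) would handle a vector of posting URLs one request at a time.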
