R - Using rvest to scrape a password protected website without logging in at each loop iteration

Problem Description

I'm trying to scrape data from a password-protected website in R using the rvest package. My code currently logs in to the website at each iteration of a loop that will run about 15,000 times. This seems very inefficient, but I haven't found a way around it, because jumping to a different URL without first logging in just returns the website's login page. A simplified version of my code is as follows:

library(rvest)

url <- "https://example.com/login"    # password-protected website URL (placeholder)
session <- html_session(url)
form <- html_form(session)[[1]]

filled_form <- set_values(form,
                          `username` = "me@example.com",    # email (placeholder)
                          `password` = "my_password")       # password (placeholder)

# log in and scrape the first table
start_table <- submit_form(session, filled_form) %>%
  jump_to("https://example.com/data/start") %>%    # URL of the first table (placeholder)
  html_node("table.inlayTable") %>%
  html_table()
data_table <- start_table

for (i in 1:nrow(data_ids)) {
  # logging in again on every iteration is the inefficiency described above
  current_table <- try(submit_form(session, filled_form) %>%
    jump_to(paste0("https://example.com/data/", data_ids[i, ], "/detail")) %>%    # URL parts are placeholders
    html_node("table.inlayTable") %>%
    html_table())

  data_table <- rbind(data_table, current_table)
}

For simplicity, I've suppressed how I handle any errors thrown within the try() call. Note that data_ids is a data frame containing the part of the URL that is updated at each new iteration.
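
One common pattern for such a check, shown here only as a hypothetical sketch since the original handling is suppressed, is to test the class that try() returns before appending:

# hypothetical sketch: try() returns an object of class "try-error" on failure
if (!inherits(current_table, "try-error")) {
  data_table <- rbind(data_table, current_table)
}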

Does anyone have a suggestion for how this scraping could be achieved without logging in at each iteration of the loop?

Thanks! Yan

Answer

You can save the session in a variable, though I don't think it will save you that much time. Here is my web-scraping script:

library(rvest)

url <- "https://"    # login page URL (truncated in the original)
session <- html_session(url)
form <- html_form(session)[[1]]

# the field names ([login], [password]) depend on the site's login form
filled_form <- set_values(form, `[login]` = "xxx", `[password]` = "xxx")

# log in once; the session keeps the cookies for every later request
session <- submit_form(session, filled_form)

for (i in unique(id)) {                  # id: vector of URL fragments to visit
  link <- paste0("https://", i, "xxx")   # URL truncated in the original
  df_all <- session %>% jump_to(link) %>% html_table()
  if (length(df_all) != 0) {
    my_df <- as.data.frame(df_all[n], optional = TRUE)  # n: index of the table of interest
    database <- rbind(my_df, database)
    cat("Data saved for", i, "\n")
  } else {
    cat("No data for", i, "\n")
  }
}
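
Side note: rvest 1.0 renamed these session helpers: html_session() became session(), set_values() became html_form_set(), submit_form() became session_submit(), and jump_to() became session_jump_to(). A minimal sketch of the same log-in-once pattern under the current API, with placeholder URLs and form-field names:

library(rvest)

# log in once; the returned session object carries the cookies afterwards
session <- session("https://example.com/login")    # placeholder URL
form <- html_form(session)[[1]]
filled_form <- html_form_set(form, login = "xxx", password = "xxx")    # placeholder field names
session <- session_submit(session, filled_form)

# reuse the same authenticated session for every page
one_table <- session %>%
  session_jump_to("https://example.com/data/123") %>%    # placeholder URL
  html_element("table.inlayTable") %>%
  html_table()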
