Web Scraping multiple Links using R


Problem Description


I am working on a web scraping program to search for data from multiple sheets. The code below is an example of what I am working with. I am able to get only the first sheet on this. It will be of great help if someone can point out where I am going wrong in my syntax.

library(rvest)  # needed for read_html(), html_nodes(), html_text()

jump <- seq(1, 10, by = 1)

site <- paste0("https://stackoverflow.com/search?page=", jump, "&tab=Relevance&q=%5bazure%5d%20free%20tier")

dflist <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, '.excerpt')
  draft <- html_text(draft_table)
  draft
})



finaldf <- do.call(cbind, dflist)

finaldf_10 <- data.frame(finaldf)

View(finaldf_10)


Below is the link from where I need to scrape the data which has 127 pages.

https://stackoverflow.com/search?q=%5Bazure%5D+free+tier


As per the above code I am able to get data only from the first page and not the rest of the pages. There is also no syntax error. Could you please help me find out where I am going wrong?

Answer


Some websites put security measures in place to prevent bulk scraping. I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md


In fact, if you delay your calls a little, this will work. I've tried with a 5-second Sys.sleep. I guess you can reduce the delay, but then it may not work (I tried with a 1-second Sys.sleep, and that didn't work).

Here is working code:

library(rvest)
library(purrr)

dflist <- map(.x = 1:10, .f = function(x) {
  Sys.sleep(5)  # pause between requests to avoid being blocked for bulk scraping
  url <- paste0("https://stackoverflow.com/search?page=", x, "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)  # stack the per-page results row-wise
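A side note on the combining step: the question's code used do.call(cbind, dflist), but each page returns a one-column set of excerpts whose row count can vary, so the per-page results should be stacked row-wise with rbind. A minimal offline sketch (the excerpt strings here are made up for illustration) of the difference:

```r
# Hypothetical per-page results: each page yields a one-column data frame
# of excerpt texts, but the number of rows per page can differ.
page1 <- data.frame(excerpt = c("result A", "result B"))
page2 <- data.frame(excerpt = c("result C"))

# rbind stacks the pages row-wise into one long data frame:
# one row per scraped excerpt, a single 'excerpt' column.
combined <- rbind(page1, page2)
nrow(combined)  # 3

# cbind, by contrast, tries to pair pages column-wise, which errors
# (or silently recycles rows) when the pages have different row counts.
```

This is why the answer's pipeline ends with do.call(rbind, .) rather than cbind.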

Best,

Colin

