Loop across multiple urls in R with rvest
Problem description
I have a series of 9 urls that I would like to scrape data from:
http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0
The offset= parameter at the end of the url goes from 0 up to 900 (by 100) as the pages advance through to the last page. I would like to loop through each page, scrape each table, and then use rbind to stack the resulting data frames on top of one another in sequence. I have been using rvest and would like to use lapply, since I am more comfortable with it than with for loops.
The question is similar to this (Harvest (rvest) multiple HTML pages from a list of urls) but different because I would prefer not to have to copy all the links to one vector before running the program. I would like a general solution to how to loop over multiple pages and harvest the data, creating a data frame each time.
The following works for the first page:
library(rvest)
library(stringr)
library(tidyr)
site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0'
webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
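Before indexing into the first element, it can help to confirm that html_nodes() actually matched a table on the page; a minimal sanity check, assuming the draft_table object from the snippet above:

```r
# Quick sanity check (a sketch): confirm at least one table was found
# before taking the first element.
length(draft_table)                  # number of <table> nodes on the page
head(html_table(draft_table)[[1]])   # preview the first few rows
```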
But I would like to repeat this over all pages without having to paste the urls into a vector. I tried the following and it didn't work:
jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")
webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
So there should be a data frame for each page and I imagine it would be easier to put them in a list and then use rbind to stack them.
Any help would be greatly appreciated!
Answer

You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() requires one page per call, since it reads web data one page at a time and expects a scalar value. Consider looping through the site vector with lapply, then binding all the data frames together:
jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
'&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
'&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
'&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
'&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
'&order_by_asc=&offset=', jump, sep="")
dfList <- lapply(site, function(i) {
webpage <- read_html(i)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
})
finaldf <- do.call(rbind, dfList) # ASSUMING ALL DFs MAINTAIN SAME COLS
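If the scrape ever needs to be more defensive, a small delay between requests and a tryCatch guard keep one failing page from aborting the whole run. This is a sketch, not part of the original answer, and assumes the site vector built above:

```r
# Hardened variant of the loop above: pause between requests and
# skip any page whose table fails to parse (base R only).
dfList <- lapply(site, function(url) {
  Sys.sleep(1)                        # be polite to the server
  tryCatch({
    webpage <- read_html(url)
    html_table(html_nodes(webpage, 'table'))[[1]]
  }, error = function(e) NULL)        # NULL marks a failed page
})
dfList  <- Filter(Negate(is.null), dfList)   # drop the failures
finaldf <- do.call(rbind, dfList)
```

do.call(rbind, ...) still assumes every page returns a table with the same columns; if that ever varies, inspect the individual list elements before binding.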