Loop across multiple urls in R with rvest

Question

I have a series of 9 urls that I would like to scrape data from:

http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0 

The offset= parameter at the end of the link goes from 0 up to 900 (in steps of 100) as the pages change through to the last page. I would like to loop through each page and scrape each table, then use rbind to stack the resulting dfs on top of one another in sequence. I have been using rvest and would like to use lapply since I am better with that than with for loops.

The question is similar to this (Harvest (rvest) multiple HTML pages from a list of urls) but different because I would prefer not to have to copy all the links to one vector before running the program. I would like a general solution to how to loop over multiple pages and harvest the data, creating a data frame each time.

The following works for the first page:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
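
For reference, a quick sanity check on the objects above (just an inspection step, not part of the scrape itself) confirms that a table was found and shows what came back:

length(draft_table)   # number of <table> nodes found on the page
dim(draft)            # rows and columns of the scraped data frame
head(draft)           # first few rows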

But I would like to repeat this over all pages without having to paste the urls into a vector. I tried the following and it didn't work:

jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

So there should be a data frame for each page and I imagine it would be easier to put them in a list and then use rbind to stack them.

Any help would be greatly appreciated!

Solution

You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() expects a single scalar value, one URL per call, since it reads in web data one page at a time. Consider looping through the vector of sites with lapply and then binding all the dfs together:

jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)                        # read one page at a time
    draft_table <- html_nodes(webpage, 'table')    # all <table> nodes on the page
    draft <- html_table(draft_table)[[1]]          # first table as a data frame
})

finaldf <- do.call(rbind, dfList)             # ASSUMING ALL DFs MAINTAIN SAME COLS
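
As a side note, [[1]] will error on any page that returns no table, and do.call(rbind, ...) will fail if the columns drift between pages. Below is a minimal, more defensive sketch, assuming the dplyr package is installed (bind_rows fills missing columns with NA and drops NULL list entries); it also pauses briefly between requests to be polite to the server:

library(dplyr)          # assumed available, for bind_rows()

dfList <- lapply(site, function(i) {
    Sys.sleep(1)                            # brief pause between requests
    webpage <- read_html(i)
    tables <- html_nodes(webpage, 'table')
    if (length(tables) == 0) return(NULL)   # skip pages without a table
    html_table(tables[[1]])
})

finaldf <- bind_rows(dfList)                # tolerates differing column sets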
