Loop across multiple URLs in R with rvest


Question

I have a series of 9 URLs that I would like to scrape data from:

http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0 

The offset= at the end of the link goes from 0 up to 900 (by 100) as the pages advance through the last page. I would like to loop through each page, scrape each table, and then use rbind to stack the data frames on top of one another in sequence. I have been using rvest and would like to use lapply, since I am more comfortable with it than with for loops.
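As a minimal sketch of that offset arithmetic (base_url here is a hypothetical placeholder for the long query string above):

offsets <- seq(0, 900, by = 100)     # 0, 100, 200, ..., 900
urls <- paste0(base_url, offsets)    # one URL per results page

(Strictly, 0 through 900 by 100 yields ten offsets; the answer below uses 0 through 800, which matches the nine pages.)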

The question is similar to this one (Harvest (rvest) multiple HTML pages from a list of urls) but different, because I would prefer not to have to copy all the links into a vector before running the program. I would like a general solution for looping over multiple pages and harvesting the data, creating a data frame each time.

The following works for the first page:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 

webpage <- read_html(site)                    # read the page once
draft_table <- html_nodes(webpage, 'table')   # find all <table> nodes on the page
draft <- html_table(draft_table)[[1]]         # parse the first table into a data frame
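For a quick sanity check (assuming the table parsed as expected), the scraped result can be inspected:

dim(draft)     # rows x columns of the scraped table
head(draft)    # first few rows of draft data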

But I would like to repeat this over all pages without having to paste the URLs into a vector. I tried the following and it didn't work:

jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

So there should be a data frame for each page, and I imagine it would be easiest to put them in a list and then use rbind to stack them.

Any help would be greatly appreciated!

Answer

You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() requires one page per call, since it reads web data one page at a time and expects a scalar value. Consider looping through the site vector with lapply, then binding all of the data frames together:

jump <- seq(0, 800, by = 100)    # page offsets: 0, 100, ..., 800
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)                      # one page per read_html() call
    draft_table <- html_nodes(webpage, 'table')  # locate the table nodes
    html_table(draft_table)[[1]]                 # return the parsed data frame
})

finaldf <- do.call(rbind, dfList)                # ASSUMING ALL DFs MAINTAIN SAME COLS
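If the pages do not all return identical columns, do.call(rbind, ...) will error out. As a sketch of a more forgiving variant (assuming a dplyr dependency is acceptable, and reusing the site vector built above), dplyr::bind_rows() fills any missing columns with NA, and a short pause between requests is gentler on the server:

library(dplyr)

dfList <- lapply(site, function(i) {
    Sys.sleep(1)                                 # brief pause between requests
    webpage <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    html_table(draft_table)[[1]]
})

finaldf <- bind_rows(dfList)                     # fills missing columns with NA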

