Loop across multiple urls in R with rvest

Question

I have a series of 9 urls that I would like to scrape data from:

http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0 

The offset= parameter at the end of the link goes from 0 up to 900 (in steps of 100) as the pages change through to the last page. I would like to loop through each page and scrape each table, then use rbind to stack the resulting dfs on top of one another in sequence. I have been using rvest and would like to use lapply since I am better with that than with for loops.

The question is similar to this (Harvest (rvest) multiple HTML pages from a list of urls) but different because I would prefer not to have to copy all the links to one vector before running the program. I would like a general solution to how to loop over multiple pages and harvest the data, creating a data frame each time.

The following works for the first page:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
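
For reference, a quick sanity check on the objects above (just an inspection step, not part of the scrape itself) confirms that a table was found and shows what came back:

length(draft_table)   # number of <table> nodes found on the page
dim(draft)            # rows and columns of the scraped data frame
head(draft)           # first few rows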

But I would like to repeat this over all pages without having to paste the urls into a vector. I tried the following and it didn't work:

jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

So there should be a data frame for each page and I imagine it would be easier to put them in a list and then use rbind to stack them.

Any help would be greatly appreciated!

Solution

You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() expects a single scalar value, one URL per call, since it reads in web data one page at a time. Consider looping through the vector of sites with lapply and then binding all the dfs together:

jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)                        # read one page at a time
    draft_table <- html_nodes(webpage, 'table')    # all <table> nodes on the page
    draft <- html_table(draft_table)[[1]]          # first table as a data frame
})

finaldf <- do.call(rbind, dfList)             # ASSUMING ALL DFs MAINTAIN SAME COLS
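
As a side note, [[1]] will error on any page that returns no table, and do.call(rbind, ...) will fail if the columns drift between pages. Below is a minimal, more defensive sketch, assuming the dplyr package is installed (bind_rows fills missing columns with NA and drops NULL list entries); it also pauses briefly between requests to be polite to the server:

library(dplyr)          # assumed available, for bind_rows()

dfList <- lapply(site, function(i) {
    Sys.sleep(1)                            # brief pause between requests
    webpage <- read_html(i)
    tables <- html_nodes(webpage, 'table')
    if (length(tables) == 0) return(NULL)   # skip pages without a table
    html_table(tables[[1]])
})

finaldf <- bind_rows(dfList)                # tolerates differing column sets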
