Scrape and Loop with Rvest

Problem description

I have reviewed several answers to similar questions on SO related to this topic, but none of them seems to work for me.

(Loop across multiple URLs in R with rvest)

(Harvest (rvest) multiple HTML pages from a list of URLs)

I have a list of URLs and I wish to grab the table from each and append it to a master dataframe.

## get all urls into one list
page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
  urls[[i]] <- url
}


### loop over the urls and get the table from each page
table<- data.frame()
for (j in urls) {
  tbl<- urls[j] %>% 
    read_html() %>% 
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}

The first section works as expected and gets the list of URLs I want to scrape. The second loop, however, fails with the following error:

 Error in UseMethod("read_xml") : 
  no applicable method for 'read_xml' applied to an object of class "list"

Any suggestions on how to fix this error and get the 3 tables combined into a single data frame? I appreciate any tips or pointers.

Recommended answer

Try this:

library(tidyverse)  # provides the %>% pipe
library(rvest)      # read_html(), html_node(), html_table()

# build the list of page urls
page <- 0:2
urls <- list()
for (i in seq_along(page)) {
  urls[[i]] <- paste0("https://www.mlssoccer.com/stats/season?page=", page[i])
}

### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {   # j is the position (1, 2, 3), not the url itself
  tbl[[j]] <- urls[[j]] %>%    # urls[[j]] extracts the url string; its table becomes element j of tbl
    read_html() %>%
    html_node("table") %>%
    html_table()
  # no manual j <- j + 1 is needed: the for loop advances j on its own
}

# convert the list of tables to a single data frame
tbl <- do.call(rbind, tbl)
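To make the final combining step concrete, here is a minimal sketch that runs without any network access. The three small data frames are stand-ins for what html_table() might return, and the column names player and goals and all values are placeholders I made up, not data from the site.

```r
# Stand-ins for the scraped tables (placeholder columns and values only).
tables <- list(
  data.frame(player = c("A", "B"), goals = c(3L, 1L)),
  data.frame(player = c("C", "D"), goals = c(2L, 5L)),
  data.frame(player = "E", goals = 4L)
)

# rbind() on data frames matches columns by name, so check that
# every table has the same column names before combining.
same_cols <- all(vapply(
  tables,
  function(x) identical(names(x), names(tables[[1]])),
  logical(1)
))

# do.call() passes the list elements as separate arguments,
# which is equivalent to rbind(tables[[1]], tables[[2]], tables[[3]]).
combined <- do.call(rbind, tables)

nrow(combined)
names(combined)
```

If the pages ever return tables with different column names, rbind() will stop with an error, so a check like same_cols is a cheap safeguard before combining.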

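The original loop failed because for (j in urls) sets j to each URL string rather than to its position, so urls[j] looks up an element by a name that does not exist and returns a list, which read_html() cannot read. Looping over seq_along(urls) and extracting with urls[[j]] passes the URL string itself.

The table[[j]] <- tbl assignment at the end of the for loop in the original code is also unnecessary, because each scraped table is already stored as an element of the tbl list here: tbl[[j]] <- urls[[j]] %>% ...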
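The difference between single and double brackets can be checked without scraping anything. The sketch below uses only base R; the variable names first_j, wrong and right are mine, chosen for illustration.

```r
# Build the same three page URLs; paste0() is vectorised, so no loop is needed.
urls <- as.list(paste0("https://www.mlssoccer.com/stats/season?page=", 0:2))

# In `for (j in urls)`, j takes the value of each element (a URL string).
first_j <- NULL
for (j in urls) {
  first_j <- j
  break
}

# Single brackets with that string look for an element *named* like the URL.
# `urls` has no names, so the result is a one-element list holding NULL.
wrong <- urls[first_j]

# Indexing by position with double brackets returns the string itself.
right <- urls[[1]]

class(wrong)  # "list"
class(right)  # "character"
```

Passing wrong to read_html() is what triggers the "no applicable method for 'read_xml' applied to an object of class "list"" error, while right is a plain character string that read_html() accepts.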
