Scraping tables on multiple web pages with rvest in R


Question

I am new to web scraping and am trying to scrape tables from multiple web pages. Here is the site: http://www.baseball-reference.com/teams/MIL/2016.shtml

I can scrape a table from a single page fairly easily with rvest. The page has multiple tables, but I only want the first one. Here is my code:

library(rvest)
url4 <- "http://www.baseball-reference.com/teams/MIL/2016.shtml"

Brewers2016 <- url4 %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
  html_table()

Brewers2016 <- as.data.frame(Brewers2016)

The problem is that I want to scrape the first table from each season's page, going back to 1970. There is a link to the previous year in the top-left corner, just above the table. Does anybody know how I can do this?

I am also open to other ways of doing this, for example a package other than rvest that might work better. I used rvest because it is the one I started learning with.

Recommended answer

One way is to build a vector of all the URLs you are interested in and then use sapply:

library(rvest)

years <- 1970:2016
urls <- paste0("http://www.baseball-reference.com/teams/MIL/", years, ".shtml")
# head(urls)

get_table <- function(url) {
  url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>% 
    html_table()
}

results <- sapply(urls, get_table)

results should be a list of 47 data.frame objects, each named with the URL (and therefore the year) it represents. That is, results[[1]] corresponds to 1970 and results[[47]] corresponds to 2016. Note the double brackets: results[1] returns a one-element list, while results[[1]] returns the data frame itself.
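
If some pages fail to load or lack the table, a slightly more defensive variant of the same idea may be easier to work with. The following is only a sketch, not part of the original answer: the names fetch_batting_table and scrape_years, the one-second pause, and the choice to return NULL on failure are all assumptions made for illustration. It uses lapply so the result is always a list, and names each element by year rather than by URL.

```r
# Sketch only: fetch_batting_table() and scrape_years() are hypothetical names.
base_url <- "http://www.baseball-reference.com/teams/MIL/"

# Fetch the first batting table from one page; return NULL on any failure.
fetch_batting_table <- function(url) {
  tryCatch({
    page <- rvest::read_html(url)
    nodes <- rvest::html_nodes(page, xpath = '//*[@id="div_team_batting"]/table[1]')
    tables <- rvest::html_table(nodes)
    if (length(tables) == 0) NULL else tables[[1]]
  }, error = function(e) NULL)
}

# Loop over years, pausing between requests, and name each element by year.
# The pause length (1 second) is an arbitrary assumption.
scrape_years <- function(years, fetch = fetch_batting_table, pause = 1) {
  urls <- paste0(base_url, years, ".shtml")
  results <- lapply(urls, function(u) {
    Sys.sleep(pause)
    fetch(u)
  })
  names(results) <- as.character(years)
  results
}

# Usage (makes one request per year):
# results <- scrape_years(1970:2016)
# results[["2016"]]                          # data frame for 2016
# failed <- names(Filter(is.null, results))  # years that returned nothing
```

Naming by year lets you write results[["1985"]] instead of counting positions, and the NULL entries show which seasons need a second look.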
