rvest: follow different links with same tag


Problem Description

I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data:

http://www.sports-reference.com/cfb/years/2007-schedule.html

As you can see, there is a "Date" column with the dates hyperlinked; each hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, a lot of games take place on the same dates, which means the text of their hyperlinks is the same. So if I scrape the hyperlinks from the table (which I have done) and then do something like:

library(rvest)

url   <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links <- ...  # character vector with the scraped date links

stats <- vector("list", length(links))  # keep every game's result, not just the last
for (i in seq_along(links)) {
  stats[[i]] <- html_session(url) %>%
    follow_link(links[i]) %>%           # was link[i]; matches the FIRST link with that text
    html_nodes('whateverthisnodeis') %>%
    html_table()
}

it will scrape from the first link corresponding to each date. For example, there were 11 games that took place on Aug 30, 2007, but if I put that date in the follow_link function, it grabs data from the first game (Boise St. Weber St.) every time. Is there any way I can specify that I want it to move down the table?
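One thing worth adding (not part of the original question): follow_link() also accepts an integer instead of link text, selecting the i-th link on the page, which is one way to "move down the table" past duplicates. A minimal sketch of that approach; the "/cfb/boxscores/" href pattern used here to locate the date links is an assumption:

library(rvest)

url <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg  <- read_html(url)

# When given an integer, follow_link() selects the i-th link on the page,
# so work out which positions hold the date links. The "/cfb/boxscores/"
# href pattern is assumed for illustration.
all_hrefs <- html_attr(html_nodes(pg, "a"), "href")
date_pos  <- which(grepl("/cfb/boxscores/", all_hrefs))

stats <- lapply(date_pos, function(i) {
  html_session(url) %>%
    follow_link(i) %>%                  # integer i: follow the i-th link on the page
    html_nodes('whateverthisnodeis') %>%
    html_table()
})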

I have already found a workaround by figuring out the formula for the URLs to which the date hyperlinks take you, but it's a pretty convoluted process, so I thought I'd see if anyone knew how to do it this way.

Recommended Answer

Here's a complete example:

library(rvest)
library(dplyr)
library(pbapply)

# Get the main page

URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)  # html() is defunct in current rvest; read_html() is the equivalent

# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")
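# Each href is that game's own boxscore path, so the hrefs stay unique even
# when several games share the same visible date text; this is what
# sidesteps the follow_link() first-match problem described above.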

# I'm only limiting to 10 since I really don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free

scoring_games <- pblapply(links[1:10], function(x) {

  game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))
  scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
  colnames(scoring) <- scoring[1,]  # the real column names sit in the first data row; promote them
  filter(scoring[-1,], !Player %in% c("", "Player"))  # drop the promoted row and repeated header rows

})

# you can bind_rows them all together but you should 
# probably add a column for the game then

bind_rows(scoring_games)

## Source: local data frame [27 x 11]
## 
##             Player            School   Cmp   Att   Pct   Yds   Y/A  AY/A    TD   Int  Rate
##              (chr)             (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1     Taylor Tharp       Boise State    14    19  73.7   184   9.7  10.7     1     0 172.4
## 2       Nick Lomax       Boise State     1     5  20.0     5   1.0   1.0     0     0  28.4
## 3    Ricky Cookman       Boise State     1     2  50.0     9   4.5 -18.0     0     1 -12.2
## 4         Ben Mauk        Cincinnati    18    27  66.7   244   9.0   8.9     2     1 159.6
## 5        Tony Pike        Cincinnati     6     9  66.7    57   6.3   8.6     1     0 156.5
## 6   Julian Edelman        Kent State    17    26  65.4   161   6.2   3.5     1     2 114.7
## 7       Bret Meyer        Iowa State    14    23  60.9   148   6.4   3.4     1     2 111.9
## 8       Matt Flynn   Louisiana State    12    19  63.2   128   6.7   8.8     2     0 154.5
## 9  Ryan Perrilloux   Louisiana State     2     3  66.7    21   7.0  13.7     1     0 235.5
## 10   Michael Henig Mississippi State    11    28  39.3   120   4.3  -5.4     0     6  32.4
## ..             ...               ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
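As the comment before bind_rows() suggests, you will probably want a column identifying which game each row came from before stacking. A minimal sketch of one way to do that (this Map()/mutate() step is my addition, not part of the original answer; it assumes scoring_games and links as created above):

# Tag each per-game data frame with the boxscore path it came from,
# then stack them into a single data frame.
scoring_games <- Map(function(df, href) mutate(df, game = href),
                     scoring_games, links[1:10])
bind_rows(scoring_games)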
