scrape HTML table with multiple pages using R
Question
I am trying to make a data frame by scraping a table from the web, but the table I want is spread across multiple pages: same link, different page number.
For the first page, this is how I would scrape it:
library(XML)

url.p1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p=1&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
CB.13 <- readHTMLTable(url.p1, header = FALSE)

## Find which of the parsed tables is the stats table:
## its first cell and its last cell both contain "1".
cornerback.function <- function(tables) {
  first <- 1
  last <- 1
  tab <- NA
  for (i in seq_along(tables)) {
    lastrow <- nrow(tables[[i]])
    lastcol <- ncol(tables[[i]])
    if (as.numeric(as.character(tables[[i]][1, 1])) == first &&
        as.numeric(as.character(tables[[i]][lastrow, lastcol])) == last) {
      tab <- i
    }
  }
  tab  ## return the matching table's index
}

tab <- cornerback.function(CB.13)
cornerbacks.2013 <- CB.13[[tab]]
cb.names <- c("Rk", "name", "Team", "Pos", "Comb", "Total", "Ast", "Sck",
              "SFTY", "PDef", "Int", "TDs", "Yds", "Lng", "FF", "Rec", "TD")
names(cornerbacks.2013) <- cb.names
I need to do this for multiple years, each with multiple pages. Is there a quicker way to get all of the pages of data, instead of scraping each individual page of the table and merging them? The next link would be http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&Submit=Go&experience=&archive=false&conference=null&d-447263-p=2&statisticPositionCategory=DEFENSIVE_BACK&qualified=true
There are 8 pages for this year, so maybe a for loop to loop through the pages?
Answer
You can dynamically create the URL using paste0, since the URLs differ only slightly: for a given year you change just the page number. You get a URL structure like:
url <- paste0(url1, year, url2, page, url3) ## you change page or year or both
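As a concrete check, the pieces below reassemble the page-2 URL quoted in the question. This is pure string work, so it runs without any network access:

```r
# Fixed pieces of the URL; only the year and the page number change
# between requests.
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

# Page 2 of the 2013 season
url <- paste0(url1, 2013, url2, 2, url3)
```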
You can create a function that loops over the different pages and returns a table, then bind the tables together using the classic do.call(rbind, ...):
library(XML)

url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

getTable <- function(page = 1, year = 2013) {
  url <- paste0(url1, year, url2, page, url3)
  tab <- readHTMLTable(url, header = FALSE)
  tab$result  ## on this page the stats table is the element named "result"
}
## this will merge all tables in a single big table
do.call(rbind, lapply(seq_len(8), getTable, year = 2013))
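Since the question also asks about multiple years, one option is to build every (year, page) URL up front with expand.grid. The 8-pages-per-season count is an assumption carried over from the question and may differ for other years:

```r
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

# One row per (page, year) combination; page varies fastest.
# 8 pages per season is an assumption and may need adjusting per year.
grid <- expand.grid(page = 1:8, year = 2011:2013)
urls <- paste0(url1, grid$year, url2, grid$page, url3)
```

Each element of urls can then be passed to readHTMLTable(..., header = FALSE) and the resulting tables stacked with do.call(rbind, ...), just as in the single-year case above.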
The general method
The general method is to scrape the next-page URL with an XPath expression and loop until there is no new next page. This can be harder to do, but it is the cleanest solution.
getNext <- function(url = url_base) {
  doc <- htmlParse(url)
  ## XPath for the "next" link in the page navigation bar
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc, XPATH_NEXT, xmlGetAttr, 'href'))
  if (length(next_page) > 0)
    paste0("http://www.nfl.com", next_page)
  else ''
}
## url_base is your first url
res <- NULL
while (TRUE) {
  tab <- readHTMLTable(url_base, header = FALSE)
  res <- rbind(res, tab$result)
  url_base <- getNext(url_base)
  if (nchar(url_base) == 0)
    break
}