scrape HTML table with multiple pages using R


Question

I am trying to make a data frame by scraping from the web, but there are multiple pages that make up the table I am trying to scrape: same link, but a different page number.

For the first page, this is how I would scrape it:

library(XML)
CB.13 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p=1&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
CB.13 <- readHTMLTable(CB.13, header = FALSE)

## find the index of the parsed table whose first and last cells are both 1
cornerback.function <- function(CB.13) {
  first <- 1
  last <- 1
  for (i in 1:length(CB.13)) {
    lastrow <- nrow(CB.13[[i]])
    lastcol <- ncol(CB.13[[i]])
    if (as.numeric(CB.13[[i]][1, 1]) == first &
        as.numeric(CB.13[[i]][lastrow, lastcol]) == last) {
      return(i)  ## return the index; assigning to a local variable would be lost
    }
  }
}
tab <- cornerback.function(CB.13)
cornerbacks.2013 <- CB.13[[tab]]
cb.names <- c("Rk", "name", "Team", "Pos", "Comb", "Total", "Ast", "Sck",
              "SFTY", "PDef", "Int", "TDs", "Yds", "Lng", "FF", "Rec", "TD")
names(cornerbacks.2013) <- cb.names

I need to do this for multiple years, all with multiple pages. Is there a quicker way to get all of the pages of the data, instead of having to do this for each individual page of the table and then merge them? The next link would be http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&Submit=Go&experience=&archive=false&conference=null&d-447263-p=2&statisticPositionCategory=DEFENSIVE_BACK&qualified=true

There are 8 pages for this year. Maybe a for loop to loop through the pages?

Answer

You can dynamically create the url using paste0, since the urls differ only slightly: for a given year, you change just the page number. You get a url structure like:

url <- paste0(url1,year,url2,page,url3) ## you change page or year or both

You can create a function that loops over the different pages and returns a table, then bind the tables together using the classic do.call(rbind, ...):

library(XML)
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
year <- 2013
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
page <- 1
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

getTable <- function(page = 1, year = 2013) {
  url <- paste0(url1, year, url2, page, url3)
  tab <- readHTMLTable(url, header = FALSE)
  tab$result  ## the stats table on this page is parsed under the name "result"
}

## this will merge all tables in a single big table
do.call(rbind, lapply(seq_len(8), getTable, year = 2013))
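
Since the question also asks about multiple years, here is a minimal sketch of how getTable could be reused across seasons. It assumes (as for 2013) that each season has 8 pages; the years vector and the season column are introduced here purely for illustration:

## stack all pages of several seasons into one data frame,
## tagging each row with the season it came from
years <- 2011:2013  ## example seasons; adjust as needed
all.years <- do.call(rbind, lapply(years, function(y) {
  one.year <- do.call(rbind, lapply(seq_len(8), getTable, year = y))
  one.year$season <- y
  one.year
}))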

The general method

The general method is to scrape the next-page url using an xpath expression and loop until there is no new next page. This can be more difficult to do, but it is the cleanest solution.

getNext <- function(url = url_base) {
  doc <- htmlParse(url)
  ## locate the "next" link in the pagination bar
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc, XPATH_NEXT, xmlGetAttr, 'href'))
  if (length(next_page) > 0)
    paste0("http://www.nfl.com", next_page)
  else ''
}

## url_base is your first url
res <- NULL  ## NULL rather than list(): rbind(NULL, df) simply returns df
while (TRUE) {
  tab <- readHTMLTable(url_base, header = FALSE)
  res <- rbind(res, tab$result)
  url_base <- getNext(url_base)
  if (nchar(url_base) == 0)
    break
}
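
To tie the pieces together, a minimal end-to-end sketch under the same assumptions as above (the url fragments url1/url2/url3, the "result" table name, and the cb.names vector from the question); scrapeAll is a helper name introduced here purely for illustration:

## follow "next" links from page 1 until they run out, then rbind everything
scrapeAll <- function(year = 2013) {
  url <- paste0(url1, year, url2, 1, url3)  ## start at page 1
  res <- NULL
  while (nchar(url) > 0) {
    tab <- readHTMLTable(url, header = FALSE)
    res <- rbind(res, tab$result)
    url <- getNext(url)  ## returns '' when there is no next page
  }
  res
}
cornerbacks.2013 <- scrapeAll(2013)
names(cornerbacks.2013) <- cb.names  ## column names from the question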
