如何使用 R 在 html 中的注释标签内抓取表格? [英] How to scrape tables inside a comment tag in html with R?

查看:25
本文介绍了如何使用 R 在 html 中的注释标签内抓取表格?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从 http://www.basketball-reference 中抓取.com/teams/CHI/2015.html 使用 rvest.我使用了 selectorgadget 并发现我想要的表的标签是#advanced.但是,我注意到它没有捡起来.查看页面源代码,我注意到表格位于 html 注释标签 <!--

I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag to be #advanced for the table I want. However, I noticed it wasn't picking it up. Looking at the page source, I noticed that the tables are inside an html comment tag <!--

从评论标签中获取表格的最佳方法是什么?谢谢!

What is the best way to get the tables from inside the comment tags? Thanks!

我正在尝试拉高级"表:http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

推荐答案

好的..知道了.

library(stringi)
library(knitr)
library(rvest)


 any_version_html <- function(x){
       XML::htmlParse(x)
    }
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))

e <- html_table(any_version_html(d))


> kable(summary(e),'rst')
======  ==========  ====
Length  Class       Mode
======  ==========  ====
9       data.frame  list
2       data.frame  list
24      data.frame  list
21      data.frame  list
28      data.frame  list
28      data.frame  list
27      data.frame  list
30      data.frame  list
27      data.frame  list
27      data.frame  list
28      data.frame  list
28      data.frame  list
27      data.frame  list
30      data.frame  list
27      data.frame  list
27      data.frame  list
3       data.frame  list
======  ==========  ====


kable(e[[1]],'rst')


===  ================  ===  ====  ===  ==================  ===  ===  =================================
No.  Player            Pos  Ht     Wt  Birth Date          Â    Exp  College                          
===  ================  ===  ====  ===  ==================  ===  ===  =================================
 41  Cameron Bairstow  PF   6-9   250  December 7, 1990    au   R    University of New Mexico         
  0  Aaron Brooks      PG   6-0   161  January 14, 1985    us   6    University of Oregon             
 21  Jimmy Butler      SG   6-7   220  September 14, 1989  us   3    Marquette University             
 34  Mike Dunleavy     SF   6-9   230  September 15, 1980  us   12   Duke University                  
 16  Pau Gasol         PF   7-0   250  July 6, 1980        es   13                                    
 22  Taj Gibson        PF   6-9   225  June 24, 1985       us   5    University of Southern California
 12  Kirk Hinrich      SG   6-4   190  January 2, 1981     us   11   University of Kansas             
  3  Doug McDermott    SF   6-8   225  January 3, 1992     us   R    Creighton University    


## Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.

# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')

===  ==============  ===========
 Rk  Player          Salary     
===  ==============  ===========
  1  Derrick Rose    $18,862,875
  2  Carlos Boozer   $13,550,000
  3  Joakim Noah     $12,200,000
  4  Taj Gibson      $8,000,000 
  5  Pau Gasol       $7,128,000 
  6  Nikola Mirotic  $5,305,000 
===  ==============  ===========

这篇关于如何使用 R 在 html 中的注释标签内抓取表格?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆