Extract Website Tables using rvest and html_nodes() and html_table()
Question
I'm trying to extract data from the Basketball Reference website.
library(rvest)
data7 <- read_html("http://www.basketball-reference.com/teams/CLE/2017.html") %>%
html_nodes("[id=roster]") %>%
html_table()
data7
The code above returns the data in the "roster" table. However, the same pipeline with the selector changed as follows does not return the "team_misc" table; it returns a list of length zero instead:
html_nodes("[id=team_misc]") %>%
I'm fairly new to rvest, so if anyone has any idea why this does not work, it would be greatly appreciated.
Recommended answer
There is actually already an answer to this, but it applies to an older version of the website. The reason you cannot get the other tables is that they are created dynamically: in the raw page that R reads, the tables you want sit inside commented-out strings, so a selector such as "[id=team_misc]" matches nothing. Inspect the page's elements in Chrome to see what I am referring to. The other answer is here: How to scrape tables inside a comment tag in html with R?
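To make the point about commented-out tables concrete, here is a minimal offline sketch. The HTML string is a made-up miniature page (not the real Basketball Reference markup), assumed only to have the same shape: one live table and one table wrapped in a comment.

```r
library(rvest) # html_nodes(), read_html()
library(xml2)  # xml_find_all(), xml_text()

# Hypothetical miniature page: "roster" is live, "team_misc" is inside a comment
page_src <- '
<html><body>
  <table id="roster"><tr><th>Player</th></tr><tr><td>A</td></tr></table>
  <!--
  <table id="team_misc"><tr><th>W</th><th>L</th></tr><tr><td>40</td><td>18</td></tr></table>
  -->
</body></html>'

page <- read_html(page_src)

# CSS selectors only see real element nodes, not text inside comments
n_live   <- length(html_nodes(page, "#roster"))    # the live table is matched
n_hidden <- length(html_nodes(page, "#team_misc")) # the commented-out table is not

# The comment node still carries the table markup as plain text
comment_text <- xml_text(xml_find_all(page, "//comment()"))
has_hidden   <- any(grepl('id="team_misc"', comment_text, fixed = TRUE))
```

Here `n_hidden` is zero even though `has_hidden` is `TRUE`, which is the same symptom as the zero-length list in the question.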
For your year's data:
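library(rvest) # provides read_html(), html_table() and %>%
library(xml2)  # provides xml_find_all() and xml_text()
A <- read_html('http://www.basketball-reference.com/teams/CLE/2017.html') %>% # read in the raw webpage
  xml_find_all('//comment()') %>% # use XPath to find all comment nodes
  xml_text() %>% # extract the text of each comment
  paste0(collapse = "") %>% # collapse everything into a single string
  read_html() %>% # re-parse that string as HTML
  xml_find_all("//table") %>% # find the tables that were hidden in the comments
  html_table() # convert each table to a data frame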
cat(capture.output(lapply(A, head, 1)), sep = "\n")
[[1]]
Date Type Note
1 Kevin Love 2017-02-12 Knee Love is expected to miss six weeks after undergoing arthroscopic surgery on his left knee.
[[2]]
X1 X2
1 Jim Boylan  Assistant Coach
[[3]]
G MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 Team 58 14020 2305 4938 0.467 761 1952 0.39 1544 2986 0.517 1073 1420 0.756 564 1988 2552 1304 414 237 804 1033 6444
[[4]]
NA NA NA NA NA NA NA NA NA NA Advanced NA Offense Four Factors NA NA NA Defense Four Factors NA NA NA NA
1 W L PW PL MOV SOS SRS ORtg DRtg Pace FTr 3PAr eFG% TOV% ORB% FT/FGA eFG% TOV% DRB% FT/FGA Arena Attendance
[[5]]
Rk Age G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS/G
1 1 LeBron James 32 54 54 37.5 9.6 17.7 0.541 1.7 4.4 0.387 7.9 13.3 0.592 0.589 4.8 6.9 0.691 1.1 6.7 7.9 8.9 1.4 0.6 4.3 1.7 25.7
[[6]]
Rk Age G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 1 LeBron James 32 54 54 2026 518 957 0.541 92 238 0.387 426 719 0.592 0.589 259 375 0.691 62 363 425 479 74 32 230 92 1387
[[7]]
Rk Age G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 1 LeBron James 32 54 54 2026 9.2 17 0.541 1.6 4.2 0.387 7.6 12.8 0.592 4.6 6.7 0.691 1.1 6.5 7.6 8.5 1.3 0.6 4.1 1.6 24.6
[[8]]
Rk Age G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS ORtg DRtg
1 1 LeBron James 32 54 54 2026 12.7 23.4 0.541 2.3 5.8 0.387 10.4 17.6 0.592 6.3 9.2 0.691 1.5 8.9 10.4 11.7 1.8 0.8 5.6 2.3 34 NA 118 107
[[9]]
Rk Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG% OWS DWS WS WS/48 OBPM DBPM BPM VORP
1 1 LeBron James 32 54 2026 26.3 0.618 0.249 0.392 3.5 19.1 11.6 41.7 1.8 1.3 17 29.4 NA 6.9 2.4 9.3 0.22 NA 6.3 1.8 8 5.1
[[10]]
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2-Pt Field Goals NA NA 3-Pt Field Goals NA
1 <NA> <NA> <NA> <NA> <NA> <NA> % of FGA by Distance <NA> <NA> <NA> NA <NA> FG% by Distance <NA> <NA> <NA> NA <NA> Dunks <NA> Corner
NA NA NA
1 <NA> Heaves <NA>
[[11]]
Rk Salary
1 1 LeBron James $30,963,450
[[12]]
Yr Tm Rd Pk Team G MP FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 Vladimir Veremeenko NA 2006 WAS 2 48 NA Reggio Emilia it 18 139 17 29 0.586 0 0 NA 4 9 0.444 14 10 24 8 2 3 9 33 38
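The list above is indexed by position, which can shift whenever the site adds or removes a table. As a sketch only, the helper below looks a hidden table up by its `id` instead. `comment_table()` is a hypothetical name introduced here, and the demo page is invented markup, so treat it as an illustration of the approach rather than tested code for the live site.

```r
library(rvest) # html_nodes(), html_table(), read_html(), %>%
library(xml2)  # xml_find_all(), xml_text()

# Hypothetical helper: re-parse the markup hidden in HTML comments and
# return the table with the given id as a data frame (NULL if not found).
comment_table <- function(page, id) {
  hidden <- page %>%
    xml_find_all("//comment()") %>%
    xml_text() %>%
    paste0(collapse = "")
  # Wrap the text so it is always treated as literal HTML, even when empty
  hidden_doc <- read_html(paste0("<html><body>", hidden, "</body></html>"))
  nodes <- html_nodes(hidden_doc, paste0("table#", id))
  if (length(nodes) == 0) return(NULL)
  html_table(nodes[[1]])
}

# Invented demo page with a single commented-out table
demo <- read_html('<html><body><!-- <table id="team_misc"><tr><th>W</th><th>L</th></tr><tr><td>40</td><td>18</td></tr></table> --></body></html>')
misc    <- comment_table(demo, "team_misc")
missing <- comment_table(demo, "no_such_table")
```

For the page in the question, the call would presumably be `comment_table(read_html(url), "team_misc")`, assuming the hidden table still carries that id.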