Web scraping a data table with R rvest
Question
I'm trying to scrape a table from the following website:
http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats
The table is titled "Miscellaneous Stats". The problem is that there are multiple tables on this page and I don't know whether I'm identifying the correct one. I tried the following code, but all it creates is a blank data frame:
library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
tmisc <- adv %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="div_misc_stats"]') %>%
  html_table()
tmisc <- data.frame(tmisc)
I have a feeling I'm missing something trivial, but I haven't found it in any of my Google searches. Any help is much appreciated.
Recommended answer
Since the table you want is hidden in a comment until revealed by JavaScript, you need to either use RSelenium to run the JavaScript (which is kind of a pain) or parse the comments (which is still a pain, but slightly less so).
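To see why the original XPath comes back empty, here is a minimal offline sketch. The inline page is a made-up stand-in for the real site (its contents and the two-column table are assumptions for illustration); only the `misc_stats` id is taken from the answer. It shows that a table inside a comment is invisible to a normal selector, but can be recovered by re-parsing the comment text.

```r
library(rvest)

# Hypothetical miniature page: the table exists only inside an HTML comment,
# mimicking a page that hides it until JavaScript runs.
page <- read_html('<html><body>
  <div id="all_misc_stats">
    <!--
    <table id="misc_stats">
      <tr><th>Team</th><th>Age</th></tr>
      <tr><td>A</td><td>27.4</td></tr>
      <tr><td>B</td><td>30.3</td></tr>
    </table>
    -->
  </div>
</body></html>')

# A direct lookup finds no nodes, because comment contents are not
# part of the parsed element tree.
direct <- html_nodes(page, "table#misc_stats")

# Extract the comment text, re-parse it as HTML, then select the table.
hidden <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("table#misc_stats") %>%
  html_table()

length(direct)  # number of tables found without parsing comments
hidden          # the table recovered from the comment
```

The same idea, applied to the real page, is what the code below does.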
library(rvest)
library(readr) # for type_convert
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
h <- adv %>% read_html() # be kind; don't rescrape unless necessary
df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comments
  html_text() %>%    # extract comment text
  paste(collapse = '') %>%    # collapse to single string
  read_html() %>%    # reread as HTML
  html_node('table#misc_stats') %>%    # select desired node
  html_table() %>%    # parse node to table
  { setNames(.[-1, ], paste0(names(.), .[1, ])) } %>%    # extract names from first row
  type_convert()    # fix column types
df[1:6, 1:14]
## Rk Team Age PW PL MOV SOS SRS ORtg DRtg Pace FTr 3PAr TS%
## 2 1 Golden State Warriors* 27.4 65 17 10.76 -0.38 10.38 114.5 103.8 99.3 0.250 0.362 0.593
## 3 2 San Antonio Spurs* 30.3 67 15 10.63 -0.36 10.28 110.3 99.0 93.8 0.246 0.223 0.564
## 4 3 Oklahoma City Thunder* 25.8 59 23 7.28 -0.19 7.09 113.1 105.6 96.7 0.292 0.275 0.565
## 5 4 Cleveland Cavaliers* 28.1 57 25 6.00 -0.55 5.45 110.9 104.5 93.3 0.259 0.352 0.558
## 6 5 Los Angeles Clippers* 29.7 53 29 4.28 -0.15 4.13 108.3 103.8 95.8 0.318 0.324 0.556
## 7 6 Toronto Raptors* 26.3 53 29 4.50 -0.42 4.08 110.0 105.2 92.9 0.328 0.287 0.552
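The `setNames(...)` step is the least obvious part of the pipeline, so here is a small base R sketch of it in isolation. The data frame is hypothetical: it assumes the parsed table carries group labels (or blanks) as its column names and the real column names in its first row, which is the situation that step is written for. Base `type.convert()` stands in for `readr::type_convert()` here to keep the sketch dependency-free.

```r
# Hypothetical parsed table: the real column names sit in the first row,
# while the parsed names hold group labels (blank for ungrouped columns).
raw <- data.frame(
  c("Rk", "1", "2"),
  c("Team", "A", "B"),
  c("eFG%", "0.563", "0.526"),
  c("eFG%", "0.479", "0.477"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Offense ", "Defense ")

# Prefix each first-row value with its group label, use the result as the
# column names, and drop the first row now that it has been consumed.
clean <- setNames(raw[-1, ], paste0(names(raw), unlist(raw[1, ])))

# Every column is still character at this point; convert each to its
# natural type (integer, double, or character).
clean[] <- lapply(clean, type.convert, as.is = TRUE)

str(clean)
```

Prefixing with the group label keeps otherwise duplicated names, such as the two `eFG%` columns, distinct.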