Web scraping a data table with R rvest

Question

I'm trying to scrape a table from the following website:

http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats

The table is titled "Miscellaneous Stats". The problem is that there are multiple tables on this webpage, and I don't know whether I'm identifying the correct one. I tried the following code, but all it creates is an empty data frame:

library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
tmisc <- adv %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="div_misc_stats"]') %>%
  html_table()
tmisc <- data.frame(tmisc)

I have a feeling I'm missing something trivial, but I haven't found it in any of my Google searches. Any help is much appreciated.

Recommended answer

Since the table you want is hidden in a comment until revealed by JavaScript, you either need to use RSelenium to run the JavaScript (which is kind of a pain), or parse the comments (which is still a pain, but slightly less so).

library(rvest)
library(readr)    # for type_convert

adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"

h <- adv %>% read_html()    # be kind; don't rescrape unless necessary

df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comments
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to single string
    read_html() %>%    # reread as HTML
    html_node('table#misc_stats') %>%    # select desired node
    html_table() %>%    # parse node to table
    { setNames(.[-1, ], paste0(names(.), .[1, ])) } %>%    # extract names from first row
    type_convert()    # fix column types

df[1:6, 1:14]
##   Rk                   Team  Age PW PL   MOV   SOS   SRS  ORtg  DRtg Pace   FTr  3PAr   TS%
## 2  1 Golden State Warriors* 27.4 65 17 10.76 -0.38 10.38 114.5 103.8 99.3 0.250 0.362 0.593
## 3  2     San Antonio Spurs* 30.3 67 15 10.63 -0.36 10.28 110.3  99.0 93.8 0.246 0.223 0.564
## 4  3 Oklahoma City Thunder* 25.8 59 23  7.28 -0.19  7.09 113.1 105.6 96.7 0.292 0.275 0.565
## 5  4   Cleveland Cavaliers* 28.1 57 25  6.00 -0.55  5.45 110.9 104.5 93.3 0.259 0.352 0.558
## 6  5  Los Angeles Clippers* 29.7 53 29  4.28 -0.15  4.13 108.3 103.8 95.8 0.318 0.324 0.556
## 7  6       Toronto Raptors* 26.3 53 29  4.50 -0.42  4.08 110.0 105.2 92.9 0.328 0.287 0.552
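
To see why the original XPath query returned nothing, and to check the comment-parsing idea without hitting the website, here is a minimal offline sketch. The HTML string, the `all_demo` and `demo_stats` ids, and the team values are all made up for illustration; only the rvest calls match the ones used above.

```r
library(rvest)

# Hypothetical miniature page: the table exists only inside an HTML comment,
# mimicking how the real page hides the table until JavaScript runs.
page <- '<html><body>
<div id="all_demo">
<!--
<table id="demo_stats">
  <tr><th>Team</th><th>PW</th></tr>
  <tr><td>A</td><td>65</td></tr>
  <tr><td>B</td><td>67</td></tr>
</table>
-->
</div>
</body></html>'

doc <- read_html(page)

# A direct query matches no nodes: the parser sees a comment, not a <table>.
n_direct <- length(html_nodes(doc, "table#demo_stats"))

# Extract the comment text, reparse it as HTML, then query again.
hidden <- doc %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html()

demo <- hidden %>%
  html_node("table#demo_stats") %>%
  html_table()

n_direct
demo
```

On the real page the same two-pass approach applies; the only differences are the table id (`misc_stats`) and the two-row header that the `setNames()` step above flattens.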
