How to read a commented out HTML table using readHTMLTable in R

Question

In the past, I have been able to use readHTMLTable in R to pull some football stats. When I tried again this year, the tables no longer showed up in the result, even though they are visible on the webpage. Here is an example: http://www.pro-football-reference.com/boxscores/201609080den.htm

When I view the source for the page, the tables are all commented out (which I suspect is why readHTMLTable didn't find them).

Example: search for "team_stats" in the page source...

    <!--  
    <div class="table_outer_container">
    <div class="overthrow table_container" id="div_team_stats">
    <table class="stats_table" id="team_stats" data-cols-to-freeze=1><caption>Team Stats Table</caption>

Questions:

How can the table be commented out in the source yet display in the browser?

Is there a way to read the commented out tables using readHTMLTable (or some other method)?

Recommended answer

You can, in fact, grab it if you use the XPath comment() selector:

    library(rvest)

    url <- 'http://www.pro-football-reference.com/boxscores/201609080den.htm'

    url %>% read_html() %>%                   # parse html
        html_nodes('#all_team_stats') %>%     # select node with comment
        html_nodes(xpath = 'comment()') %>%   # select comments within node
        html_text() %>%                       # return contents as text
        read_html() %>%                       # parse text as html
        html_node('table') %>%                # select table node
        html_table()                          # parse table and return data.frame

    ##                                 CAR           DEN
    ## 1         First Downs            21            21
    ## 2        Rush-Yds-TDs      32-157-1      29-148-2
    ## 3   Cmp-Att-Yd-TD-INT 18-33-194-1-1 18-26-178-1-2
    ## 4        Sacked-Yards          3-18          2-19
    ## 5      Net Pass Yards           176           159
    ## 6         Total Yards           333           307
    ## 7        Fumbles-Lost           0-0           1-1
    ## 8           Turnovers             1             3
    ## 9     Penalties-Yards          8-85          4-22
    ## 10   Third Down Conv.          9-15          5-10
    ## 11  Fourth Down Conv.           0-0           1-1
    ## 12 Time of Possession         32:19         27:41
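
Since the question asks about readHTMLTable specifically, the same idea can probably be carried over to the XML package. The sketch below is not part of the original answer. It runs on a small made-up HTML string rather than the live page (the `all_demo` and `demo` ids and the table values are hypothetical), and it assumes the markup inside the comment is itself valid HTML.

```r
library(XML)

# Hypothetical miniature page: the table sits inside an HTML comment,
# mimicking the structure described in the question.
html <- '<html><body>
<div id="all_demo">
<!--
<table id="demo">
<thead><tr><th>Stat</th><th>Value</th></tr></thead>
<tbody>
<tr><td>First Downs</td><td>10</td></tr>
<tr><td>Turnovers</td><td>2</td></tr>
</tbody>
</table>
-->
</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# The parser sees a comment node, not a table element,
# so a direct search for tables finds nothing.
n_direct <- length(getNodeSet(doc, "//table"))

# Pull the text of the comment that is a child of the wrapper div ...
comment_text <- xpathSApply(doc, "//div[@id='all_demo']/comment()", xmlValue)

# ... re-parse that text as HTML and hand it to readHTMLTable.
inner_doc <- htmlParse(comment_text[1], asText = TRUE)
tables <- readHTMLTable(inner_doc, stringsAsFactors = FALSE)
demo_table <- tables[[1]]

print(n_direct)
print(demo_table)
```

On the real page one would presumably change the XPath to `//div[@id='all_team_stats']/comment()`, reusing the wrapper id from the answer above. That has not been tested here.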
