Extract Website Tables using rvest and html_nodes() and html_table()

Problem description

I'm trying to extract data from the Basketball Reference website.

library(rvest)
data7 <- read_html("http://www.basketball-reference.com/teams/CLE/2017.html") %>%
  html_nodes("[id=roster]") %>%
  html_table()
data7

The code above returns the data in the "roster" table. However, the following code does not return the "team_misc" table but instead returns a list of length zero:

read_html("http://www.basketball-reference.com/teams/CLE/2017.html") %>%
  html_nodes("[id=team_misc]") %>% html_table()

I'm fairly new to rvest, so if anyone has any ideas why this does not work, it would be greatly appreciated.

Recommended answer

There is actually already an answer to this, but it applies to an older version of the website. The reason you cannot get the other tables is that they are created dynamically: in the raw page that R reads, the tables you want sit inside commented-out strings, so they are not table nodes that html_nodes() can match. Inspect the page's elements in Chrome to see what I am referring to. The other answer is here: How to scrape tables inside a comment tag in html with R?
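To make the cause concrete, here is a minimal sketch on a made-up page. The HTML string is a hypothetical stand-in, not the real site markup; it only assumes that one table is live and the other is wrapped in a comment.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in page: "roster" is a live table, "team_misc" sits inside a comment
page <- read_html('
<html><body>
  <div><table id="roster"><tr><th>Player</th></tr><tr><td>A</td></tr></table></div>
  <div><!--
    <table id="team_misc"><tr><th>W</th><th>L</th></tr><tr><td>1</td><td>2</td></tr></table>
  --></div>
</body></html>')

n_roster <- length(html_nodes(page, "[id=roster]"))    # the live table is matched
n_misc   <- length(html_nodes(page, "[id=team_misc]")) # the commented table is not a node

# The markup still exists, but only as the text of a comment node
comments <- xml_text(xml_find_all(page, "//comment()"))
found_in_comment <- any(grepl("team_misc", comments, fixed = TRUE))
```

The selector itself is fine; it finds nothing because the parser treats everything between the comment markers as plain text.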

But for your year data:

library(rvest)
library(xml2) # xml_find_all() and xml_text() come from xml2

A <- read_html('http://www.basketball-reference.com/teams/CLE/2017.html') %>% # Read in the raw webpage
  xml_find_all('//comment()') %>% # Use XPath to find all comment nodes
  xml_text() %>% # Extract the comment contents as strings
  paste0(collapse = "") %>% # Collapse them into a single string
  read_html() %>% # Re-parse that string as HTML
  xml_find_all("//table") %>% # Find the tables that were inside the comments
  html_table()

cat(capture.output(lapply(A, head, 1)), sep = "\n")


[[1]]
                   Date Type                                                                                       Note
1 Kevin Love 2017-02-12 Knee Love is expected to miss six weeks after undergoing arthroscopic surgery on his left knee.

[[2]]
            X1                X2
1 Jim Boylan   Assistant Coach

[[3]]
        G    MP   FG  FGA   FG%  3P  3PA  3P%   2P  2PA   2P%   FT  FTA   FT% ORB  DRB  TRB  AST STL BLK TOV   PF  PTS
1 Team 58 14020 2305 4938 0.467 761 1952 0.39 1544 2986 0.517 1073 1420 0.756 564 1988 2552 1304 414 237 804 1033 6444

[[4]]
   NA NA NA NA  NA  NA  NA   NA   NA   NA Advanced   NA Offense Four Factors   NA   NA     NA Defense Four Factors   NA   NA     NA               NA
1   W  L PW PL MOV SOS SRS ORtg DRtg Pace      FTr 3PAr                 eFG% TOV% ORB% FT/FGA                 eFG% TOV% DRB% FT/FGA Arena Attendance

[[5]]
  Rk              Age  G GS   MP  FG  FGA   FG%  3P 3PA   3P%  2P  2PA   2P%  eFG%  FT FTA   FT% ORB DRB TRB AST STL BLK TOV  PF PTS/G
1  1 LeBron James  32 54 54 37.5 9.6 17.7 0.541 1.7 4.4 0.387 7.9 13.3 0.592 0.589 4.8 6.9 0.691 1.1 6.7 7.9 8.9 1.4 0.6 4.3 1.7  25.7

[[6]]
  Rk              Age  G GS   MP  FG FGA   FG% 3P 3PA   3P%  2P 2PA   2P%  eFG%  FT FTA   FT% ORB DRB TRB AST STL BLK TOV PF  PTS
1  1 LeBron James  32 54 54 2026 518 957 0.541 92 238 0.387 426 719 0.592 0.589 259 375 0.691  62 363 425 479  74  32 230 92 1387

[[7]]
  Rk              Age  G GS   MP  FG FGA   FG%  3P 3PA   3P%  2P  2PA   2P%  FT FTA   FT% ORB DRB TRB AST STL BLK TOV  PF  PTS
1  1 LeBron James  32 54 54 2026 9.2  17 0.541 1.6 4.2 0.387 7.6 12.8 0.592 4.6 6.7 0.691 1.1 6.5 7.6 8.5 1.3 0.6 4.1 1.6 24.6

[[8]]
  Rk              Age  G GS   MP   FG  FGA   FG%  3P 3PA   3P%   2P  2PA   2P%  FT FTA   FT% ORB DRB  TRB  AST STL BLK TOV  PF PTS    ORtg DRtg
1  1 LeBron James  32 54 54 2026 12.7 23.4 0.541 2.3 5.8 0.387 10.4 17.6 0.592 6.3 9.2 0.691 1.5 8.9 10.4 11.7 1.8 0.8 5.6 2.3  34 NA  118  107

[[9]]
  Rk              Age  G   MP  PER   TS%  3PAr   FTr ORB% DRB% TRB% AST% STL% BLK% TOV% USG%    OWS DWS  WS WS/48    OBPM DBPM BPM VORP
1  1 LeBron James  32 54 2026 26.3 0.618 0.249 0.392  3.5 19.1 11.6 41.7  1.8  1.3   17 29.4 NA 6.9 2.4 9.3  0.22 NA  6.3  1.8   8  5.1

[[10]]
     NA   NA   NA   NA   NA   NA                   NA   NA   NA   NA NA   NA              NA   NA   NA   NA NA   NA 2-Pt Field Goals    NA   NA 3-Pt Field Goals     NA
1  <NA> <NA> <NA> <NA> <NA> <NA> % of FGA by Distance <NA> <NA> <NA> NA <NA> FG% by Distance <NA> <NA> <NA> NA <NA>                  Dunks <NA>                  Corner
    NA     NA   NA
1 <NA> Heaves <NA>

[[11]]
  Rk                   Salary
1  1 LeBron James $30,963,450

[[12]]
                           Yr  Tm Rd Pk             Team     G  MP FG FGA   FG% 3P 3PA 3P% FT FTA   FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 Vladimir Veremeenko NA 2006 WAS  2 48 NA Reggio Emilia it 18 139 17  29 0.586  0   0  NA  4   9 0.444  14  10  24   8   2   3   9 33  38
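
If only one table is needed, such as "team_misc" from the question, it may be simpler to select it by id instead of indexing into the list. The following is a sketch, not a tested recipe for the live site: `commented_table()` is a hypothetical helper, and the example page and its values are made up.

```r
library(rvest)
library(xml2)

# Hypothetical helper: return the table with the given id from inside the page's comments,
# or NULL if no comment contains such a table
commented_table <- function(page, id) {
  comments <- xml_text(xml_find_all(page, "//comment()"))
  # Keep only comments that contain table markup and mention the id
  keep <- grepl("<table", comments, fixed = TRUE) & grepl(id, comments, fixed = TRUE)
  if (!any(keep)) return(NULL)
  # Re-parse the comment text as HTML and look the table up by its id attribute
  doc  <- read_html(paste0(comments[keep], collapse = ""))
  node <- xml_find_first(doc, sprintf("//table[@id='%s']", id))
  if (inherits(node, "xml_missing")) return(NULL)
  html_table(node)
}

# Made-up stand-in page; the numbers are placeholders, not real statistics
page <- read_html('<html><body><!--
  <table id="team_misc"><tr><th>W</th><th>L</th></tr><tr><td>1</td><td>2</td></tr></table>
--></body></html>')

misc <- commented_table(page, "team_misc")
missing_tbl <- commented_table(page, "no_such_table")
```

On the real page the call would be `commented_table(read_html("http://www.basketball-reference.com/teams/CLE/2017.html"), "team_misc")`, assuming the site still wraps that table in a comment under the same id. As element [[4]] above suggests, that table has a two-row header, so the column names would likely need cleaning afterwards.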
