Scrape a URL with several tables with Rvest


Problem description

I am trying to learn how to do some scraping with the rvest package. I'm loading the page at https://www.basketball-reference.com/players/l/leonaka01.html and trying to get the table marked as "Advanced" on that page.

When I try to load the information, all I am able to get is the first table. When I inspect the page in Google Chrome, I see that the numbers in the tables are marked with class="right", so this is what I tried:

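library(rvest)
library(stringr)

# Open a connection to the player page
url = url("https://www.basketball-reference.com/players/l/leonaka01.html")

# Select every node with class "right" (the numeric table cells)
read = html_nodes(read_html(url), '.right')

# Strip carriage returns, newlines and tabs from the cell text
read2 = str_replace_all(html_text(read), "[\r\n\t]", "")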

What I see is that read is a list of 351 values, so rvest detected 351 nodes with class right. If I look at the last one, read2[351], I see "29.3", which is the last value of the first table.

So how can I get the information in the other tables? I never told R to fetch only the first table; I assumed I would get the values from all the tables, and that my next step would be to filter the "Advanced" table values somehow.

Regards

Solution

The "Advanced" table is wrapped in an HTML comment in the page source, so it is not part of the parsed document tree and selectors cannot reach it directly. We can collect all the comment nodes with XPath and then parse the table out of their text.
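To see why a selector such as '.right' only reaches the visible tables, here is a minimal sketch on a toy page. The inline HTML and the names page_src, visible_ids and hidden_ids are made up for illustration; only the calls to read_html, html_nodes, html_attr and html_text come from rvest.

```r
library(rvest)

# Toy page (made up for illustration): one visible table and one table
# wrapped in an HTML comment, mimicking how the "Advanced" table is stored.
page_src <- '<html><body>
<table id="per_game"><tr><th>G</th><th>PTS</th></tr><tr><td>64</td><td>7.9</td></tr></table>
<!--
<table id="advanced"><tr><th>G</th><th>PER</th></tr><tr><td>64</td><td>16.6</td></tr></table>
-->
</body></html>'

page <- read_html(page_src)

# Tables that are real elements in the parsed tree: only the visible one.
visible_ids <- html_attr(html_nodes(page, "table"), "id")

# The commented-out markup survives only as the text of a comment node.
comment_text <- html_text(html_nodes(page, xpath = "//comment()"))

# Parsing that text as HTML turns the hidden table into real elements.
hidden <- read_html(paste(comment_text, collapse = ""))
hidden_ids <- html_attr(html_nodes(hidden, "table"), "id")

print(visible_ids)
print(hidden_ids)
```

On this toy input, visible_ids should contain only the id of the uncommented table, while hidden_ids should contain the id of the commented one. The same idea applied to the real page: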

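library(rvest)
url = "https://www.basketball-reference.com/players/l/leonaka01.html"

url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # select every HTML comment node
  html_text() %>%                         # extract the raw text of each comment
  toString() %>%                          # collapse the texts into one string
  read_html() %>%                         # parse that string as HTML
  html_node('table#advanced') %>%         # pick the table with id "advanced"
  html_table()                            # convert it to a data frame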

#      Season Age  Tm  Lg Pos   G    MP  PER   TS%  3PAr   FTr ORB% ...
#1    2011-12  20 SAS NBA  SF  64  1534 16.6 0.573 0.270 0.218  7.9 ...
#2    2012-13  21 SAS NBA  SF  58  1810 16.4 0.592 0.331 0.240  4.3 ...
#3    2013-14  22 SAS NBA  SF  66  1923 19.4 0.602 0.282 0.195  4.6 ...
#4    2014-15  23 SAS NBA  SF  64  2033 22.0 0.567 0.234 0.307  4.8 ...
#5    2015-16  24 SAS NBA  SF  72  2380 26.0 0.616 0.267 0.306  4.7 ...
#6    2016-17  25 SAS NBA  SF  74  2474 27.6 0.610 0.295 0.406  3.7 ...
#7    2017-18  26 SAS NBA  SF   9   210 26.0 0.572 0.315 0.342  3.1 ...
#8    2018-19  27 TOR NBA  SF  60  2040 25.8 0.606 0.267 0.377  4.2 ...
#9    2019-20  28 LAC NBA  SF   6   183 35.1 0.572 0.230 0.319  5.5 ...
#10    Career  NA     NBA     473 14587 22.8 0.599 0.276 0.318  4.8 ...
#11            NA              NA    NA   NA    NA    NA    NA   NA ...
#12 7 seasons  NA SAS NBA     407 12364 22.1 0.597 0.279 0.305  4.8 ...
#13  1 season  NA TOR NBA      60  2040 25.8 0.606 0.267 0.377  4.2 ...
#14  1 season  NA LAC NBA       6   183 35.1 0.572 0.230 0.319  5.5 ...
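
If several of the commented-out tables are needed, the same pipeline can be wrapped in a small function. The following is only a sketch under assumptions: commented_tables is a hypothetical helper name (it is not part of rvest), it assumes each hidden table carries an id attribute, and the inline HTML is invented so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper (not part of rvest): parse every table stored inside
# HTML comments and return the tables as a list named by their id attribute.
# Tables without an id would end up with an NA name.
commented_tables <- function(page) {
  comment_text <- html_text(html_nodes(page, xpath = "//comment()"))
  # Keep only the comments that actually contain table markup.
  comment_text <- comment_text[grepl("<table", comment_text, fixed = TRUE)]
  if (length(comment_text) == 0) return(list())
  hidden <- read_html(paste(comment_text, collapse = "\n"))
  tables <- html_nodes(hidden, "table")
  out <- html_table(tables)
  names(out) <- html_attr(tables, "id")
  out
}

# Made-up page with one ordinary comment and two commented-out tables.
demo_page <- read_html('<html><body>
<!-- a plain remark with no markup -->
<!-- <table id="advanced"><tr><th>PER</th></tr><tr><td>16.6</td></tr></table> -->
<!-- <table id="shooting"><tr><th>FG</th></tr><tr><td>0.5</td></tr></table> -->
</body></html>')

hidden_tables <- commented_tables(demo_page)
print(names(hidden_tables))
```

With the real page, something like commented_tables(read_html(url))[["advanced"]] should correspond to the table shown above, provided the site still stores it inside a comment under that id.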
