CSS selector issue with rvest and NHL statistics

Problem description

I want to scrape data from hockey-reference.com, specifically from this link:

https://www.hockey-reference.com/leagues/NHL_1991.html

I want the fourth table, called "Team Statistics," and I also want to drop its first and last rows (but that can be for another time).

Initially, I want to get the scrape working with the 1991 link, but eventually I want to scrape every link from 1991 to 2017.

library(tidyverse)
library(rvest)

stat_urls <- "https://www.hockey-reference.com/leagues/NHL_1991.html"
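For the eventual 1991 to 2017 range, the URLs could be generated rather than typed out. This is a minimal sketch that assumes every season's page follows the same pattern as the 1991 link; the `years` vector and the naming step are illustrative additions, not part of the original question.

```r
# Seasons named in the question.
years <- 1991:2017

# sprintf() is vectorised, so this yields one URL per year.
# Assumption: every season uses the same URL pattern as the 1991 page.
stat_urls <- sprintf("https://www.hockey-reference.com/leagues/NHL_%d.html", years)

# Name each URL by its year so results can be matched back to a season later.
names(stat_urls) <- years

head(stat_urls, 2)
```

The 2004-05 NHL season was not played, so the 2005 page may not contain a Team Statistics table and is worth handling separately.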

Right now, I have only the 1991 link, for simplicity. I cannot seem to find the correct CSS selector, even though I have tried several after a pretty thorough search using the browser's "inspect" view of the actual webpage. I have tried the following CSS selectors:

table#stats.sortable.stats_table.now.sortable
#stats
#all_stats
#all_stats > div.table_outer_container
#stats
#stats > tbody
#div_stats (and all sorts of combos with this one)

None of these work when used in the following code:

team_stats <- stat_urls %>%
  read_html() %>%
  html_nodes("#stats") %>%
  html_table(header = TRUE)

All attempts with "xpath=" also failed. Any help with this would be absolutely phenomenal, and Go Preds!
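A possible explanation, stated here as an assumption rather than something confirmed in this thread: Sports Reference pages are often reported to ship some tables inside HTML comments and reveal them with JavaScript. If that applies to this page, `#stats` would match nothing in the raw HTML that `read_html()` sees, even though the browser inspector shows the table. The sketch below reproduces the effect on a small, made-up HTML string (the `html` value and its contents are hypothetical) and shows one way to parse the commented markup.

```r
library(rvest)

# Hypothetical stand-in for the page: the table sits inside an HTML comment.
html <- '<html><body><div id="all_stats"><!--
<table id="stats">
<thead><tr><th>Team</th><th>GP</th></tr></thead>
<tbody><tr><td>Team A</td><td>80</td></tr></tbody>
</table>
--></div></body></html>'

page <- read_html(html)

# The selector finds nothing, because a comment is not part of the element tree.
n_direct <- length(html_nodes(page, "#stats"))

# Pull the text of every comment and parse it as HTML in its own right.
commented <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "")

stats <- read_html(commented) %>%
  html_node("#stats") %>%
  html_table()

stats
```

Whether the real page behaves this way should be checked against its raw source (view source, not the inspector) before relying on this approach.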

Recommended answer

You can try using RSelenium. I saw a similar answer here: Web Scraping Basketball Reference using R.

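library(rvest)
library(RSelenium)

# The original answer called startServer() here. That function is defunct in
# recent RSelenium releases, so start a Selenium server separately (see the
# commands further down) before connecting.
remDr <- remoteDriver(browserName = "chrome")
remDr$open()

# Let the browser render the page, then hand the rendered source to rvest.
remDr$navigate("https://www.hockey-reference.com/leagues/NHL_1991.html")
page <- read_html(remDr$getPageSource()[[1]])
tables <- html_table(page, fill = TRUE)

# The answer found the wanted table at position 28; verify against the live page.
tables[[28]]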

Installing Selenium is a pain, though. I would try to help with that too, but I installed it a while ago and don't really remember the steps. Good luck.
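Once a table has been retrieved, the trimming that the question mentions (dropping the first and last rows) can be done with negative indexing. This is a small sketch on a made-up data frame; the `team_stats` contents below are hypothetical and only stand in for the scraped table.

```r
# Stand-in for the scraped table: an unwanted first row and a summary last row.
team_stats <- data.frame(
  Team = c("(header row)", "Team A", "Team B", "League Average"),
  GP = c(NA, 80, 80, 80),
  stringsAsFactors = FALSE
)

# Negative indices drop rows; nrow() locates the last one.
trimmed <- team_stats[-c(1, nrow(team_stats)), ]

trimmed
```

The same indexing applies to whatever data frame the scrape returns, as long as the unwanted rows really are the first and last ones.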

From the person who posted the original question:

The above answer worked, but I had to go through Homebrew:

https://brew.sh/

And then I had to use the following code from here: Using Selenium on Mac Chrome.

# download selenium jar
curl -L0 https://selenium-release.storage.googleapis.com/3.9/selenium-server-standalone-3.9.1.jar -o selenium-server-standalone.jar

# install chromedriver
brew install chromedriver

# start chrome driver
brew services start chromedriver
#==> Successfully started `chromedriver` (label:homebrew.mxcl.chromedriver)

# start selenium server
java -jar selenium-server-standalone.jar
#14:38:20.684 INFO - Selenium build info: version: '3.9.1', revision: '63f7b50'
#14:38:20.685 INFO - Launching a standalone Selenium Server on port 4444
