Scraping leaderboard table on golf website in R


Problem description


The PGA Tour's website has a leaderboard page, and I am trying to scrape the main table on it for a project.

library(dplyr)
library(rvest)   # provides html_nodes() and html_table()

leaderboard_table <- xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% 
  html_nodes('table') %>% 
  html_table()

However, instead of pulling the tables, it returns this odd output...

Other pages, such as the schedule page, scrape without any issues (see below). It is only the leaderboard page I am having trouble with.

schedule_url <- 'https://www.pgatour.com/tournaments/schedule.html'
schedule_table <- xml2::read_html(schedule_url) %>% html_nodes('table.table-styled') %>% html_table()
schedule_df <- schedule_table[[1]]
# this works fine

Edit before bounty: the answer below is a helpful start; however, there is a problem. The JSON file's name changes based on the round (/r/003 for the 3rd round), and probably based on other aspects of the golf tournament as well (see the small sketch after the questions below). Currently, this is what I see in the Elements tab:

...however, using the leaderboard URL link to the .json file https://lbdata.pgatour.com/2021/r/005/leaderboard.json does not help... instead, I receive this error when using jsonlite::fromJSON

Two questions then:

  1. Is it possible to read this .JSON file into R (perhaps it is protected in some way)? Maybe it is just an issue on my end, or am I missing something else in R here?

  2. Given that the URL changes, how can I dynamically grab the URL value in R? It would be great if I could somehow grab the whole global.leaderboardConfig object, because that would give me access to the leaderboardUrl.

Thanks!!
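
(Sketch referenced above: a minimal, untested illustration of the round-based path pattern. The helper name is made up for illustration, and the pattern alone is not enough, because, as the answer below shows, the server also expects a userTrackingId token.)

# Hypothetical helper that only reproduces the /r/003-style path pattern;
# it does not produce a working request on its own.
leaderboard_json_url <- function(year, round) {
  sprintf("https://lbdata.pgatour.com/%d/r/%03d/leaderboard.json", year, round)
}

leaderboard_json_url(2021, 3)
# [1] "https://lbdata.pgatour.com/2021/r/003/leaderboard.json"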

Solution

As already mentioned, this page is dynamically generated by some JavaScript.
Even the JSON file address seems to be dynamic, and the address you're trying to open isn't valid anymore:

https://lbdata.pgatour.com/2021/r/003/leaderboard.json?userTrackingId=exp=1612495792~acl=*~hmac=722f704283f795e8121198427386ee075ce41e93d90f8979fd772b223ea11ab9

An error occurred while processing your request.

Reference #199.cf05d517.1613439313.4ed8cf21 

To get the data, you could use RSelenium after installing a Docker Selenium server (see https://docs.ropensci.org/RSelenium/articles/docker.html).
The installation is straightforward, and Docker is designed to make images work out of the box.

After Docker installation, running the Selenium server is as simple as:

docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0

Note that this as a whole requires over 2 GB of disk space.

Selenium emulates a web browser and, among other things, allows you to get the final HTML content of the page after the JavaScript has been rendered:

library(RSelenium)
library(rvest)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
# Open connection to the Selenium server
remDr$open()
remDr$getStatus()

remDr$navigate("https://www.pgatour.com/leaderboard.html")

# Parse the rendered page source and pull the player-name column
players <- xml2::read_html(remDr$getPageSource()[[1]]) %>% 
                 html_nodes(".player-name-col")   %>% 
                 html_text()

# Same source, total-score column
total <- xml2::read_html(remDr$getPageSource()[[1]]) %>% 
               html_nodes(".total") %>%
               html_text()

# Drop the first ".total" match so the column lines up with the players
data.frame(players = players, total = total[-1])

                     players total
1        Daniel Berger  (PB)   -18
2     Maverick McNealy  (PB)   -16
3      Patrick Cantlay  (PB)   -15
4        Jordan Spieth  (PB)   -15
5           Paul Casey  (PB)   -14
6         Nate Lashley  (PB)   -14
7      Charley Hoffman  (PB)   -13
8     Cameron Tringale  (PB)   -13
...

As the table doesn't use the table tag, html_table() doesn't work, and the columns need to be extracted individually.
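
Regarding the second question (dynamically grabbing the leaderboard URL): one possible approach, sketched below and not tested against the live page, is to reuse the page source already rendered by the same Selenium session and search it for the lbdata.pgatour.com address. The regular expression is an assumption about how the URL (and its userTrackingId token) is embedded in the page, so adjust it to whatever you see in the Elements tab.

# Untested sketch: look for the leaderboard.json URL in the rendered source
src <- remDr$getPageSource()[[1]]

lb_url <- regmatches(
  src,
  regexpr("https://lbdata\\.pgatour\\.com/[^\"']*leaderboard\\.json[^\"']*", src)
)
lb_url

# If the captured URL still carries a valid userTrackingId token, it should
# be readable directly (assumption; the token appears to expire):
# leaderboard <- jsonlite::fromJSON(lb_url)

# When finished, the browser session can be closed with:
# remDr$close()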
