Extracting an HTML table from a website in R
Question
Hi, I am trying to extract the table from the premierleague website.
The package I am using is rvest, and the code I am using in the initial phase is as follows:
library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")
I couldn't find an HTML tag that works with rvest's html_nodes to extract the table.
I used a similar approach to extract data from "http://admissions.calpoly.edu/prospective/profile.html" and was able to get the data. The code I used for calpoly is as follows:
library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")
CPadmissions %>% html_nodes("table") %>%
.[[1]] %>%
html_table()
I got the code above from YouTube through this link: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien
Any help on getting data from fantasy.premierleague.com is highly appreciated. Do I need to use some kind of API?
Recommended answer
Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want. If you use PhantomJS as a headless browser within RSelenium, though, it's not all that complicated (by RSelenium standards):
library(RSelenium)
library(rvest)
# initialize browser and driver with RSelenium
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()
# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]
# clean up
rd$close()
ptm$stop()
# parse with rvest
df <- html %>% read_html() %>%
    html_node('#ismr-event-history table.ism-table') %>%
    html_table() %>%
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names: drop the leading token
    setNames(gsub('\\s', '_', names(.)))    # replace remaining whitespace with underscores
str(df)
## 'data.frame': 20 obs. of 10 variables:
## $ Gameweek : chr "GW1" "GW2" "GW3" "GW4" ...
## $ Gameweek_Points : int 34 47 53 51 66 66 65 63 48 90 ...
## $ Points_Bench : int 1 6 9 7 14 2 9 3 8 2 ...
## $ Gameweek_Rank : chr "2,406,373" "2,659,789" "541,258" "905,524" ...
## $ Transfers_Made : int 0 0 2 0 3 2 2 0 2 0 ...
## $ Transfers_Cost : int 0 0 0 0 4 4 4 0 0 0 ...
## $ Overall_Points : chr "34" "81" "134" "185" ...
## $ Overall_Rank : chr "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
## $ Value : chr "£100.0" "£100.0" "£99.9" "£100.0" ...
## $ Change_Previous_Gameweek: logi NA NA NA NA NA NA ...
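To make the column-name cleanup easier to follow, here is a minimal sketch of the two gsub() calls run on their own. The raw header strings are hypothetical (an abbreviation followed by the full label); the real headers on the page may differ. In R string literals the regex backslashes have to be doubled.

```r
# Hypothetical raw headers: a short abbreviation followed by the full label.
# These are assumptions for illustration, not the page's actual headers.
raw_names <- c("GW Gameweek", "GP Gameweek Points", "PB Points Bench")

# Step 1: replace each "token whitespace token" pair with its second token,
# which drops the leading abbreviation.
step1 <- gsub("\\S+\\s+(\\S+)", "\\1", raw_names)

# Step 2: replace any remaining whitespace with underscores.
clean_names <- gsub("\\s", "_", step1)

print(clean_names)
## [1] "Gameweek"        "Gameweek_Points" "Points_Bench"
```

Because gsub() applies the pattern repeatedly, a header with four or more words would also lose its third word; sub() would strip only the leading token.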
As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can calculate those anyway.
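As a rough illustration of that cleaning step, the sketch below rebuilds the first four rows of a few columns from the str() output above, converts the character columns with parse_number, and then computes a rank change by hand. It assumes dplyr and readr are installed; the Rank_Change column name is hypothetical, and this is only one possible way to recover what the arrow images showed.

```r
library(dplyr)
library(readr)

# First four rows of selected columns, copied from the str() output above
df <- data.frame(
  Gameweek       = c("GW1", "GW2", "GW3", "GW4"),
  Overall_Points = c("34", "81", "134", "185"),
  Overall_Rank   = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
  Value          = c("£100.0", "£100.0", "£99.9", "£100.0"),
  stringsAsFactors = FALSE
)

clean <- df %>%
  # parse_number drops non-numeric prefixes and grouping commas
  mutate_if(is.character, parse_number) %>%
  # hypothetical column: change in overall rank versus the previous gameweek
  # (negative means the rank improved)
  mutate(Rank_Change = Overall_Rank - lag(Overall_Rank))

str(clean)
```

Note that parse_number also turns "GW1" into 1, which is convenient here but discards the "GW" prefix.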