Stumped on how to scrape the data from this site (using R)
Question
I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#
I can do the following:
library(rvest)
# read_html() replaces html(), which is deprecated in newer rvest versions
doc <- read_html("http://www.soccer24.com/kosovo/superliga/results/")
but I am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is
html_text(doc)
but that gives a long blurb of weird text (which does include the data, but interspersed with odd code), and it is not at all clear how I would parse it.
What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.
Can anyone provide some hints as to how to extract that data from this site?
Recommended answer
Use Selenium and phantomjs:
library(RSelenium)
library(XML)  # provides htmlParse, removeNodes and readHTMLTable used below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)
If you want to press the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):
webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while(webElem$isElementDisplayed()[[1]]){
  webElem$clickElement()
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
doc <- htmlParse(remDr$getPageSource()[[1]])
Remove the unwanted round data and use XML::readHTMLTable for simplicity:
# remove unwanted rounds html. Sometimes there are end of season extra games.
# These are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()
> head(appData)
blank Date hteam ateam score
1 01.04. 18:00 Ferronikeli Ferizaj 4 : 0
2 01.04. 18:00 Istogu Hajvalia 2 : 1
3 01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4 01.04. 18:00 Prishtina Drenica 3 : 0
5 31.03. 18:00 Besa Peje Drita 1 : 0
6 31.03. 18:00 Trepca 89 Vellaznimi 2 : 0
> tail(appData)
blank Date hteam ateam score
115 17.08. 22:00 Besa Peje Trepca 89 3 : 3
116 17.08. 22:00 Ferronikeli Hajvalia 2 : 5
117 17.08. 22:00 Trepca Mitrovice Ferizaj 1 : 0
118 17.08. 22:00 Vellaznimi Drenica 2 : 1
119 16.08. 22:00 Kosova Vushtrri Drita 0 : 1
120 16.08. 22:00 Prishtina Istogu 2 : 1
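To make the node removal and table read above more concrete, here is a minimal sketch on a small made-up HTML fragment rather than the live page. The `event_round` class is taken from the answer; the "Round 1" label and the table layout are assumptions for illustration only, and the XML package is assumed to be installed.

```r
library(XML)  # assumed installed; provides htmlParse, getNodeSet, removeNodes, readHTMLTable

# Made-up fragment: one round-header row followed by two match rows.
html <- paste0(
  "<table><tbody>",
  "<tr class='event_round'><td colspan='4'>Round 1</td></tr>",
  "<tr><td>01.04. 18:00</td><td>Ferronikeli</td><td>Ferizaj</td><td>4 : 0</td></tr>",
  "<tr><td>01.04. 18:00</td><td>Istogu</td><td>Hajvalia</td><td>2 : 1</td></tr>",
  "</tbody></table>"
)
doc <- htmlParse(html, asText = TRUE)

# Drop the round-header rows so that only match rows remain in the table.
invisible(removeNodes(getNodeSet(doc, "//tr[@class='event_round']")))

# Read the first (and only) table; there is no header row in this fragment.
tab <- readHTMLTable(doc, which = 1, header = FALSE, stringsAsFactors = FALSE)
names(tab) <- c("Date", "hteam", "ateam", "score")
```

Removing the round rows before calling readHTMLTable keeps them from appearing as rows in the resulting data frame.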
Carry out further formatting as needed.
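As one possible example of such formatting, the sketch below splits the combined date/time string and the score string into separate columns using base R only. The two rows are copied from the output shown above; the new column names (`day`, `month`, `time`, `hgoals`, `agoals`) are hypothetical, and the year is left out because the scraped strings do not contain one.

```r
# Two rows in the same shape as the scraped table (values copied from the output above).
appData <- data.frame(
  Date  = c("01.04. 18:00", "17.08. 22:00"),
  hteam = c("Ferronikeli", "Besa Peje"),
  ateam = c("Ferizaj", "Trepca 89"),
  score = c("4 : 0", "3 : 3"),
  stringsAsFactors = FALSE
)

# Assumes every Date value looks like "dd.mm. HH:MM".
pattern <- "^([0-9]{2})\\.([0-9]{2})\\. +([0-9]{2}:[0-9]{2})$"
appData$day   <- as.integer(sub(pattern, "\\1", appData$Date))
appData$month <- as.integer(sub(pattern, "\\2", appData$Date))
appData$time  <- sub(pattern, "\\3", appData$Date)

# Assumes every score looks like "<home> : <away>"; split on any run of non-digits.
goals <- strsplit(appData$score, "[^0-9]+")
appData$hgoals <- as.integer(vapply(goals, `[`, character(1), 1))
appData$agoals <- as.integer(vapply(goals, `[`, character(1), 2))
```

Rows that do not match the assumed patterns (for example postponed matches without a score) would need separate handling before the integer conversion.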