Stumped on how to scrape the data from this site (using R)

Question

I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#

I can do the following:

library(rvest)
doc <- read_html("http://www.soccer24.com/kosovo/superliga/results/")  # html() is deprecated in newer rvest versions

but I am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is

html_text(doc)

but that gives a long blurb of weird text (which does include the data, but interspersed with odd code), and it's not at all clear how I would parse that.

What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.

Can anyone provide some hints as to how to extract that data from this site?

Recommended answer

Using Selenium with phantomjs:

library(RSelenium)
library(XML)  # provides htmlParse, readHTMLTable and removeNodes used below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)

If you want to press the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):

webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while(webElem$isElementDisplayed()[[1]]){
  webElem$clickElement()
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
doc <- htmlParse(remDr$getPageSource()[[1]])

Remove the unwanted round data and use XML::readHTMLTable for simplicity:

# Remove the unwanted round rows from the HTML. Sometimes there are extra end-of-season games.
# These are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()

> head(appData)
  blank         Date           hteam            ateam score
1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
2       01.04. 18:00          Istogu         Hajvalia 2 : 1
3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4       01.04. 18:00       Prishtina          Drenica 3 : 0
5       31.03. 18:00       Besa Peje            Drita 1 : 0
6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0

> tail(appData)
    blank         Date            hteam     ateam score
115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
119       16.08. 22:00  Kosova Vushtrri     Drita 0 : 1
120       16.08. 22:00        Prishtina    Istogu 2 : 1
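
The `is.data.frame` check in the code above appears to guard against `readHTMLTable` returning a list of data frames (one per table) rather than a single data frame. Below is a minimal base R sketch of that stacking step. The two small tables, `table_a` and `table_b`, are hypothetical stand-ins built by hand for illustration, not scraped output.

```r
# Hypothetical stand-ins for the list of tables readHTMLTable may return
table_a <- data.frame(V1 = "", V2 = "01.04. 18:00", V3 = "Ferronikeli",
                      V4 = "Ferizaj", V5 = "4 : 0", stringsAsFactors = FALSE)
table_b <- data.frame(V1 = "", V2 = "16.08. 22:00", V3 = "Prishtina",
                      V4 = "Istogu", V5 = "2 : 1", stringsAsFactors = FALSE)
tables <- list(table_a, table_b)

# Stack the pieces only when the result is not already a single data frame
combined <- if (is.data.frame(tables)) tables else do.call(rbind, tables)

# Reset row names and apply the same column names as in the answer
row.names(combined) <- NULL
names(combined) <- c("blank", "Date", "hteam", "ateam", "score")
print(combined)
```

Note that `rbind` only works here because every table has the same columns; tables with different layouts would need to be handled separately.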

Carry out further formatting as needed.
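
As one possible example of such formatting, the sketch below drops the empty first column and splits the `Date` and `score` strings into separate fields. It uses two rows copied from the `head(appData)` output, and the new column names (`day_month`, `time`, `hgoals`, `agoals`) are made up for illustration. It assumes every score has the plain "h : a" form; the page shows no year, so no full date object is built here.

```r
# Sample rows copied from the head(appData) output
appData <- data.frame(
  blank = c("", ""),
  Date  = c("01.04. 18:00", "31.03. 18:00"),
  hteam = c("Ferronikeli", "Besa Peje"),
  ateam = c("Ferizaj", "Drita"),
  score = c("4 : 0", "1 : 0"),
  stringsAsFactors = FALSE
)

# Drop the empty first column
appData$blank <- NULL

# Split "dd.mm. HH:MM" into a day/month part and a kick-off time
date_parts <- strsplit(appData$Date, " ", fixed = TRUE)
appData$day_month <- vapply(date_parts, `[`, character(1), 1)
appData$time      <- vapply(date_parts, `[`, character(1), 2)

# Split "h : a" into integer goal counts for the home and away teams
goals <- strsplit(appData$score, ":", fixed = TRUE)
appData$hgoals <- as.integer(trimws(vapply(goals, `[`, character(1), 1)))
appData$agoals <- as.integer(trimws(vapply(goals, `[`, character(1), 2)))

print(appData)
```

Scores in any other form (for example, with an annotation after the result) would need extra handling before the integer conversion.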
