How to scrape a live JavaScript webpage in R?


Question


I would like to scrape the play by play from http://stats.statbroadcast.com/statmonitr/?id=107165. The link brings you to the "Split Box" tab. I am interested in scraping the "Play by Play" tab as well as the home stats and visitor stats tabs. One of the problems is that the URL never changes, no matter which tab you switch to. If you use SelectorGadget, the CSS selector for the main content of every tab is also the same, namely "#stats". I am a novice at web scraping. Most of the time I can successfully scrape an HTML page with the package rvest, but I am lost as to how to proceed with JavaScript. I have heard of JSON, but I am not sure how to deal with all the tabs having the same URL.

My main goal is to be able to scrape the play by play, home stats, and visitor stats tab when the game is live.

Any help would be much appreciated. Please let me know if I should provide more info.

Solution

You can use RSelenium to do that as follows:

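require(RSelenium)
# Start a local Selenium server. Note: startServer() is defunct in newer
# RSelenium releases; there, rsDriver() or a Docker-hosted server is used instead.
RSelenium::startServer()
# Connect to the server and open a browser session
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://stats.statbroadcast.com/statmonitr/?id=107165")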

A Firefox window should now open, and you can browse in it just like normal. remDr$getPageSource() returns a list whose first element is the source code of the page as the browser currently shows it, that is, after the JavaScript has run. You can use rvest to scrape this source as follows:

doc <- remDr$getPageSource()[[1]]
require(rvest)
current_doc <- read_html(doc)
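
Once the page source has been parsed, the usual rvest selectors apply to it. Here is a minimal sketch that uses a made-up HTML string in place of the real page source. The table layout inside "#stats" is an assumption for illustration and is not taken from the actual site.

```r
library(rvest)

# Stand-in for remDr$getPageSource()[[1]]; this markup is hypothetical.
doc <- '<html><body><div id="stats"><table>
  <tr><th>Time</th><th>Play</th></tr>
  <tr><td>19:45</td><td>Jumper made</td></tr>
  <tr><td>19:20</td><td>Defensive rebound</td></tr>
</table></div></body></html>'

current_doc <- read_html(doc)

# Select the shared "#stats" container, then the first table inside it.
stats_node <- html_element(current_doc, "#stats")
plays <- html_table(html_element(stats_node, "table"))

print(plays)
```

If the real "#stats" content is not a table, html_elements() and html_text() on the relevant child nodes would take the place of html_table().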

If you want to automate the "browsing", you can, for example, navigate to the "Play by Play" tab as follows:

webElem <- remDr$findElement(using = "css selector", '#bb_b6')
remDr$mouseMoveToLocation(webElement = webElem)
remDr$click(1)

At the end, close the remote driver and shut down the Selenium server:

#shutdown
remDr$close()
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")

For more details see: https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html

Edit: current_doc captures the website as it is at the moment you execute doc <- remDr$getPageSource()[[1]]. It is NOT a live link to the page. It is a one-time snapshot, so you have to fetch the page source again whenever you want updated content.
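
Because each call yields only a snapshot, following a live game means fetching the source repeatedly. Here is a rough sketch of such a polling loop. The function name poll_snapshots and its parameters are made up for illustration, and the source is passed in as a function so the loop itself does not depend on an open browser session.

```r
# Take `n` snapshots, pausing `interval` seconds between them.
# `get_source` is any zero-argument function returning the HTML as a string,
# e.g. function() remDr$getPageSource()[[1]] for an open RSelenium session.
poll_snapshots <- function(get_source, n = 5, interval = 30) {
  snapshots <- vector("list", n)
  for (i in seq_len(n)) {
    # Record when the snapshot was taken alongside the raw HTML
    snapshots[[i]] <- list(time = Sys.time(), html = get_source())
    if (i < n) Sys.sleep(interval)
  }
  snapshots
}

# Usage with a live session (assumes remDr is open):
# snaps <- poll_snapshots(function() remDr$getPageSource()[[1]], n = 10, interval = 60)
```

Each stored html string can then be parsed with read_html() in the same way as doc.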

If you want to scrape "Period I", do as follows: first navigate to "Play by Play" (as shown above), then call Sys.sleep(3) to give the website time to load, then navigate to "Period I" the same way you navigated to "Play by Play", just with a different CSS selector.

Have a look at your remote driver (that is, the browser window you control) to check whether you have arrived at the "Period I" view.

Once you have arrived, execute doc <- remDr$getPageSource()[[1]] and analyse the content.
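
The click, wait and read sequence can be wrapped in a small helper. This is only a sketch: scrape_tab is a made-up name, the fixed pause is a crude substitute for checking that the page has actually updated, and it clicks via the element's clickElement() method instead of the mouseMoveToLocation() and click() pair shown above.

```r
library(rvest)

# Click the element matched by `selector`, wait, and return the parsed page.
# `remDr` is an open RSelenium remoteDriver; `wait` is a fixed pause in seconds.
scrape_tab <- function(remDr, selector, wait = 3) {
  webElem <- remDr$findElement(using = "css selector", value = selector)
  webElem$clickElement()
  Sys.sleep(wait)  # crude wait for the JavaScript to refresh "#stats"
  read_html(remDr$getPageSource()[[1]])
}

# Usage, assuming the session from above is still open:
# pbp_doc <- scrape_tab(remDr, "#bb_b6")  # "Play by Play" tab
# For "Period I", pass whatever CSS selector you find for that button.
```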
