Scraping Javascript-rendered webpage that references external javascript scripts in R


Question

I am trying to scrape this webpage: https://www.mustardbet.com/sports/events/302698

Since the webpage seems to be rendered dynamically, I am following this tutorial: https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r#gs.dZEqev8

As the tutorial suggests, I saved a file named "scrape_mustard.js" with the following code:

// scrape_mustard.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'mustard.html'

page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
  var content = page.content;
  fs.write(path,content,'w')
  phantom.exit();
});

Then I run:

system("./phantomjs scrape_mustard.js")

But I get the error:

ReferenceError: Can't find variable: Set

  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1

Now, when I paste "https://www.mustardbet.com/assets/js/index.dfd873fb.js" into my browser, I can see that it's javascript, and that I probably need to either (1) save it as a file, or (2) include it in scrape_mustard.js.

But if (1), I don't know how to reference that new file, and if (2), I don't know how to define all that javascript properly so that it can be used.

I'm a complete newbie to javascript, but maybe this problem is not too difficult?

Thanks for your help!

Recommended answer

I was able to scrape the page using the Node.js module puppeteer.

Download and install node.js. It comes with npm, which makes installing modules easier. You need to install puppeteer using npm.

In RStudio, make sure you are in your working directory when you install puppeteer. Once node.js is installed, run:

system("npm i puppeteer")

scrape_mustard.js:

// load modules
const fs = require("fs");
const puppeteer = require("puppeteer");

// page url
const url = "https://www.mustardbet.com/sports/events/302698";

const scrape = async () => {
    const browser = await puppeteer.launch({headless: false}); // open browser
    const page = await browser.newPage(); // open new page
    // go to page; timeout: 0 disables the navigation timeout
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0});
    // give the javascript-rendered content 5 seconds to load
    // (page.waitFor() has been removed from recent puppeteer releases)
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const html = await page.content(); // copy page contents
    await browser.close(); // close chromium
    return html; // return the html string
};

scrape()
    .then((value) => {
        // write the html returned by scrape()
        fs.writeFileSync("./stackoverflow/page.html", value);
    })
    .catch((err) => {
        console.error(err); // report launch or navigation failures
        process.exit(1);
    });
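
A fixed five-second pause can be too short on a slow connection and wasteful on a fast one. As a hedged alternative, the sketch below waits until an element matching a selector appears in the page. `page.goto`, `page.waitForSelector`, and `page.content` are puppeteer methods; the helper name `renderedHtml`, the 30-second timeout, and the choice of `.odds-major` as the selector to wait for are assumptions made for illustration.

```javascript
// Hypothetical helper: return the page HTML once `selector` is present.
// `page` is expected to be a puppeteer Page (e.g. from browser.newPage()).
async function renderedHtml(page, url, selector, timeoutMs = 30000) {
    await page.goto(url, {waitUntil: "networkidle2"});
    // resolves once a matching element is in the DOM, rejects on timeout
    await page.waitForSelector(selector, {timeout: timeoutMs});
    return page.content();
}
```

Inside `scrape()`, a call such as `const html = await renderedHtml(page, url, ".odds-major");` could stand in for the goto, pause, and content lines.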

To run scrape_mustard.js from R:

library(magrittr)

system("node ./stackoverflow/scrape_mustard.js")

html <- xml2::read_html("./stackoverflow/page.html")

oddsMajor <- html %>% 
  rvest::html_nodes(".odds-major")

betNames <- html %>% 
  rvest::html_nodes("h3")

Console output:

{xml_nodeset (60)}
 [1] <span class="odds-major">2</span>
 [2] <span class="odds-major">14</span>
 [3] <span class="odds-major">15</span>
 [4] <span class="odds-major">16</span>
 [5] <span class="odds-major">17</span>
 [6] <span class="odds-major">23</span>
 [7] <span class="odds-major">25</span>
 [8] <span class="odds-major">32</span>
 [9] <span class="odds-major">33</span>
[10] <span class="odds-major">39</span>
[11] <span class="odds-major">47</span>
[12] <span class="odds-major">54</span>
[13] <span class="odds-major">55</span>
[14] <span class="odds-major">58</span>
[15] <span class="odds-major">58</span>
[16] <span class="odds-major">64</span>
[17] <span class="odds-major">73</span>
[18] <span class="odds-major">73</span>
[19] <span class="odds-major">92</span>
[20] <span class="odds-major">98</span>
...
> betNames
{xml_nodeset (60)}
 [1] <h3>Charles Howell III</h3>\n
 [2] <h3>Brian Harman</h3>\n
 [3] <h3>Austin Cook</h3>\n
 [4] <h3>J.J. Spaun</h3>\n
 [5] <h3>Webb Simpson</h3>\n
 [6] <h3>Cameron Champ</h3>\n
 [7] <h3>Peter Uihlein</h3>\n
 [8] <h3>Seung-Jae Im</h3>\n
 [9] <h3>Nick Watney</h3>\n
[10] <h3>Graeme McDowell</h3>\n
[11] <h3>Zach Johnson</h3>\n
[12] <h3>Lucas Glover</h3>\n
[13] <h3>Corey Conners</h3>\n
[14] <h3>Luke List</h3>\n
[15] <h3>David Hearn</h3>\n
[16] <h3>Adam Schenk</h3>\n
[17] <h3>Kevin Kisner</h3>\n
[18] <h3>Brian Gay</h3>\n
[19] <h3>Patton Kizzire</h3>\n
[20] <h3>Brice Garnett</h3>\n
...
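
Both node sets contain 60 entries, which suggests one odds value per name. Under that assumption (it is not confirmed by the page's markup), the names and odds could also be read inside the browser and paired by position. `page.$$eval` is a puppeteer method; `pairByIndex` and `extractBets` are hypothetical helper names.

```javascript
// Pair two equal-length arrays by position; throws if the lengths differ.
function pairByIndex(names, odds) {
    if (names.length !== odds.length) {
        throw new Error(
            `length mismatch: ${names.length} names vs ${odds.length} odds`
        );
    }
    return names.map((name, i) => ({name: name, oddsMajor: odds[i]}));
}

// Hypothetical helper: read the text of the <h3> and .odds-major nodes.
// `page` is expected to be a puppeteer Page that has already loaded the URL.
async function extractBets(page) {
    // this function is run by puppeteer inside the page
    const text = (els) => els.map((el) => el.textContent.trim());
    const names = await page.$$eval("h3", text);
    const odds = await page.$$eval(".odds-major", text);
    return pairByIndex(names, odds);
}
```

The result could be written out with `fs.writeFileSync` and `JSON.stringify`, then read from R, instead of parsing the saved HTML.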

I am sure it can be done with phantomjs, but I've found puppeteer easier for scraping javascript-rendered webpages. Also keep in mind that phantomjs is no longer being developed.
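
As for the original `ReferenceError: Can't find variable: Set`: the most likely cause is that the site's bundle uses `Set`, an ES2015 built-in that PhantomJS's older JavaScript engine does not provide, so saving or inlining index.dfd873fb.js would not make the error go away. The sketch below shows one way to check which such built-ins an environment lacks. The function name `missingGlobals` and the list of built-ins are illustrative assumptions, not taken from the site's code.

```javascript
// Illustrative list of ES2015 built-ins (an assumption, not from the bundle).
var ES2015_GLOBALS = ["Set", "Map", "Promise", "Symbol", "WeakMap"];

// Return the names in `required` that `env` does not define as functions.
// Written in ES5 syntax so it can also run in an older engine.
function missingGlobals(env, required) {
    var names = required || ES2015_GLOBALS;
    return names.filter(function (name) {
        return typeof env[name] !== "function";
    });
}
```

Calling `missingGlobals(globalThis)` in a current Node.js reports nothing missing. An engine without these built-ins would list them, and a modern bundle would then fail there in the way shown in the question.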
