Scraping a JavaScript-generated website with Node.js

Question

When I parse a static HTML page, my Node.js app works well. However, when the URL points to a page whose content is generated by JavaScript, the app doesn't work. How can I scrape a JavaScript-generated web page?

My app.js:

var express = require('express'),
  fs = require('fs'),
  request = require('request'),
  cheerio = require('cheerio'),
  app = express();

app.get('/scrape', function( req, res ) {

  var url = 'http://www.apache.org/';

  request( url, function( error, response, html ) {
    // Declared outside the if so it is always defined when written to disk.
    var json = { title : "" };

    if( !error ) {
      // cheerio parses the HTML as received; scripts on the page are not run.
      var $ = cheerio.load(html);
      json.title = $('body').find('.panel-title').text();
    }

    fs.writeFile('output.json', JSON.stringify(json, null, 4), function(err) {
      if( err ) {
        console.error( 'Could not write output.json:', err );
        return;
      }
      console.log( 'File successfully written! - Check your project directory for the output.json file' );
    });

    // Finally, we'll just send out a message to the browser reminding you that this app does not have a UI.
    res.send( 'Check your console!' );
  });
});

app.listen(8081);
console.log('Magic happens on port 8081');
exports = module.exports = app;


Recommended answer

Cheerio won't execute the JavaScript on the page, because it is only made for parsing plain HTML. Any content that a script adds after the page loads is therefore missing from what Cheerio sees.
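
To make the point concrete, here is a minimal, dependency-free sketch. The page markup and the staticTitle helper are hypothetical, and a regular expression stands in for a static parser such as Cheerio only to keep the example self-contained. The title element is empty in the markup and would be filled in only by a browser that runs the inline script.

```javascript
// Hypothetical page: the title element is empty in the markup and is
// filled in by an inline script only when a browser executes it.
const rawHtml = [
  '<html><body>',
  '<h3 class="panel-title"></h3>',
  '<script>',
  '  document.querySelector(".panel-title").textContent = "Latest news";',
  '</script>',
  '</body></html>'
].join('\n');

// Stand-in for a static parser: reads the text between the tags and
// never runs the script. (Illustrative helper, not part of any library.)
function staticTitle(html) {
  const match = html.match(/<h3 class="panel-title">([^<]*)<\/h3>/);
  return match ? match[1] : null;
}

// The element is found, but its text is empty because the script never ran.
console.log(JSON.stringify(staticTitle(rawHtml)));
```

The same thing happens in app.js: the selector may match, but the text that the page's scripts would have inserted is not there.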

I'd suggest a different approach, using something like PhantomJS (a headless browser that loads the page and runs its scripts): http://phantomjs.org/
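
As a rough sketch of the headless-browser approach, the function below loads a page, lets its scripts run, and then reads an element's text. It assumes a Puppeteer-style browser object rather than PhantomJS; that substitution is an assumption made here and is not part of the answer above. The names scrapeRenderedText and launchBrowser are hypothetical.

```javascript
// Loads a page in a headless browser, lets its scripts run, and returns
// the text of the first element matching `selector`.
// `launchBrowser` is a hypothetical injection point: pass a function that
// returns a Puppeteer-style browser (assumption: Puppeteer, not PhantomJS).
async function scrapeRenderedText(launchBrowser, url, selector) {
  const browser = await launchBrowser();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // $eval runs the callback inside the page against the matched element.
    return await page.$eval(selector, function (el) { return el.textContent; });
  } finally {
    // Always release the browser, even if navigation or the lookup fails.
    await browser.close();
  }
}
```

With Puppeteer installed, a call might look like scrapeRenderedText(function () { return require('puppeteer').launch(); }, 'http://www.apache.org/', '.panel-title'), and the resolved string could then be written to output.json in place of the Cheerio result.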
