Scraping dynamic page content with PhantomJS

Problem description

My company uses a website that hosts all of our FAQs and customer questions. We plan to wipe out the old data and enter new data, but the service has no backup or archive option for questions we no longer want to appear.

I tried to scrape the site using Perl and Mechanize, but I'm missing the customer comments on each page because they are loaded through AJAX. I have looked at PhantomJS and can save pages as images using an example script; however, I'd like a full HTML dump of the page and can't figure out how to get one. I used this example code on our site:

var page = new WebPage();

page.open('http://espn.go.com/nfl/', function (status) {
    // once page loaded, include jQuery from cdn
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
        // once jQuery loaded, run some code
        // inserts our custom text into the page
        page.evaluate(function () {
            $("h2").html('Many NFL Players Scared that Chad Moon Will Enter League');
        });
        // take screenshot and exit
        page.render('espn.png');
        phantom.exit();
    });
});

Is there a way to use PhantomJS to get a full dump of the page, similar to View Source in Chrome? I can do this with Perl + Mechanize, but I don't see how to do it with PhantomJS.

Solution

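You can use page.content to get the full HTML DOM. It returns the page's markup as a string, reflecting the DOM at the moment it is read, so content added by scripts is included once it has loaded.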
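
As a rough illustration of that answer, here is a minimal sketch of a PhantomJS script that writes the rendered HTML to a file. The URL, the two-second delay, the output file name, and the helper name dumpRenderedHtml are all placeholders assumed for this example, not values from the question. The fixed delay is only a crude way to give the AJAX comments time to arrive and may need tuning for a real site.

```javascript
// Sketch for PhantomJS (run with: phantomjs dump.js).
// Assumptions: the URL, the 2000 ms delay, and 'page.html' are placeholders.

// Hypothetical helper: opens `url` in `page`, waits `delayMs` for
// AJAX-loaded content, then passes the rendered HTML (page.content)
// to `done`, or null if the page failed to load.
function dumpRenderedHtml(page, url, delayMs, done) {
  page.open(url, function (status) {
    if (status !== 'success') {
      done(null);
      return;
    }
    // page.content is the DOM serialized at the time it is read,
    // so wait briefly before reading it.
    setTimeout(function () {
      done(page.content);
    }, delayMs);
  });
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');

  dumpRenderedHtml(page, 'http://example.com/faq', 2000, function (html) {
    if (html === null) {
      console.log('Failed to load page');
      phantom.exit(1);
      return;
    }
    fs.write('page.html', html, 'w'); // save the full HTML dump
    phantom.exit();
  });
}
```

Calling console.log(html) instead of fs.write would print the dump to standard output, which may be easier to pipe into an existing Perl workflow.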
