Scrape a webpage and navigate by clicking buttons

Question

I want to perform the following actions on the server side:

1) Scrape a webpage
2) Simulate a click on that page and navigate to the new page
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Send the data back to the client as JSON or in some other format

I am thinking of doing this with Node.js.

But I am confused as to which module I should use:
a) Zombie
b) Node.io
c) PhantomJS
d) JSDOM
e) Anything else

I have installed Node.io, but I am not able to run it from the command prompt.

PS: I am working on Windows Server 2008.

Recommended answer

Zombie.js and Node.io run on JSDOM, so your options are JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS), or Cheerio.

  • JSDOM is fairly slow because it has to recreate the DOM and CSSOM in Node.js.
  • PhantomJS/SlimerJS are proper headless browsers, so performance is acceptable and they are very reliable.
  • Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js: it only parses the HTML you give it, and no JavaScript is executed. You therefore can't really click on buttons or links, but it is very fast for scraping webpages.

Given your requirements, I'd probably go with a headless browser. In particular, I'd choose CasperJS: it has a nice, expressive API, it's fast and reliable (unlike JSDOM, it doesn't have to reinvent the wheel on how to parse and render the DOM or CSS), and it makes interacting with elements such as buttons and links very easy.

Your workflow in CasperJS should look more or less like this:

// Run this script with the casperjs executable, not with node.
var casper = require('casper').create();

casper.start();

casper
  .then(function(){
    console.log("Start:");
  })
  .thenOpen("https://www.domain.com/page1")
  .then(function(){
    // scrape something from the first page
    this.echo(this.getHTML('h1#foobar'));
  })
  .thenClick("#button1")
  .then(function(){
    // scrape something else once the click has navigated to the new page
    this.echo(this.getHTML('h2#foobar'));
  })
  .thenClick("#button2")
  // send the collected data back to your own server
  .thenOpen("http://myserver.com", {
    method: "post",
    data: {
        my: 'data'
    }
  }, function() {
      this.echo("data sent back to the server");
  });

casper.run();
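
The last step above posts the scraped data to your own server, which can then pass it on to the client as JSON. A minimal sketch of that receiving side, using only Node's built-in `http` and `querystring` modules, might look like the following. It assumes that CasperJS sends an object passed as `data` as a form-encoded body (for example `my=data`); the names `parseFormBody`, `buildJsonReply` and `createServer`, and the shape of the JSON payload, are hypothetical and not part of the original answer.

```javascript
// Hypothetical receiving endpoint for the data posted by the CasperJS script.
const http = require('http');
const querystring = require('querystring');

// Decode a form-encoded body such as "my=data" into a plain object.
function parseFormBody(body) {
  return Object.assign({}, querystring.parse(body));
}

// Wrap the scraped fields in the JSON payload that is sent back.
function buildJsonReply(fields) {
  return JSON.stringify({ ok: true, scraped: fields });
}

// Build (but do not start) an HTTP server that answers POST requests with JSON.
function createServer() {
  return http.createServer(function (req, res) {
    if (req.method !== 'POST') {
      res.writeHead(405, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ ok: false, error: 'POST only' }));
      return;
    }
    let body = '';
    req.setEncoding('utf8');
    req.on('data', function (chunk) { body += chunk; });
    req.on('end', function () {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(buildJsonReply(parseFormBody(body)));
    });
  });
}
```

To try it, call `createServer().listen(3000)` (the port is an arbitrary choice) and point the final `thenOpen` of the CasperJS script at that address.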
