Web crawler that can interpret JavaScript


Question

I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in the Firebug HTML window. The best example is Kayak.com: when you 'view source', you cannot see the DOM that the browser ends up displaying, but you can save the resulting HTML through Firebug.

How would I go about doing this? What tools exist that would help me?

Recommended answer

Ruby's Capybara is an integration test library, but it can also be used to write stand-alone web crawlers. Because it runs on backends such as Selenium or headless WebKit, it interprets JavaScript out of the box:

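require 'capybara/dsl'
require 'capybara-webkit'

# Make the Capybara DSL (page, visit, ...) available at the top level.
include Capybara::DSL
# Use the headless WebKit driver so that the page's JavaScript is executed.
Capybara.current_driver = :webkit
# Base URL that relative paths passed to visit are resolved against.
Capybara.app_host = "http://www.google.com"
page.visit("/")
# page.html returns the current DOM serialized as HTML, not the raw source.
puts(page.html)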
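
Pages like the Kayak.com example build much of their DOM after the initial load, so it can help to wait for a known element before dumping the HTML. The following is a minimal sketch rather than part of the original answer: `rendered_html`, `crawl_with_webkit`, and the `wait_for` parameter are hypothetical names, and it assumes the capybara and capybara-webkit gems are installed. It relies on Capybara's `has_css?`, which keeps retrying until the selector matches or the configured wait time runs out.

```ruby
# Hypothetical helper (not from the original answer): visit a URL with any
# Capybara-style session and return the rendered DOM as an HTML string.
# If wait_for is given, wait until that CSS selector is present first.
def rendered_html(session, url, wait_for: nil)
  session.visit(url)
  # has_css? returns false only after Capybara's wait time has elapsed.
  if wait_for && !session.has_css?(wait_for)
    raise "Timed out waiting for #{wait_for.inspect} on #{url}"
  end
  session.html
end

# Hypothetical wrapper: build a headless WebKit session and render one page.
# The gems are required lazily, so loading this file does not need them.
def crawl_with_webkit(url, wait_for: "body")
  require 'capybara'
  require 'capybara-webkit'
  session = Capybara::Session.new(:webkit)
  rendered_html(session, url, wait_for: wait_for)
end
```

Calling `puts crawl_with_webkit("http://www.google.com/")` would then print the rendered DOM. Passing the session in as an argument keeps the waiting logic independent of the driver, so the same helper should work with a Selenium-backed session as well.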
