Web crawler that can interpret JavaScript
Question
I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to what the Firebug HTML window shows. The best example is Kayak.com: 'view source' does not show the DOM that the browser ends up displaying, but you can save the resulting HTML through Firebug.
How would I go about doing this? What tools exist that would help me?
Recommended answer
Ruby's Capybara is an integration testing library, but it can also be used to write stand-alone web crawlers. Because it runs on backends such as Selenium or headless WebKit, it interprets JavaScript out of the box:
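require 'capybara/dsl'
require 'capybara-webkit'

include Capybara::DSL

# Use the headless WebKit driver so the page's JavaScript is executed
Capybara.current_driver = :webkit
Capybara.app_host = "http://www.google.com"

# Load the page, then print the DOM as rendered after JavaScript has run
page.visit("/")
puts(page.html)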
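The snippet above renders a single page. To turn it into a crawler, the rendered HTML still has to be mined for further URLs to visit. The following is a minimal sketch of that step using only Ruby's standard library. The helper name `extract_links` is hypothetical (it is not part of Capybara), and the regular expression is a rough stand-in for a real HTML parser, so treat it as an illustration rather than a robust implementation.

```ruby
require 'uri'
require 'set'

# Hypothetical helper (not part of Capybara): collect the absolute
# http/https links found in a string of rendered HTML.
# Assumption: a regex is good enough for illustration; a real crawler
# would likely use a proper HTML parser instead.
def extract_links(html, base_url)
  links = Set.new
  hrefs = html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten
  hrefs.each do |href|
    begin
      # Resolve relative links against the page's URL
      uri = URI.join(base_url, href.strip)
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end
    # Keep only http/https (URI::HTTPS is a subclass of URI::HTTP)
    next unless uri.is_a?(URI::HTTP)
    uri.fragment = nil # "#section" anchors point at the same page
    links << uri.to_s
  end
  links.to_a
end
```

With the Capybara snippet above, something like `extract_links(page.html, Capybara.app_host)` could supply the next URLs to pass to `page.visit`, with a set of already-visited URLs to avoid loops.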