Is it possible to plug a JavaScript engine with Ruby and Nokogiri?


Question

I'm writing an application to crawl some websites and scrape data from them. I'm using Ruby, Curl and Nokogiri to do this. In most cases it's straightforward: I only need to request a URL and parse the HTML it returns. That setup works perfectly fine.

However, in some scenarios the websites retrieve data based on user input on some radio buttons. This invokes some JavaScript, which fetches more data from the server. The generated URL and the posted data are determined by the JavaScript code.
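To make the scenario concrete, here is a minimal sketch, using only Ruby's standard library, of the kind of request such JavaScript ends up sending. The endpoint `http://example.com/data` and the form field `choice` are hypothetical placeholders; on a real site both would have to be read out of the page's JavaScript or the browser's network log. The request is only built here, not sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint, used only for illustration.
ENDPOINT = 'http://example.com/data'

# Build (but do not send) the POST request that the page's JavaScript
# would otherwise issue when a radio button is selected.
# The field name 'choice' is an assumption, not taken from any real site.
def build_request(choice)
  uri = URI.parse(ENDPOINT)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data('choice' => choice)
  request
end

request = build_request('monthly')
puts request.path # the path the request targets
puts request.body # the form-encoded data that would be posted
```

When the URL and fields are simple enough to reproduce this way, the response can be handed to Nokogiri as before. When they are computed by non-trivial JavaScript, something has to actually execute that script.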

Is it possible to use:

  1. A JavaScript library alongside this setup that could execute the JavaScript in the HTML page for me?

  2. Apart from using a different library, some integration or other way for the HTML and JS libraries to communicate? For instance, if a button is clicked, Nokogiri needs to call the JavaScript, and the JavaScript then needs to update Nokogiri.

In case my approach doesn't seem the best, what would you suggest for building a crawler + scraper in Ruby?

Edit: It looks like point 1 is possible using therubyracer, as it embeds the V8 engine in your code, but is there an option for point 2?

Recommended answer

You are looking for Watir (http://watir.com/), which drives a real browser and lets you perform any action on a web page that a user could. There's a similar project called Selenium.

You can even use Watir with a so-called 'headless' browser on a Linux machine.

Suppose we have this HTML:

<p id="hello">Hello from HTML</p>

and this JavaScript:

document.getElementById('hello').innerHTML = 'Hello from JavaScript';

(Demo: http://jsbin.com/ivihur)

and you want to get the dynamically inserted text. First, you need a Linux box with xvfb and firefox installed. For example, on Ubuntu:

$ apt-get install xvfb firefox

You will also need the watir-webdriver and headless gems, so go ahead and install them as well:

$ gem install watir-webdriver headless

Then you can read the dynamic content from the page with something like this:

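require 'rubygems'
require 'watir-webdriver'
require 'headless'

# Start a virtual display (Xvfb) so Firefox can run without a screen
headless = Headless.new
headless.start
browser = Watir::Browser.new

browser.goto 'http://jsbin.com/ivihur' # our example
# Read the element's text once the page has loaded
el = browser.element :css => '#hello'
puts el.text

# Shut down the browser and the virtual display
browser.close
headless.destroy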

If everything went right, this will output:

Hello from JavaScript

I know this also runs a browser in the background, but it's the easiest solution to your problem I could come up with. Starting the browser takes quite a while, but subsequent requests are fast: running goto and then fetching the dynamic text above, repeated several times, took about 0.5 seconds per request on my Rackspace Cloud Server.
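Since the startup cost is paid once, it makes sense to start the browser a single time and reuse it for every page. The helper below is a minimal sketch of that idea, not part of Watir: `fetch_texts` is a hypothetical name, and it only assumes that the object passed in responds to `goto` and `element(css: ...)`, as the browser in the example above does. It also times each fetch with `Benchmark.realtime` from the standard library, so the per-request cost can be measured rather than guessed.

```ruby
require 'benchmark'

# Visit each URL with the same, already started browser and return the text
# of the element matching `selector`, along with the seconds each fetch took.
# `browser` is assumed to respond to `goto` and `element(css: ...)`.
def fetch_texts(browser, urls, selector)
  urls.map do |url|
    text = nil
    seconds = Benchmark.realtime do
      browser.goto url
      text = browser.element(css: selector).text
    end
    { url: url, text: text, seconds: seconds }
  end
end
```

With the setup from the example, a call such as `fetch_texts(browser, ['http://jsbin.com/ivihur'], '#hello')` would go between `Watir::Browser.new` and `browser.close`. Wrapping the call in `begin ... ensure` with `browser.close` and `headless.destroy` in the `ensure` part keeps a failed page from leaving the browser and the virtual display running.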

Source: http://watirwebdriver.com/headless/
