Best web scraping Ruby on Rails library that handles dynamic HTML produced by JavaScript


Question

I am using Ruby on Rails with the Mechanize library to scrape store websites. The problem is that I often can't scrape certain elements, even though I can see them when I 'view source' on the site.

For example, Walmart's category (in the example below it is "Health") is unscrapable. I believe this is because it is dynamically generated HTML (e.g. produced by JavaScript). To scrape it, I need a browser to process the web request.

http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376

I am also using a Linux machine on Amazon EC2, where it would be tough to install a browser for UI scraping. Is there any Rails gem/plugin that can help me?

Thanks, all!!

Solution

Your question, rephrased, is: what's an easy way to parse an HTML document's DOM the same way a web browser would, then execute the document's JavaScript against the parsed DOM, all without running an actual web browser?

That's a little tricky.

However, all is not lost. Take a look at Capybara. Though it was created for acceptance testing, you can also use it for general grokking of documents. To execute JavaScript you'll need a driver that supports it, and since you want it to be "headless" (no browser GUI), that probably means capybara-webkit, Akephalos, or capybara-envjs.
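To make the Capybara suggestion concrete, here is a minimal sketch, assuming the capybara and capybara-webkit gems are installed. The helper names `build_headless_session` and `scrape_texts` are hypothetical, as is any CSS selector you pass in; you would need to inspect the rendered page to find the right one. On a server with no display, capybara-webkit will probably also need something like xvfb.

```ruby
# Hypothetical helper (not part of Capybara): builds a session backed by
# the capybara-webkit driver. The gems are required lazily so the rest of
# the file loads even where they are not installed.
def build_headless_session
  require 'capybara'
  require 'capybara/webkit'   # registers the :webkit driver

  # We are scraping a remote site, so Capybara should not boot a local app.
  Capybara.run_server = false
  Capybara::Session.new(:webkit)
end

# Hypothetical helper: visits the URL with a JavaScript-capable driver and
# returns the stripped text of every element matching the CSS selector.
# Depending on the Capybara version, you may have to wait for the element
# to appear before calling `all`.
def scrape_texts(url, selector, session: build_headless_session)
  session.visit(url)
  session.all(selector).map { |node| node.text.strip }
end
```

Calling `scrape_texts(url, 'some-css-selector')` with the product URL from the question would then return the matching text nodes after the page's scripts have run. The session is passed in as a keyword argument so that a different driver (or a stub in tests) can be swapped in without touching the scraping logic.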

Another option might be Harmony, which I know nothing about except that it appears to do what you want but also appears not to be maintained anymore, so YMMV.
