Best web scraping Ruby on Rails library that handles dynamic HTML produced by JavaScript
Problem description
I am using Ruby on Rails with the Mechanize library to scrape store websites. The problem is that I often can't scrape certain elements, even though I can see them when I 'view source' on the site.
For example, Walmart's category (in the case below, "Health") is unscrapeable. I believe this is because the HTML is produced dynamically (e.g. by JavaScript). To scrape it, I need a browser to process the web request.
http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376
I am also using a Linux machine on Amazon EC2, where it would be tough to install a browser for UI scraping. Is there any Rails gem/plugin that can help me?
Thanks, all!!
Your question, rephrased, is: what's an easy way to parse an HTML document's DOM the same way a web browser would, and then execute the document's JavaScript against the parsed DOM, without running an actual web browser?
That's a little tricky.
However, all is not lost. Take a look at Capybara. Though it was created for acceptance testing, you can also use it for general grokking of documents. To execute JavaScript you'll need a driver that supports it, and since you want it to be "headless" (no browser GUI), that probably means using capybara-webkit, Akephalos or capybara-envjs.
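As a rough illustration of the Capybara route, here is a minimal sketch. It assumes the capybara and capybara-webkit gems are installed and that the `:webkit` driver works on your machine. The helper names `build_headless_session` and `scrape_text` and the `.category` selector are made up for this example; the real selector for Walmart's category would have to be found by inspecting the page.

```ruby
# Sketch only: read text that JavaScript inserts into a page, via Capybara.
# The driver choice (:webkit) and any CSS selector passed in are assumptions.

# Build a Capybara session backed by a JavaScript-capable driver.
# The gems are required lazily, so scrape_text can be used with any object
# that responds to #visit and #find.
def build_headless_session(driver = :webkit)
  require "capybara"
  require "capybara/webkit" if driver == :webkit
  Capybara::Session.new(driver)
end

# Visit +url+, then return the stripped text of the first element matching
# the CSS +selector+. With a JavaScript-capable driver, the page's scripts
# run before the element is looked up.
def scrape_text(session, url, selector)
  session.visit(url)
  session.find(selector).text.strip
end

# Hypothetical usage (the selector is a placeholder, not Walmart's real markup):
#   session = build_headless_session
#   puts scrape_text(session, "http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376", ".category")
```

Passing the session in as an argument keeps the scraping logic independent of the driver, so swapping capybara-webkit for another JavaScript-capable driver should only touch `build_headless_session`.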
Another option might be Harmony, which I know nothing about except that it appears to do what you want but also appears not to be maintained anymore, so YMMV.