Best web scraping Ruby on Rails library that handles dynamic HTML produced by JavaScript

Views: 134 | Posted: 2017/7/22 13:50:49 | Tags: html ruby-on-rails dynamic rubygems web-scraping

Question

I am using Ruby on Rails with the Mechanize library to scrape store websites. The problem is that I often can't scrape certain elements, even though I can see them when I 'view source' on the site.

For example, Walmart's category (in the case below it is "Health") is unscrapable. I believe this is because it is dynamically produced HTML (e.g. generated by JavaScript). In order to scrape this, I need a browser to process the web request.

http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376

I am also using a Linux machine on Amazon EC2, so it would be tough to install a browser for UI scraping. Is there any Rails gem/plugin that can help me?

Thanks, all!!

Answer

Your question, rephrased, is: what's an easy way to parse an HTML document's DOM the same way a web browser would, and then execute the document's JavaScript against the parsed DOM, without running an actual web browser?

That's a little tricky.

However, all is not lost. Take a look at Capybara. Though it was created for acceptance testing, you can also use it for general grokking of documents. To execute JavaScript you'll need a driver that supports it, and since you want it to be "headless" (no browser GUI), that probably means using capybara-webkit, Akephalos, or capybara-envjs.
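As a rough illustration of the Capybara approach, here is a minimal sketch, assuming the capybara and capybara-webkit gems are installed and that the webkit driver can run on the server. The helper names `build_session` and `scrape_text` and the CSS selector in the usage comment are hypothetical, chosen only for this example; the real selector depends on the page's markup.

```ruby
# Sketch only: read an element that exists only after the page's
# JavaScript has run, using a JavaScript-capable Capybara driver.

# Builds a standalone Capybara session (no Rails app or test suite needed).
# Assumption: the capybara and capybara-webkit gems are installed; requiring
# "capybara/webkit" registers the :webkit driver.
def build_session(driver = :webkit)
  require "capybara"
  require "capybara/webkit" if driver == :webkit
  Capybara::Session.new(driver)
end

# Visits +url+ and returns the stripped text of the first element matching
# +css+, or nil if no such element appears.
# +session+ only needs to respond to visit, has_css? and first.
def scrape_text(session, url, css)
  session.visit(url)
  # has_css? retries for a short while, which gives scripts time to render.
  return nil unless session.has_css?(css)
  session.first(css).text.strip
end

# Hypothetical usage (the selector is a placeholder, not Walmart's real markup):
#   session = build_session
#   puts scrape_text(session,
#                    "http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376",
#                    ".breadcrumb")
```

Passing the session in as an argument keeps the scraping logic independent of the driver, so swapping capybara-webkit for another JavaScript-capable driver should only require changing `build_session`.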

Another option might be Harmony. I know nothing about it except that it appears to do what you want, but it also appears to be unmaintained, so YMMV.
