Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails)


Question

I've run into a problem that I can't seem to get past. I'm also brand new to Ruby on Rails, hence the number of questions.

I am attempting to scrape a webpage such as the following:

http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx

I would like to scrape the addresses, phone numbers, and URL of the next page, which in this case is

http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx

I've tried just about everything I could think of, but nothing seems to work, apparently because the elements are set to invisible or something similar.

The address is inside an h3 tag, but it does not appear to be scrapable. I've also been looking into scRUBYt! at http://www.rubyrailways.com/ajax-scraping-with-scrubyt-linkedin-google-analytics-yahoo-suggestions/, but I really can't make heads or tails of how to apply it in this case.

I would really appreciate any pointers, as this is an obstacle I need to get past in order to move forward on my assignment. Thanks in advance for any help.

Solution

In the particular example you have given, the elements are not hidden; they are loaded via Ajax after the page loads. So what you need is an HTTP client that can run JavaScript (in practice, a web browser) in order to see those addresses and the other content.
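One rough way to confirm this for yourself is to download the page exactly as the server sends it, with no JavaScript executed, and check whether the tag you are after is present at all. The sketch below uses only Ruby's standard library; the helper names `contains_tag?` and `fetch_raw_html` are made up for this illustration, and the regex is only a quick yes/no check, not a substitute for an HTML parser.

```ruby
require 'net/http'
require 'uri'

# Returns true when the HTML text contains an opening tag with the given name.
# This is a crude regex test, good enough for a presence check only.
def contains_tag?(html, tag)
  html.match?(/<#{Regexp.escape(tag)}[\s>\/]/i)
end

# Downloads the response body as sent by the server. No JavaScript is run,
# and redirects are not followed by this simple call.
def fetch_raw_html(url)
  Net::HTTP.get(URI.parse(url))
end

# Usage (makes a network request, so it is left commented out):
# html = fetch_raw_html('http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx')
# if contains_tag?(html, 'h3')
#   puts 'h3 is present in the initial HTML'
# else
#   puts 'h3 is missing, so it is probably added later by JavaScript'
# end
```

If the tag is missing from the raw response but visible in the browser, a plain HTML parser such as Hpricot never receives it, which matches the behaviour described in the question.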

If you want to fully automate the process and scrape data that is fetched through Ajax or JavaScript, you can try Selenium. Even though it was not developed for that purpose, it serves your needs.
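As a minimal sketch of the Selenium approach, assuming the `selenium-webdriver` gem and a local Firefox are installed: the code opens the page in a real browser, polls until the Ajax-loaded elements appear, and returns their text. The `h3` selector comes from the question; the method names, the timeout, and the polling interval are assumptions for illustration, and the real page may need a more specific selector.

```ruby
# Builds a real browser driver. The gem is required lazily so that
# scrape_texts below can also be exercised with a stand-in driver.
def build_driver
  require 'selenium-webdriver'
  Selenium::WebDriver.for(:firefox)
end

# Loads the URL, then polls until at least one element matching the CSS
# selector exists or the timeout (in seconds) elapses.
# Returns the elements' text, or an empty array on timeout.
def scrape_texts(driver, url, selector: 'h3', timeout: 10, interval: 0.5)
  driver.get(url)
  deadline = Time.now + timeout
  loop do
    elements = driver.find_elements(css: selector)
    return elements.map(&:text) unless elements.empty?
    return [] if Time.now >= deadline
    sleep(interval)
  end
end

# Usage (opens a browser window, so it is left commented out):
# driver = build_driver
# begin
#   url = 'http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx'
#   puts scrape_texts(driver, url)
# ensure
#   driver.quit
# end
```

Passing the driver in as an argument keeps the polling logic separate from browser setup, so a different browser can be swapped in by changing only `build_driver`.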
