Use Ruby Mechanize to scrape all successive pages


Problem Description

I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data off of each page.

For example, I want to go to a specific site (craigslist in below example), scrape the data from the first page, go to the next page, scrape all relevant data, etc, until the very last page.

In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the first page.

Can someone familiar with Ruby/Mechanize point me in the right direction on the best way to accomplish this task? I've spent countless hours trying to figure this out and feel like I'm missing something very basic.

Thanks in advance for your help.

require 'mechanize'
require 'pry'

# initialize
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari'}
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array to dump contents into
property_results = []

# Scrape all successive pages from craigslist
while page.link_with(:dom_class => "button next") != nil
  next_link = page.link_with(:dom_class => "button next")
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = next_link.click
end



UPDATE: I found this, but still no dice:

@pguardiario

require 'mechanize'
require 'httparty'
require 'pry'

# initialize
agent = Mechanize.new 
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# create empty array
property_results = []

# Scrape all successive pages from craigslist
while link = page.at('[rel=next]')
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = agent.get link[:href]
end
binding.pry

Recommended Answer

Whenever you see a [rel=next], that's the thing you want to follow:

page = agent.get url
do_something_with page
while link = page.at('[rel=next]')
  page = agent.get link[:href]
  do_something_with page
end
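Note that the pattern above scrapes the page you already have *before* following the next link, so the last page (which has no `[rel=next]`) is still processed. A minimal, network-free sketch of the same loop shape — the `FakePage` struct, `SITE` hash, and `get` method are illustrative stand-ins, not part of Mechanize:

```ruby
# Illustrative stand-in for Mechanize::Page: each fake page exposes
# its listing titles and the URL of the next page (nil on the last page).
FakePage = Struct.new(:titles, :next_url)

# Hypothetical three-page site, keyed by URL.
SITE = {
  "/page1" => FakePage.new(["Listing A", "Listing B"], "/page2"),
  "/page2" => FakePage.new(["Listing C"], "/page3"),
  "/page3" => FakePage.new(["Listing D"], nil)
}

# Stand-in for agent.get.
def get(url)
  SITE.fetch(url)
end

# Same shape as the answer: scrape the page you have, then follow next.
results = []
page = get("/page1")
results.concat(page.titles)   # do_something_with page
while (next_url = page.next_url)
  page = get(next_url)
  results.concat(page.titles) # do_something_with page
end

results # => ["Listing A", "Listing B", "Listing C", "Listing D"]
```

In the updated question code, the loop moves to the next page *after* scraping, but because the last page has no `[rel=next]`, the `while` condition fails before that final page is scraped — the scrape-first, advance-second ordering shown here avoids that.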
