Use Ruby Mechanize to scrape all successive pages


Problem Description

I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data off of each page.

For example, I want to go to a specific site (craigslist in below example), scrape the data from the first page, go to the next page, scrape all relevant data, etc, until the very last page.

In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the first page.

Can someone familiar with Ruby/Mechanize point me in the right direction on the best way to accomplish this task? I've spent countless hours trying to figure this out and feel like I'm missing something very basic.

Thanks in advance for your help.

require 'mechanize'
require 'pry'

# initialize
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari'}
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array to dump contents into
property_results = []

# Scrape all successive pages from craigslist
while page.link_with(:dom_class => "button next") != nil
  next_link = page.link_with(:dom_class => "button next")
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = next_link.click
end



UPDATE: I found this, but still no dice:

@pguardiario

require 'mechanize'
require 'httparty'
require 'pry'

# initialize
agent = Mechanize.new 
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# create empty array
property_results = []

# Scrape all successive pages from craigslist
while link = page.at('[rel=next]')
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = agent.get link[:href]
end
binding.pry

Recommended Answer

Whenever you see a [rel=next], that's the thing you want to follow:

page = agent.get url
do_something_with page
while link = page.at('[rel=next]')
  page = agent.get link[:href]
  do_something_with page
end
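Note that the pattern above scrapes the page you already have *before* following the next link, so the last page (which has no `[rel=next]`) is still processed. A minimal, network-free sketch of the same loop shape — the `FakePage` struct, `SITE` hash, and `get` method are illustrative stand-ins, not part of Mechanize:

```ruby
# Illustrative stand-in for Mechanize::Page: each fake page exposes
# its listing titles and the URL of the next page (nil on the last page).
FakePage = Struct.new(:titles, :next_url)

# Hypothetical three-page site, keyed by URL.
SITE = {
  "/page1" => FakePage.new(["Listing A", "Listing B"], "/page2"),
  "/page2" => FakePage.new(["Listing C"], "/page3"),
  "/page3" => FakePage.new(["Listing D"], nil)
}

# Stand-in for agent.get.
def get(url)
  SITE.fetch(url)
end

# Same shape as the answer: scrape the page you have, then follow next.
results = []
page = get("/page1")
results.concat(page.titles)   # do_something_with page
while (next_url = page.next_url)
  page = get(next_url)
  results.concat(page.titles) # do_something_with page
end

results # => ["Listing A", "Listing B", "Listing C", "Listing D"]
```

In the updated question code, the loop moves to the next page *after* scraping, but because the last page has no `[rel=next]`, the `while` condition fails before that final page is scraped — the scrape-first, advance-second ordering shown here avoids that.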
