Iterating through multiple URLs to parse HTML with Nokogiri

Question

What I'm trying to do is scrape the names and prices of items from multiple vendors using Nokogiri. I'm passing the CSS selectors (used to find the names and prices) to Nokogiri as method arguments.

Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g. vendor, item_path)? Or am I going about this the completely wrong way?

Here is the code:

require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI

@@collection = Array.new # Array to hold meta hash

def scrape(url, vendor, item_path, name_path, price_path)
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
        @@collection << meta = Hash.new # Creates a new hash then add to global array
        meta[:vendor] = vendor
        meta[:name] = item.css(name_path).text.strip
        meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join 
    end
end

scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Recommended answer

You can pass multiple URLs the same way you're already doing it in your second example:

scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Your scrape method will have to iterate through those URLs, for instance:

def scrape(urls, vendor, item_path, name_path, price_path)
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      @@collection << meta = Hash.new # Creates a new hash then add to global array
      meta[:vendor] = vendor
      meta[:name] = item.css(name_path).text.strip
      meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
    end
  end
end

This also means that the first example also needs to be passed as an array:

scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
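If wrapping a single URL in an array at every call site feels awkward, one possible alternative is to normalize the argument with Ruby's built-in Kernel#Array, which wraps a lone string in an array and returns an existing array unchanged. The sketch below is only an illustration of that idea: each_url is a hypothetical helper that does not appear in the question or the answer, and it leaves out the Nokogiri parsing entirely.

```ruby
# Hypothetical helper (not part of the original answer): accepts either a
# single URL string or an array of URLs and yields each URL in turn.
def each_url(urls)
  # Kernel#Array wraps a lone string in an array and returns arrays unchanged.
  Array(urls).each { |url| yield url }
end

visited = []
each_url("page_a.html") { |url| visited << url }                  # single URL
each_url(["page_a.html", "page_b.html"]) { |url| visited << url } # array of URLs
```

Applied to the scrape method, the same idea would amount to writing Array(urls).each in place of urls.each, so that both of the original calls (the single string and the array) could be left as they are.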
