Iterating through multiple URLs to parse HTML with Nokogiri
Question
What I'm trying to do is scrape the names and prices of items from multiple vendors using Nokogiri. I'm passing the CSS selectors (used to find the names and prices) to Nokogiri as method arguments.
Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g. vendor, item_path)? Or am I going about this the completely wrong way?
Here is the code:
require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI

@@collection = Array.new # Array to hold meta hash

def scrape(url, vendor, item_path, name_path, price_path)
  doc = Nokogiri::HTML(open(url)) # Opens URL
  items = doc.css(item_path) # Sets items
  items.each do |item| # Iterates through each item on grid
    @@collection << meta = Hash.new # Creates a new hash then adds it to the global array
    meta[:vendor] = vendor
    meta[:name] = item.css(name_path).text.strip
    meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
  end
end

scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B", "#items", ".productname", ".price")
Recommended answer
You can pass multiple URLs the same way you're already doing it in your second example:
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B", "#items", ".productname", ".price")
Your scrape method will have to iterate through those URLs, for instance:
def scrape(urls, vendor, item_path, name_path, price_path)
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      @@collection << meta = Hash.new # Creates a new hash then adds it to the global array
      meta[:vendor] = vendor
      meta[:name] = item.css(name_path).text.strip
      meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
    end
  end
end
This also means that the first example needs to be passed as an array as well:
scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
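If wrapping every single URL in an array feels awkward, one possible alternative is to normalize the argument with Ruby's built-in Kernel#Array, which wraps a lone string in an array and leaves an existing array unchanged. The sketch below is only an illustration of that idea: each_url is a hypothetical helper name, not part of the original answer, and it uses no Nokogiri calls so it can be run on its own.

```ruby
# Hypothetical helper (an assumption for illustration, not from the answer):
# accepts either one URL string or an array of URL strings and yields each
# URL in turn.
def each_url(urls)
  # Kernel#Array turns "page_a.html" into ["page_a.html"] and returns an
  # existing array as-is, so both call styles take the same path.
  Array(urls).each { |url| yield url }
end

visited = []
each_url("page_a.html") { |url| visited << url }                   # single URL
each_url(["page_a.html", "page_b.html"]) { |url| visited << url }  # several URLs
p visited
```

Under the same assumption, replacing urls.each with Array(urls).each inside scrape would let both of the original calls work without changing how the first one is written.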