Ruby Mechanize, Nokogiri and Net::HTTP

Question

I am using Net::HTTP for HTTP requests and getting a response back:

uri = URI("http://www.example.com")
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new uri.request_uri
response = http.request request # Net::HTTPResponse object
body = response.body

To parse this HTML response with the Nokogiri gem, I would do:

nokogiri_obj = Nokogiri::HTML(body)

But if I want to use the Mechanize gem, I need to do this:

agent = Mechanize.new
mechanize_obj = agent.get("http://www.example.com")

Is it possible to use Net::HTTP to get the HTML response and then use the Mechanize gem to convert it into a Mechanize object, instead of calling agent.get()?

I want to bypass agent.get() because I am trying to use EventMachine::Iterator to make concurrent em-http requests.

EventMachine.run do
  EM::Iterator.new(urls, 3).each do |url,iter|
    puts "giving   #{url}   to httprequest now"
    http = EM::HttpRequest.new(url).get
    http.callback { |resp|
      uri = resp.send(:URI, url)
      puts "inside callback of #{url}"
      body = resp.response
      page = agent.parse(uri, resp, body)
    }
    iter.next
  end
end

But it's not working. I am getting an error:

/usr/local/rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize.rb:1165:in `parse': undefined method `[]' for #<EventMachine::HttpClient:0x0000001c18eb30> (NoMethodError)

When I use the parse method with a Net::HTTP response, it works fine and I get the Mechanize object:

uri = URI("http://www.example.com")
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new uri.request_uri
response = http.request request # Net::HTTPResponse object
body = response.body
agent = Mechanize.new
page = agent.parse(uri, response, body)

Am I passing the wrong arguments for the parse method while using em-http?

Recommended answer

I'm not sure why you think using Net::HTTP would be better. Mechanize handles redirects and cookies, and gives ready access to Nokogiri's parsed document.

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com')

# Use Nokogiri to find the content of the <h1> tag...
puts page.at('h1').content # => "Example Domains"

Note, setting the user_agent isn't necessary to reach example.com.
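
If a response fetched by another client does have to be passed to agent.parse, the NoMethodError in the question suggests that parse looks up headers with response[...], which Net::HTTPResponse supports and EventMachine::HttpClient does not. Below is a minimal sketch of one workaround, assuming parse only needs header lookup, iteration over the headers, and a status code. ResponseAdapter is a hypothetical name, not part of Mechanize or em-http, and the exact set of methods Mechanize calls has not been verified here.

```ruby
# Hypothetical adapter: gives a plain headers hash the small interface that
# the stack trace suggests Mechanize#parse relies on (header lookup via []).
class ResponseAdapter
  include Enumerable

  attr_reader :code

  # headers: Hash of header name => value, in any spelling
  #          (e.g. "CONTENT_TYPE" or "Content-Type").
  # code:    HTTP status, stored as a String like Net::HTTPResponse#code.
  def initialize(headers, code)
    @headers = {}
    headers.each { |name, value| @headers[normalize(name)] = value }
    @code = code.to_s
  end

  # Case-insensitive header lookup, similar to Net::HTTPResponse#[].
  def [](name)
    @headers[normalize(name)]
  end

  # Yields each normalized header name together with its value.
  def each(&block)
    @headers.each(&block)
  end

  private

  # "CONTENT_TYPE" and "Content-Type" both become "content-type".
  def normalize(name)
    name.to_s.downcase.tr('_', '-')
  end
end

adapter = ResponseAdapter.new({ 'CONTENT_TYPE' => 'text/html; charset=utf-8' }, 200)
puts adapter['Content-Type'] # prints "text/html; charset=utf-8"
puts adapter.code            # prints "200"
```

With em-http, the headers and status would presumably come from resp.response_header and resp.response_header.status, and the call would become agent.parse(uri, adapter, resp.response). Treat that last step as untested.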

If you want to use a threaded engine to retrieve pages, take a look at Typhoeus and Hydra.
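
For comparison, here is a rough sketch of bounded concurrent fetching using only Ruby's standard library. It is not Typhoeus or Hydra, just plain threads with Net::HTTP, and it is meant only to illustrate the idea of a fixed number of workers draining a list of URLs. The fetch_all method and its fetcher parameter are hypothetical names introduced for this example.

```ruby
require 'net/http'
require 'uri'

# Fetches each URL using at most `concurrency` worker threads and returns
# a Hash of url => body.
# `fetcher` is a hypothetical hook (any callable taking a URL string) so the
# HTTP client can be swapped out; by default it calls Net::HTTP.get.
def fetch_all(urls, concurrency = 3, fetcher = ->(url) { Net::HTTP.get(URI(url)) })
  queue = Queue.new
  urls.each { |url| queue << url }

  results = {}
  lock = Mutex.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        body = fetcher.call(url)
        lock.synchronize { results[url] = body }
      end
    end
  end

  workers.each(&:join)
  results
end
```

Calling fetch_all(urls, 3) would correspond to the concurrency of 3 used with EM::Iterator in the question. Each returned body is a String, so it can be handed to Nokogiri::HTML(body) as shown there.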
