getaddrinfo error with Mechanize

Problem description

I wrote a script that goes through all of the customers in our database, verifies that each website URL works, and tries to find a Twitter link on the homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs are verified, we start getting getaddrinfo errors for every URL.

Here's a copy of the code that scrapes a single URL:

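require 'mechanize'

# Scrapes a single URL and returns [url_found, twitter_name].
# normalize_url and find_twitter_name are defined elsewhere in the script.
def scrape_url(url)
  url_found = false
  twitter_name = nil

  begin
    # A fresh agent is created for every URL.
    agent = Mechanize.new do |a|
      a.follow_meta_refresh = true
    end

    agent.get(normalize_url(url)) do |page|
      url_found = true
      twitter_name = find_twitter_name(page)
    end

    @err << "[#{@current_record}] SUCCESS\n"
  rescue Exception => e
    # Log the failure and move on to the next record.
    @err << "[#{@current_record}] ERROR (#{url}): "
    @err << e.message
    @err << "\n"
  end

  [url_found, twitter_name]
end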

Note: I've also run a version of this code that creates a single Mechanize instance that gets shared across all calls to scrape_url. It failed in exactly the same fashion.

When I run this on EC2, it gets through almost exactly 1,000 URLs, then returns this error for the remaining 9,000+:

getaddrinfo: Temporary failure in name resolution

Note, I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.

Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:

getaddrinfo: nodename nor servname provided, or not known

Does anyone know how I can get the script to make it through all of the records?

Recommended answer

I found the solution. Mechanize was leaving connections open and relying on GC to clean them up. After a certain point, there were enough open connections that no additional outbound connection could be established to do a DNS lookup. Here's the code that made it work:

agent = Mechanize.new do |a| 
  a.follow_meta_refresh = true
  a.keep_alive = false
end

By setting keep_alive to false, the connection is immediately closed and cleaned up.
