Best way to concurrently check urls (for status i.e. 200,301,404) for multiple urls in database


Question

Here's what I'm trying to accomplish. Let's say I have 100,000 urls stored in a database and I want to check each of these for http status and store that status. I want to be able to do this concurrently in a fairly small amount of time.

I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.

Thoughts?

Recommended Answer

Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.

The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
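
Applied to the original question, a minimal sketch of that batching idea might look like the following. It assumes the URLs have already been loaded out of the database into an array, and save_status is a placeholder for whatever write-back you use; the :method and :max_concurrency options are ordinary Typhoeus usage, but double-check them against your Typhoeus version:

#!/usr/bin/env ruby

require 'typhoeus'

# Placeholder: however you pull the stored URLs out of your database.
urls = ['http://example.com/', 'http://example.org/']

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

# Work through the URLs in fixed-size batches so the queue stays at a healthy level.
urls.each_slice(1000) do |batch|
  batch.each do |u|
    request = Typhoeus::Request.new(u, :method => :head)
    request.on_complete do |resp|
      # Placeholder: write the status (200, 301, 404, ...) back to the DB.
      save_status(u, resp.code)
    end
    hydra.queue(request)
  end
  hydra.run   # blocks until the current batch has finished
end

Whether a 301 shows up as-is or as the redirect target's final status depends on whether you tell Typhoeus to follow redirects.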

Paul Dix, the original author, talked about his design goals on his blog.

This is some sample code I wrote to download archived mailing lists so I could do local searches. I deliberately removed the URL to keep the site from being subjected to DoS attacks if people start running the code:

#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

# Fetch the index page and parse out the links to the gzipped archives.
url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)

# Hydra runs the queued requests concurrently, up to ten at a time.
hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request = Typhoeus::Request.new(gzip_url.to_s)

  # on_complete fires as each download finishes; write the body to disk in binary mode.
  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'wb') do |fo|
      fo.write resp.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run

Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to mess with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.

I give it an 8 out of 10; it's got a great beat and I can dance to it.

EDIT:

When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Either can give you a response you can use to determine the freshness of your URLs.
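
As a rough sketch of the conditional-GET idea (the URL is a placeholder, and :headers is the ordinary way Typhoeus passes custom headers, though your version may differ; 304 is the standard "Not Modified" status):

require 'typhoeus'
require 'time'

# Only fetch the page if it changed since the last check; a 304 means it has not.
last_checked = Time.now - (60 * 60 * 24)  # e.g. one day ago
request = Typhoeus::Request.new(
  'http://example.com/page',
  :headers => { 'If-Modified-Since' => last_checked.httpdate }
)

request.on_complete do |resp|
  if resp.code == 304
    puts 'not modified since last check'
  else
    puts "status #{resp.code}"
  end
end

hydra = Typhoeus::Hydra.new
hydra.queue(request)
hydra.run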
