Trying to build a distributed crawler with ZeroMQ

Problem Description

I just started to learn ZeroMQ and want to build a distributed web crawler as an example while learning.

My idea is to have a "server", written in PHP, which accepts a URL where the crawling should start.

Workers (C# CLI) will have to crawl that URL, extract links, and push them back onto a stack on the server. The server keeps sending URLs from the stack to workers. Perhaps Redis will keep track of all crawled URLs, so we don't crawl sites multiple times and have the ability to extract statistics about the current process.

I would like the server to distribute tasks evenly, be aware of new/missing workers, and redistribute URLs when a worker doesn't respond.

Why PHP for the server: I'm just very comfortable with PHP, that is all. I don't want to make the example/testing project more complicated.

Why C# for the minions: because it runs on most Windows machines. I can give the executable to various friends who can just run it and help me test my project.

The crawling process and the Redis functionality are not part of my question.

My first approach was the PUSH/PULL pattern, which generally works for my scenario, but isn't aware of its minions. I think I need a DEALER/ROUTER broker in the middle and have to handle the worker-awareness myself.

I found this question, but I'm not really sure if I understand the answer...

I'm asking for some hints on how to implement the ZMQ parts. Is the DEALER approach correct? Is there any way to get automatic worker-awareness? I think I need some resources/examples, or do you think I just need to dig deeper into the ZMQ guide?

However, some hints in the right direction would be great :)

Cheers

Recommended Answer

I'm building a job/task distributor that works the same as your crawler, in principle at least. Here are a few things I've learned:

Communication between the server and the crawlers will be based on different things happening in your system, such as dispatching work from server to crawler, or a crawler sending a heartbeat message to the server. Define the system's event types; they are the use cases:

DISPATCH_WORK_TO_CRAWLER_EVENT
CRAWLER_NODE_STATUS_EVENT
...



Define a Message Standard

All communication between server and crawlers should be done using ZMsg's, so define a standard that organizes your frames, something like this:

Frame1: "Crawler v1.0"             //this is a static header
Frame2: <event type>               //ex: "CRAWLER_NODE_STATUS_EVENT"
Frame3: <content xml/json/binary>  //content that applies to this event (if any)

Now you can create message validators to validate ZMsgs received between peers, since you have a standard convention all messages must follow.
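As a rough sketch of such a validator (assuming a jzmq-style Java API like the snippets later in this answer; the class name and the set of accepted event types are only illustrative):

import java.util.Set;
import org.zeromq.ZMsg;

//Sketch of a validator for the 3-frame standard above; assumes any ROUTER
//identity frame has already been stripped by the caller.
public class MessageValidator {
    static final String HEADER = "Crawler v1.0";
    static final Set<String> EVENTS = Set.of(
            "DISPATCH_WORK_TO_CRAWLER_EVENT",
            "CRAWLER_NODE_STATUS_EVENT");

    static boolean isValid(ZMsg msg) {
        if (msg == null || msg.size() < 2) return false;   //header + event type are mandatory
        String header = msg.popString();                     //Frame1: static header
        String event  = msg.popString();                     //Frame2: event type
        return HEADER.equals(header) && EVENTS.contains(event);
        //Frame3 (content), if present, stays on the message for the handler
    }
}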

Use a single ROUTER socket on the server for asynchronous and bidirectional communication with the crawlers. Also, use a PUB socket for broadcasting heartbeat messages.

Don't block on the ROUTER socket; use a POLLER to loop every 5 seconds or so. This allows the server to do other things periodically, like broadcasting heartbeat events to the crawlers; something like this:

Socket rtr = .. //ZMQ.ROUTER
Socket pub = .. //ZMQ.PUB
ZMQ.Poller poller = new ZMQ.Poller(1)
poller.register( rtr, ZMQ.Poller.POLLIN)   //only the ROUTER receives; the PUB socket is send-only

  while (true) {
     ZMsg msg = null
     poller.poll(5000)

     if( poller.pollin(0)){
        //messages from crawlers
        msg = ZMsg.recvMsg(rtr)
     }

     //send heartbeat messages
     ZMsg heartbeatMsg = ...
     //create message content here,
     //publish to all crawlers
     heartbeatMsg.send(pub)
  }
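The "create message content here" placeholder could be filled along these lines, reusing the frame standard from above; the "HEARTBEAT_EVENT" name and the timestamp payload are just assumptions for illustration:

ZMsg heartbeatMsg = new ZMsg();
heartbeatMsg.addString("Crawler v1.0");                            //Frame1: static header
heartbeatMsg.addString("HEARTBEAT_EVENT");                         //Frame2: event type (illustrative name)
heartbeatMsg.addString(Long.toString(System.currentTimeMillis())); //Frame3: payload, e.g. a timestamp
heartbeatMsg.send(pub);                                            //broadcast to every subscribed crawler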

To address your question about worker awareness, a simple and effective method uses a FIFO stack along with the heartbeat messages; something like this:


  • the server maintains a simple FIFO stack in memory
  • the server sends out heartbeats; crawlers respond with their node name; the ROUTER automatically puts the address of the node in the message as well (read up on message enveloping)
  • push one object onto the stack containing the node name and node address
  • when the server wants to dispatch work to a crawler, just pop the next object from the stack, create the message and address it properly (using the node address), and off it goes to that worker
  • dispatch more work to other crawlers the same way; when a crawler responds back to the server, just push another object with its node name/address back onto the stack; the other workers won't be available until they respond, so we don't bother them

This is a simple but effective method of distributing work based on worker availability instead of blindly sending out work. Check the lbbroker.php example; the concept is the same.
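A minimal sketch of that FIFO idea, assuming the same ROUTER socket and frame standard as above (the WorkerQueue class and its method names are invented for illustration):

import java.util.ArrayDeque;
import java.util.Deque;
import org.zeromq.ZFrame;
import org.zeromq.ZMsg;
import org.zeromq.ZMQ.Socket;

//Tracks crawlers that answered the last heartbeat, in FIFO order.
class WorkerQueue {
    private final Deque<ZFrame> available = new ArrayDeque<>();

    //Called for each heartbeat reply read from the ROUTER socket;
    //frame 0 is the crawler's address, added automatically by ROUTER enveloping.
    void workerReady(ZMsg reply) {
        available.addLast(reply.pop());
    }

    //Pops the next free crawler and sends it one URL to crawl.
    //Returns false if no crawler is currently available.
    boolean dispatch(Socket router, String url) {
        ZFrame address = available.pollFirst();
        if (address == null) return false;
        ZMsg work = new ZMsg();
        work.add(address);                                 //ROUTER strips this frame and routes on it
        work.addString("Crawler v1.0");                    //Frame1: static header
        work.addString("DISPATCH_WORK_TO_CRAWLER_EVENT");  //Frame2: event type
        work.addString(url);                               //Frame3: content
        return work.send(router);
    }
}

When the crawler's next status reply comes back, the server simply calls workerReady() again, so each crawler only ever holds one outstanding job.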

The worker should use a single DEALER socket along with a SUB. The DEALER is the main socket for async communication, and the SUB subscribes to heartbeat messages from the server. When the worker receives a heartbeat message, it responds to the server on the DEALER socket.

Socket dlr = .. //ZMQ.DEALER
Socket sub = .. //ZMQ.SUB
ZMQ.Poller poller = new ZMQ.Poller(2)
poller.register( dlr, ZMQ.Poller.POLLIN)
poller.register( sub, ZMQ.Poller.POLLIN)

  while (true) {
     ZMsg msg = null
     poller.poll(5000)

     if( poller.pollin(0)){
        //message from server
        msg = ZMsg.recvMsg(dlr)
     }

     if( poller.pollin(1)){
        //heartbeat message from server
        msg = ZMsg.recvMsg(sub)
        //reply back with status
        ZMsg statusMsg = ...
        statusMsg.send(dlr)
     }
  }
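The statusMsg placeholder could be built the same way as the other messages, using the CRAWLER_NODE_STATUS_EVENT type defined earlier; the node name in the payload is just an example value:

ZMsg statusMsg = new ZMsg();
statusMsg.addString("Crawler v1.0");               //Frame1: static header
statusMsg.addString("CRAWLER_NODE_STATUS_EVENT");  //Frame2: event type
statusMsg.addString("crawler-node-01");            //Frame3: this worker's node name (example)
statusMsg.send(dlr);                               //the server's ROUTER adds this peer's address on receipt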

The rest you can figure out on your own. Work through the PHP examples, build stuff, break it, build more; it's the only way you'll learn!

Have fun, hope it helps!
