Middleware to build data-gathering and monitoring for a distributed system

Question

I am currently looking for a good middleware on which to build a monitoring and maintenance system. We are tasked with monitoring, gathering data from, and maintaining a distributed system of up to 10,000 individual nodes.

The system is clustered into groups of 5-20 nodes. Each group produces data (as a team) by processing incoming sensor data. Each group has a dedicated node (blue boxes) that acts as a facade/proxy for the group, exposing the group's data and state to the outside world. The clusters are geographically separated and may connect to the outside world over different networks (one may run over fiber, another over 3G/satellite). We are likely to experience both shorter (seconds/minutes) and longer (hours) outages. Each cluster persists its data locally.
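To make the outage scenario concrete, here is a minimal sketch of a store-and-forward buffer a proxy node could use: messages queue up while the uplink is down and are flushed in order once it returns. The class name `OutageBuffer`, the `send` callback and the `max_pending` limit are assumptions made for illustration, not part of any particular middleware.

```python
from collections import deque


class OutageBuffer:
    """Hold messages while the uplink is down, then flush them in order."""

    def __init__(self, send, max_pending=10_000):
        # send is a hypothetical callable(msg) -> bool, True on success.
        self._send = send
        # Bounded queue: when full, the oldest message is dropped.
        self._pending = deque(maxlen=max_pending)

    def publish(self, msg):
        """Queue a message and try to deliver everything queued so far."""
        self._pending.append(msg)
        return self.flush()

    def flush(self):
        """Send queued messages oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # link still down, retry on the next flush
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

An in-memory queue like this only covers short outages. For outages lasting hours, the data each cluster already persists locally would have to serve as the replay source.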

This data needs to be collected (continuously and reliably) by external, centralized server(s) (green boxes) for further processing, analysis and viewing by various clients (orange boxes). We also need to monitor the state of all nodes through each group's proxy node. Monitoring each node directly is not required, although it would be good if the middleware could support it (handling heartbeat/state messages from ~10,000 nodes). If a proxy fails, other methods are available to pinpoint individual nodes.
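As a rough illustration of the heartbeat/state monitoring described above, the sketch below records the last heartbeat per node and reports nodes that have been silent longer than a timeout. The names (`HeartbeatMonitor`, `beat`, `silent_nodes`), the node id format and the 30-second default are all assumptions. The transport that delivers the heartbeats is deliberately left out.

```python
import time


class HeartbeatMonitor:
    """Remember when each node was last heard from."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self._timeout = timeout_s
        self._clock = clock      # injectable so the logic can be tested
        self._last_seen = {}     # node id -> time of last heartbeat

    def beat(self, node_id):
        """Record a heartbeat received from (or on behalf of) a node."""
        self._last_seen[node_id] = self._clock()

    def silent_nodes(self):
        """Return ids of nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A dictionary keyed by node id stays small even at ~10,000 nodes, so the same structure would work whether heartbeats arrive per node or aggregated by each proxy.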

Furthermore, we need to be able to interact with each node to tweak settings etc., but that seems easier to solve, since it is mostly handled manually, per node, when needed. Some batch tweaking may be needed, but all in all it looks like a standard RPC situation (Web Service or similar). Of course, if the middleware can handle this too, via some request/response mechanism, that would be a plus.
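For the RPC-style settings tweaks, a node-side handler might look roughly like the sketch below, which uses the JSON-RPC 2.0 message shape. The setting name `sample_rate_hz` and the methods `get_setting` and `set_setting` are hypothetical. Only the envelope fields and the two error codes come from the JSON-RPC 2.0 specification.

```python
import json

# Hypothetical node settings, for illustration only.
SETTINGS = {"sample_rate_hz": 10}


def get_setting(name):
    return SETTINGS[name]


def set_setting(name, value):
    if name not in SETTINGS:
        raise KeyError(name)
    SETTINGS[name] = value
    return SETTINGS[name]


METHODS = {"get_setting": get_setting, "set_setting": set_setting}


def handle(raw):
    """Dispatch one JSON-RPC 2.0 request string and return the reply string."""
    req = json.loads(raw)
    reply = {"jsonrpc": "2.0", "id": req.get("id")}
    method = METHODS.get(req.get("method"))
    if method is None:
        reply["error"] = {"code": -32601, "message": "Method not found"}
    else:
        try:
            # This sketch only supports by-name (object) params.
            reply["result"] = method(**req.get("params", {}))
        except (KeyError, TypeError):
            reply["error"] = {"code": -32602, "message": "Invalid params"}
    return json.dumps(reply)
```

Batch tweaking would then amount to sending the same request to a list of nodes and collecting the replies, over whatever transport is chosen.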

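Requirements:

  • 1000+ nodes publishing/offering continuous data
  • The data needs to be gathered reliably (in some way) and continuously to one or more servers. This will likely be built on top of the middleware, using some kind of explicit request/response to ask for lost data. If the middleware could handle this automatically, that is of course a plus.
  • More than one server/subscriber needs to be able to connect to the same data producer/publisher and receive the same data
  • The data rate is at most 10-20 messages per second per group
  • Message sizes range from roughly 100 bytes to 4-5 KB
  • Nodes range from constrained embedded systems to normal COTS Linux/Windows boxes
  • Nodes generally use C/C++; servers and clients generally use C++/C#
  • Nodes should (preferably) not need to install additional software or servers, i.e. a dedicated broker or extra service per node is expensive
  • Security will be message-based, i.e. no transport security is needed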
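The "explicit request/response to ask for lost data" requirement is often approached by numbering messages per publisher and watching for gaps on the receiving side. The sketch below assumes each publisher attaches a monotonically increasing sequence number, which the question does not state. The `GapDetector` class and its `observe` method are likewise hypothetical names.

```python
class GapDetector:
    """Track per-publisher sequence numbers and report missing ranges."""

    def __init__(self):
        self._next = {}  # publisher id -> next expected sequence number

    def observe(self, publisher, seq):
        """Return the inclusive (first, last) range to re-request, or None."""
        # The first message seen from a publisher sets the baseline.
        expected = self._next.get(publisher, seq)
        if seq < expected:
            return None  # duplicate or replayed message, nothing new is missing
        self._next[publisher] = seq + 1
        if seq > expected:
            return (expected, seq - 1)
        return None
```

A server would call `observe` for every incoming message and, whenever a range comes back, send a resend request for that range to the proxy that holds the locally persisted data.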

We are looking for a solution that can handle the communication primarily between proxy nodes (blue) and servers (green) for data publishing/polling/downloading, and from clients (orange) to individual nodes (RPC style) for tweaking settings.

There seem to be many discussions and recommendations for the reverse situation, distributing data from server(s) to many clients, but it has been harder to find information on the situation described here. The general solution seems to be to use SNMP, Nagios, Ganglia etc. to monitor and modify a large number of nodes, but the tricky part for us is the data gathering.

We have briefly looked at solutions like DDS, ZeroMQ, RabbitMQ (broker needed on all nodes?), SNMP, various monitoring tools, Web Services (JSON-RPC, REST/Protocol Buffers) etc.

So, do you have any recommendations for an easy-to-use, robust, stable, light, cross-platform, cross-language middleware (or other) solution that would fit the bill? As simple as possible but not simpler.

Recommended answer

It seems ZeroMQ will fit the bill easily, with no central infrastructure to manage. Since your monitoring servers are fixed, it's really quite a simple problem to solve. This section in the 0MQ Guide may help:

http://zguide.zeromq.org/page:all#Distributed-Logging-and-Monitoring

You mention "reliability", but could you specify the actual set of failures you want to recover from? If you are using TCP, then the network is by definition "reliable" already.
