What's the best way to run a gen_server on all nodes in an Erlang cluster?


Question


I'm building a monitoring tool in Erlang. When run on a cluster, it should run a set of data collection functions on all nodes and record that data using RRD on a single "recorder" node.

The current version has a supervisor running on the master node (rolf_node_sup) which attempts to run a 2nd supervisor on each node in the cluster (rolf_service_sup). Each of the on-node supervisors should then start and monitor a bunch of processes which send messages back to a gen_server on the master node (rolf_recorder).

This only works locally. No supervisor is started on any remote node. I use the following code to attempt to load the on-node supervisor from the recorder node:

    rpc:call(Node, supervisor, start_child, [{global, rolf_node_sup}, [Services]])

I've found a couple of people suggesting that supervisors are really only designed for local processes.

What is the most OTP way to implement my requirement to have supervised code running on all nodes in a cluster?

  • A distributed application is suggested as one alternative to a distributed supervisor tree. These don't fit my use case: they provide for failover between nodes, but not for keeping code running on a set of nodes.
  • The pool module is interesting. However, it provides for running a job on the node which is currently the least loaded, rather than on all nodes.
  • Alternatively, I could create a set of supervised "proxy" processes (one per node) on the master which use proc_lib:spawn_link to start a supervisor on each node. If something goes wrong on a node, the proxy process should die and then be restarted by its supervisor, which in turn should restart the remote processes. The slave module could be very useful here (see the sketch after this list).
  • Or maybe I'm overcomplicating this. Is directly supervising nodes a bad idea? Perhaps I should instead architect the application to gather data in a more loosely coupled way: build a cluster by running the app on multiple nodes, tell one to be the master, and leave it at that!
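
For illustration, a minimal sketch of that proxy approach might look like this (the module is hypothetical, and rolf_service_sup:start_link/1 is assumed to exist on every node):

    -module(rolf_node_proxy).
    -export([start_link/2, init/2]).

    %% One proxy per node, supervised on the master. The proxy starts
    %% the per-node supervisor remotely and links to it, so a failure
    %% on either side propagates and the master-side supervisor can
    %% restart the pair.
    start_link(Node, Services) ->
        {ok, proc_lib:spawn_link(?MODULE, init, [Node, Services])}.

    init(Node, Services) ->
        {ok, Sup} = rpc:call(Node, rolf_service_sup, start_link, [Services]),
        true = link(Sup),
        wait(Sup).

    %% Block until told to stop; if the remote supervisor dies, the
    %% exit signal arrives over the link and kills this proxy instead.
    wait(Sup) ->
        receive
            stop -> exit(Sup, shutdown)
        end.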

Some requirements:

  • The architecture should be able to cope with nodes joining and leaving the pool without manual intervention.
  • I'd like to build a single-master solution, at least initially, for the sake of simplicity.
  • I would prefer to use existing OTP facilities over hand-rolled code in my implementation.

Solution

This is an interesting challenge, and there are multiple solutions to it. What follows are just my suggestions, which will hopefully help you make a better-informed choice about how to write your program.

As I understand your program, you want to have one master node where you start your application, which will then start the Erlang VMs on the other nodes in the cluster. The pool module uses the slave module to do this, which requires key-based ssh communication in both directions. It also requires that you have working DNS.
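
As a rough illustration of that mechanism (the hostnames and the contents of the .hosts.erlang file are assumptions):

    %% pool:start/1 boots one slave VM per host listed in .hosts.erlang;
    %% slave:start/2 starts a single slave by hand.
    1> pool:start(rolf).
    ['rolf@host1','rolf@host2']
    2> slave:start(host3, rolf).
    {ok,'rolf@host3'}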

A drawback of slave is that if the master dies, so do the slaves. This is by design, as it probably fits the original use case perfectly; in your case, however, it may be a poor fit (you might want to keep collecting data even when the master is down, for example).

As for OTP applications, every node can run the same application. In your code you can determine each node's role in the cluster using configuration or discovery.
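
A minimal sketch of the configuration approach, assuming the application is named rolf and each node's sys.config sets a role entry such as [{rolf, [{role, master}]}]:

    %% Returns this node's role; defaults to slave when unset.
    role() ->
        case application:get_env(rolf, role) of
            {ok, master} -> master;
            _            -> slave
        end.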

I would suggest starting the Erlang VMs using some OS facility, daemontools, or similar. Every VM would start the same application, with one started as the master and the rest as slaves. This has the drawback of making it harder to "automatically" run the software on machines coming up in the cluster, as you could with slave, but it is also much more robust.
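
For example, each machine's init script might run something like the following (the node names, cookie, and config file names are placeholders):

    erl -detached -name rolf@master.example.com -setcookie SECRET \
        -config master -eval "application:start(rolf)"
    erl -detached -name rolf@collector1.example.com -setcookie SECRET \
        -config slave -eval "application:start(rolf)"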

In every application you could have a suitable supervision tree based on the role of the node. Removing inter-node supervision and spawning makes the system much simpler.
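
A sketch of such a top-level supervisor, reusing the role/0 helper above (the child module names are assumptions):

    init([]) ->
        Services = [{rolf_service_sup,
                     {rolf_service_sup, start_link, []},
                     permanent, infinity, supervisor, [rolf_service_sup]}],
        Recorder = [{rolf_recorder,
                     {rolf_recorder, start_link, []},
                     permanent, 5000, worker, [rolf_recorder]}],
        Children = case role() of
                       master -> Recorder ++ Services;  %% the master collects too
                       slave  -> Services
                   end,
        {ok, {{one_for_one, 5, 10}, Children}}.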

I would also suggest having all the nodes push to the master. That way the master does not really need to care about what is going on in the slaves; it can even ignore the fact that a node is down. This also allows new nodes to be added without any change to the master. The cookie could be used for authentication. Multiple masters or "recorders" would also be relatively easy.
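
Sketched out, the push model can be very simple (this assumes the nodes are connected and share a cookie; rolf_recorder is taken from your question, the rest are assumptions):

    %% In rolf_recorder, on the master: register globally so any
    %% node in the cluster can reach it by name.
    start_link() ->
        gen_server:start_link({global, rolf_recorder}, ?MODULE, [], []).

    %% In a collector, on any node: fire-and-forget, so a dead or
    %% unreachable master never blocks data collection.
    report(Data) ->
        gen_server:cast({global, rolf_recorder}, {sample, node(), Data}).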

The "slave" nodes, however, will need to watch for the master going down and coming up, and take appropriate action, such as storing the monitoring data so it can be sent once the master is back up.
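
One way to sketch that watcher (the module name, message shapes, and the idea of routing samples through it are all assumptions; gen_server:cast/2 never fails, which makes the flush safe even if the recorder has not re-registered yet):

    -module(rolf_master_watch).
    -behaviour(gen_server).
    -export([start_link/1, report/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    -record(state, {master, buffer = []}).

    start_link(Master) ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [Master], []).

    %% Collectors report through the watcher instead of casting to
    %% the recorder directly.
    report(Data) ->
        gen_server:cast(?MODULE, {sample, Data}).

    init([Master]) ->
        ok = net_kernel:monitor_nodes(true),  %% {nodeup,N} / {nodedown,N}
        {ok, #state{master = Master}}.

    handle_cast({sample, Data}, S = #state{master = M, buffer = B}) ->
        case lists:member(M, nodes()) of
            true ->
                gen_server:cast({global, rolf_recorder}, {sample, node(), Data}),
                {noreply, S};
            false ->
                %% Master unreachable: hold on to the sample.
                {noreply, S#state{buffer = [Data | B]}}
        end.

    handle_info({nodeup, M}, S = #state{master = M, buffer = B}) ->
        %% Master is back: flush buffered samples in arrival order.
        [gen_server:cast({global, rolf_recorder}, {sample, node(), D})
         || D <- lists:reverse(B)],
        {noreply, S#state{buffer = []}};
    handle_info(_Other, S) ->
        {noreply, S}.

    handle_call(_Req, _From, S) -> {reply, ok, S}.
    terminate(_Reason, _S) -> ok.
    code_change(_OldVsn, S, _Extra) -> {ok, S}.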
