High availability singleton processor in Tomcat


Question

I have a job-processing analytic service working against an RDBMS that, due to the need for complex caching and cache-update logic, needs to be a singleton in a high-availability cluster. Jobs arrive as JMS messages (via ActiveMQ). It is part of an application hosted in an HA Tomcat cluster with a web front end.

The problem is, the service itself needs to be able to recover within seconds if the node where it is running fails. Failure could mean the system being down or just a slow CPU - i.e. if a node recovers after a CPU delay but processing has already been handed over, it must not continue.

From experience, what would be the most suitable solution here:

  • A database-based lock, with a lock check before each job starts (I cannot easily come up with a bulletproof scheme here - any suggestions?)
  • Some kind of Paxos-style algorithm? Do you know of any slim framework for this purpose, since the algorithm itself takes time to get right and then to QA?
  • Anything else?

I don't mind if failure recovery is slow, but I want to minimize the overhead for each job.
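The database-lock option above can be made reasonably robust with a lease column: each node tries to claim a lock row with a conditional UPDATE, and the claim succeeds only if the previous lease has expired or the node already holds it. A minimal in-memory sketch of those row semantics (the class and method names are illustrative, not from any framework):

```java
/**
 * In-memory model of a DB lease-lock row. In a real deployment this state
 * lives in a table and tryAcquire() is a single conditional UPDATE, e.g.:
 *   UPDATE job_lock SET owner = ?, expires_at = ?
 *   WHERE name = 'analytics' AND (owner = ? OR expires_at < now())
 * where "rows updated == 1" means the lease was taken.
 */
class LeaseLock {
    private String owner;          // node currently holding the lease
    private long expiresAtMillis;  // lease expiry timestamp

    /** Returns true if 'node' now holds the lease. */
    synchronized boolean tryAcquire(String node, long nowMillis, long leaseMillis) {
        boolean free = owner == null || nowMillis >= expiresAtMillis || node.equals(owner);
        if (free) {
            owner = node;
            expiresAtMillis = nowMillis + leaseMillis;
        }
        return free;
    }

    /** Who holds a still-valid lease at this instant, or null if expired. */
    synchronized String ownerAt(long nowMillis) {
        return nowMillis < expiresAtMillis ? owner : null;
    }
}
```

The per-job overhead is then one round trip for the conditional UPDATE; a failed node's lease simply times out, after which another node's claim succeeds.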

Some additional background: a job involves nothing more than reading data from the database, massaging it with various algorithms (somewhat resembling shortest-route searches) and writing back optimal solutions for different actors to act on. The actors interact with the real world and feed back results, based on which subsequent steps are optimized by the same job processor.

Answer

Solution Using Hazelcast

The Hazelcast locking method proposed by Tomasz works. You need to read the documentation carefully, use time-leased locks, and make sure your singleton is monitored so that it renews its leases. One thing to keep in mind is that Hazelcast was written to work in large clusters - as such, its startup time is relatively slow, 1 to 5 seconds even for two nodes. After that, though, operations are quite performant and obtaining the lock takes milliseconds. Normally none of this matters, but the failure/recovery cycle takes time and should be treated as an exceptional situation.
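The renewal pattern above can be sketched independently of the cluster library. With Hazelcast the renew step would re-take or extend the time-leased lock (the exact call depends on the Hazelcast version, so treat that as an assumption); here it is abstracted as a `BooleanSupplier` so the pattern is testable without a running cluster:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/**
 * Renewal loop around a time-leased distributed lock. The heartbeat must
 * run well inside the lease period; if renewal ever fails, the node must
 * stop processing immediately, because another node may take over.
 */
class SingletonGuard {
    private final BooleanSupplier renewLease;   // returns false when the lease is lost
    private final AtomicBoolean leader = new AtomicBoolean(false);

    SingletonGuard(BooleanSupplier renewLease) {
        this.renewLease = renewLease;
    }

    /** Called periodically, well before the lease expires. */
    void heartbeat() {
        // Renew first, then publish the result: the job loop sees a
        // consistent leader flag, never a lease that silently lapsed.
        leader.set(renewLease.getAsBoolean());
    }

    /** The job loop checks this before (and during) every job. */
    boolean isLeader() {
        return leader.get();
    }
}
```

The important property is that losing a single heartbeat demotes the node; a slow CPU that wakes up after its lease lapsed will see `isLeader() == false` and refuse to continue, matching the handover requirement from the question.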

There are limits to this solution being bulletproof. If the cluster is split (network disruption between nodes) but each node is alive and has access to the database, there is no way of knowing deterministically how to proceed. Ultimately, you need to think about a contingency plan here. In real life this scenario is very unlikely for a typical failover HA setup.

At the end of the day, before resorting to a solution with distributed locking, think hard about making your process not-so-singleton. It might still be hard to run certain things in parallel, but it may not be so hard to ensure the cache is never stale, or to find other ways to prevent database corruption. In my case, there is a database transaction counter working like an optimistic lock. The code reads it before making any decisions, then updates it with a conditional UPDATE ... WHERE, in both the database and the cache, inside the transaction that stores the result. In case of a discrepancy, the cache is purged and the operation repeated. This makes two nodes working in parallel impossibly slow, but it prevents data corruption. By storing additional data alongside the transaction counter you may be able to optimize cache-refresh strategies and gradually move towards parallel processing.
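The transaction-counter scheme just described can be sketched as follows. The store and cache are plain in-process objects here, standing in for the database table and the real cache; all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Optimistic-lock style guard: a global transaction counter is snapshotted
 * before the computation and checked again when the result is stored. In a
 * real system the check-and-bump is one conditional statement, e.g.
 *   UPDATE tx_counter SET value = value + 1 WHERE value = ?
 * and "rows updated == 1" means this node's commit won.
 */
class OptimisticProcessor {
    private final AtomicLong txCounter = new AtomicLong();
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Snapshot the counter before making any decisions. */
    long beginRead() {
        return txCounter.get();
    }

    /** Returns true if stored; false means another node won and the cache is purged. */
    boolean commit(long observedCounter, String key, String result) {
        if (txCounter.compareAndSet(observedCounter, observedCounter + 1)) {
            cache.put(key, result);
            return true;
        }
        cache.clear();   // stale view of the world: purge and let the caller retry
        return false;
    }

    String cached(String key) {
        return cache.get(key);
    }
}
```

Two nodes running this in parallel constantly invalidate each other (hence "impossibly slow"), but a stale node can never store a result computed from outdated data.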

This is how I would approach such a request next time:

  1. Try making your singletons survive working in parallel on different nodes
  2. Try again, maybe there is a way to orchestrate them
  3. Check if it is possible to use HASingleton or similar technology to avoid boilerplate
  4. Implement the Hazelcast solution as outlined above

It makes no sense to post code here, as the most time-consuming part is testing and verifying all the failure scenarios and contingency plans. There is almost no boilerplate; the code itself will always be solution-specific. It is possible to come up with a well-working PoC covering all the bases within a couple of days.
