How to stop a deadlock on one node from crashing the entire cluster?

Question

I'm running a three-node Galera Cluster under MariaDB. The application is written in PHP and uses the mysqli extension.

Very occasionally I get a deadlock on write. I'm working on improving my application to handle or avoid that kind of failure, but in the meantime I need the cluster to stay up when this happens.

The problem is that as soon as the deadlock occurs, not just one but all three nodes in the cluster crash. The node where the deadlock originates suffers the "MySQL server has gone away" error and, once max_connect_errors is reached, starts refusing connections permanently, so the server has to be restarted manually.

What I don't get is why the other nodes go down too. They both start erroring with "WSREP has not yet prepared node for application use", which means the entire application crashes with no database node accepting connections.

How can I ensure that the rest of the cluster stays up when one node suffers a deadlock, however rare?

Update:

A month later, another deadlock caused a similar problem. Again, one node brought down everything.

The first connection gets a deadlock (at the commit phase), so the application tries to replay the transaction. The replay hangs for almost a minute and then fails again.

After the first connection fails to recover, all other connections start failing with (1205) "Lock wait timeout exceeded", rendering the entire cluster useless.

I should add that the application does not use locks. However it got itself tied in a knot, it did so with nothing more than regular transactional queries.

Recommended answer

I'm answering my own question as I've managed to avoid the crashes. However, I still have problems with secondary errors and have started a new thread with the specifics.

My recovery code now handles secondary errors differently. It will retry a deadlocked transaction a couple of times, but only while the error is still a deadlock. If any other type of error occurs, the application gives up.
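To illustrate the idea, here is a minimal, driver-neutral sketch of that retry policy in Python rather than the PHP my application actually uses. It is not my production code. `DbError`, `run_with_deadlock_retry`, `max_retries` and `backoff_seconds` are hypothetical names invented for this example; the only real-world fact it relies on is that MySQL and MariaDB report a deadlock as error code 1213.

```python
import time

ER_LOCK_DEADLOCK = 1213  # MySQL/MariaDB error code for a deadlock


class DbError(Exception):
    """Hypothetical stand-in for a driver exception that carries an error code."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_with_deadlock_retry(transaction, max_retries=2, backoff_seconds=0.0):
    """Run transaction(); retry only while the failure is a deadlock.

    Any other error, or a deadlock once the retry budget is spent,
    is re-raised so the caller can give up and report it.
    """
    attempt = 0
    while True:
        try:
            return transaction()
        except DbError as exc:
            # Give up immediately on anything that is not a deadlock
            # (for example 1205, lock wait timeout).
            if exc.errno != ER_LOCK_DEADLOCK or attempt >= max_retries:
                raise
            attempt += 1
            # Optional pause before the next attempt (zero by default).
            time.sleep(backoff_seconds * attempt)
```

Here `transaction` would be a callable that begins, runs and commits the whole transaction, so each retry starts from scratch. In PHP the same check would presumably be made against the error number reported by mysqli.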

Although this means some disappointed users receive errors, I haven't had a cluster crash since making this change and haven't seen the dreaded "server has gone away" error.
