How to stop a deadlock on one node from crashing the entire cluster?

Problem description

I'm running a three-node Galera Cluster under MariaDB. The application is written in PHP and uses the mysqli extension.

Very occasionally I get a deadlock on write. I'm working on improving my application to handle or avoid that kind of failure, but in the meantime I need the cluster to stay up when it happens.

The problem is that as soon as the deadlock occurs, not just one but all three nodes in the cluster crash. The node where the deadlock originates suffers the "MySQL server has gone away" error and, once max_connect_errors is reached, starts refusing connections permanently, so the server has to be restarted manually.

What I don't get is why the other nodes go down too. Both start erroring with "WSREP has not yet prepared node for application use", which means the entire application crashes because no database node is accepting connections.

How can I ensure that the rest of the cluster stays up when one node suffers a deadlock, however rare?

Update:

A month later, another deadlock caused a similar problem. Again, one node brought down everything.

The first connection gets a deadlock (at the commit phase), so the application tries to replay the transaction. The replay hangs for almost a minute and then fails again.

After the first connection fails to recover, all other connections start failing with (1205) "Lock wait timeout exceeded", rendering the entire cluster useless.

I should add that the application does not use locks. Whatever knot it tied itself into, it did so with nothing but regular transactional queries.

Recommended answer

I'm answering my own question because I've managed to avoid the crashes. However, I still have problems with secondary errors and have started a new thread with the specifics.

My recovery code now handles secondary errors differently. It retries a deadlock a couple of times, but only while the error is still a deadlock. If any other type of error occurs, the application gives up.
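The retry policy described above might look roughly like the following. This is only a minimal, language-neutral sketch in Python, not the actual recovery code from the PHP application. `DbError` and `run_with_deadlock_retry` are hypothetical names, and the `transaction` callable is assumed to issue the whole transaction from the beginning each time it is called. Error code 1213 is the server's deadlock error. Anything else, including the 1205 lock wait timeout mentioned in the question, is re-raised immediately.

```python
import time

ER_LOCK_DEADLOCK = 1213  # MySQL/MariaDB error code for "Deadlock found"


class DbError(Exception):
    """Hypothetical stand-in for a driver exception carrying the server errno."""

    def __init__(self, errno, message):
        super().__init__(message)
        self.errno = errno


def run_with_deadlock_retry(transaction, max_retries=2, delay=0.0):
    """Run transaction(); retry only while the failure is a deadlock.

    Any non-deadlock error, or a deadlock after max_retries retries,
    is re-raised so the caller can give up and report the failure.
    """
    attempt = 0
    while True:
        try:
            return transaction()
        except DbError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt >= max_retries:
                raise
            attempt += 1
            # Optional growing pause before replaying (assumption: 0 by default).
            time.sleep(delay * attempt)
```

The point of the design is that a retry is only attempted for the one error known to be transient. Once a different error shows up, replaying the transaction is assumed to do more harm than good, so the failure is surfaced to the user instead.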

Although this means some disappointed users receive errors, I haven't had a cluster crash since making this change, and I haven't seen the dreaded "server has gone away" error.
