Oracle row contention causing deadlock errors in a high-throughput JMS application


Question

Summary:

I would like to know the best practice for high-throughput applications in which bulk messages try to update the same row and hit Oracle deadlock errors. I understand these errors cannot be avoided entirely, but how do you recover from them gracefully without getting bogged down by the same deadlocks recurring over and over?

Details:

We are building a high-throughput JMS messaging application. The production environment will be two WebLogic 11g nodes, each running 6 MDB listener instances. We get Oracle deadlock errors (ORA-00060) when around 1000 messages all try to update the same row in the Oracle database. Synchronizing across nodes is not possible with the standard Java threading API, and unless there is no other solution we do not want to use third-party products such as Terracotta.

We were hoping that Oracle's "SELECT FOR UPDATE WAIT n" statement would help, because it makes the threads competing for the same row wait up to n seconds for the first thread (the one that locked the row first) to finish with it.

The first issue with "SELECT FOR UPDATE WAIT n" is that it does not accept wait times in milliseconds. This hurts our application's throughput, because a WAIT of 1 second (the smallest wait time) delays the messages.

Second, we are experimenting with the WebLogic queue redelivery delay parameter (30 seconds in our case). Whenever a thread bounces back because of a deadlock error, its message waits 30 seconds before being retried.

In our experience, 1000 competing messages often take forever to process because the deadlock keeps recurring.

I understand that with the current architecture we will get deadlock errors regardless (with 1000 competing messages), but the application should be resilient enough to recover from them by retrying the affected messages.

Any idea what we are missing here? Has anybody dealt with similar issues before?

I am looking for design ideas that would make this resilient, so that the application recovers from the deadlock situation and eventually processes all messages in a reasonable amount of time without much additional hardware.

COMPUTATION DETAILS: Each of these 1000 messages creates 4 objects of 4 different position types, each with an associated quantity. These quantities have to be merged into 4 different slots (depending on the position type). The deadlock happens when each thread updates those 4 slots. We already sort the individual updates into a specific order before applying them to the database rows, to avoid any possible race conditions.

Recommended answer

A deadlock implies that each thread is trying to update multiple rows in a single transaction and that those updates are being done in a different order across threads. The simplest possible answer, therefore, would be to modify the code so that the updates within the same transaction are applied in some defined order (e.g. in order of the primary key). That would ensure that you never get a deadlock, though you would still get blocking locks while one thread waits for another thread to commit its transaction.
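As a rough illustration of the ordering idea, the sketch below sorts a transaction's pending updates by primary key before issuing them. The `SlotUpdate` type, the `positions` table, and its `id` and `quantity` columns are hypothetical names chosen for this example, not taken from the question. It also assumes the caller (or the container) owns the transaction and commits it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One pending change to a row; rowId stands in for the primary key (hypothetical). */
final class SlotUpdate {
    final long rowId;
    final long quantityDelta;

    SlotUpdate(long rowId, long quantityDelta) {
        this.rowId = rowId;
        this.quantityDelta = quantityDelta;
    }
}

final class OrderedUpdates {
    // Hypothetical table and column names; the increment is done in SQL so that
    // no quantity read earlier by the application is written back.
    private static final String SQL =
            "UPDATE positions SET quantity = quantity + ? WHERE id = ?";

    /** Returns a copy of the updates sorted by primary key; the input list is not modified. */
    static List<SlotUpdate> inLockOrder(List<SlotUpdate> updates) {
        List<SlotUpdate> sorted = new ArrayList<>(updates);
        sorted.sort(Comparator.comparingLong((SlotUpdate u) -> u.rowId));
        return sorted;
    }

    /**
     * Applies the updates in primary-key order on the given connection.
     * Every transaction then acquires its row locks in the same order,
     * so threads may block on each other but cannot form a lock cycle.
     * The caller is assumed to commit or roll back.
     */
    static void apply(Connection conn, List<SlotUpdate> updates) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (SlotUpdate u : inLockOrder(updates)) {
                ps.setLong(1, u.quantityDelta);
                ps.setLong(2, u.rowId);
                ps.executeUpdate();
            }
        }
    }
}
```

The ordering only helps if every code path that touches these rows, on both nodes, uses the same sort key.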

Taking a step back, though, it seems unlikely that you would really want many threads updating the same row in a table when you can't predict the order of the updates. That seems highly likely to lead to lots of lost updates and some rather unpredictable behavior. What, exactly, is your application doing that would make this sort of thing sensible? Are you doing something like updating aggregate tables after inserting rows into a detail table (e.g. updating the count of views a post has, in addition to logging information about each particular view)? If so, do those operations really need to be synchronous? Or could you update the view count periodically in another thread by aggregating the views over the past N seconds?
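One way to picture the periodic approach is a small buffer that message handlers add to and a single background task drains every N seconds. The sketch below is only an in-memory stand-in: `DeltaBuffer` and its methods are hypothetical names, and a real system would more likely aggregate from the detail table so that nothing is lost if a node crashes between flushes.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Accumulates per-row quantity deltas so that one writer can apply them in bulk. */
final class DeltaBuffer {
    private final ConcurrentHashMap<Long, Long> pending = new ConcurrentHashMap<>();

    /** Called by each message handler; merge() is atomic per key. */
    void add(long rowId, long delta) {
        pending.merge(rowId, delta, Long::sum);
    }

    /**
     * Removes and returns everything accumulated so far, sorted by row id.
     * remove() is atomic, so a delta added concurrently either lands in this
     * snapshot or starts a fresh entry that the next drain picks up.
     */
    Map<Long, Long> drain() {
        Map<Long, Long> snapshot = new TreeMap<>();
        for (Long rowId : pending.keySet()) {
            Long sum = pending.remove(rowId);
            if (sum != null) {
                snapshot.put(rowId, sum);
            }
        }
        return snapshot;
    }
}
```

A single scheduled task per node could call `drain()` and issue one update per row for the summed delta, which turns roughly 1000 contended row updates into a handful.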
