Debugging JBoss 100% CPU usage

Question

Originally posted on Server Fault, where it was suggested that this question might be better asked here.

We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

At our biggest client, about once a month the JBoss Java process takes 100% of all CPUs. The machine running JBoss has 8 CPUs. Our web app is still accessible during this time; however, pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

The database machine and all the other machines are fine, only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

I have set up a test environment as close as possible to the client's production environment and have done load testing with as many as 2x the number of concurrent users. So far the test environment has not reproduced the problem.

Where do we go from here? How can we narrow down the problem?

Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred, to minimize downtime. Next time it happens they will get a developer to take a look. The question is: next time it happens, what can be done to determine the cause?

We could set up a separate JBoss instance on the same box and install the web app separately from the web service. This way, when the problem next occurs, we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much, though.

Should I enable JMX remote? This way, the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?

Is there another way to see which threads are eating the CPU and to get a stack trace to see what they are doing?

Any other ideas?

Thanks!

Recommended answer

I think you should definitely try to set up a test environment with some load testing in order to reproduce your issue. Profiling would definitely help in order to pinpoint the problem.
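If attaching a full profiler is not practical, a rough stand-in is to sample per-thread CPU time from inside the JVM with the standard `java.lang.management.ThreadMXBean`. The sketch below is only an illustration, not something from the original answer: the class name `CpuThreadSampler`, the sampling window, and the idea of calling it from inside the JBoss JVM (for example from a diagnostic servlet) are all assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks the JVM's own threads by CPU time used in a short window.
final class CpuThreadSampler {

    /** CPU time one thread consumed during the sampling window. */
    static final class Usage {
        final ThreadInfo info;   // name, state and stack trace of the thread
        final long cpuNanos;     // CPU nanoseconds used inside the window

        Usage(ThreadInfo info, long cpuNanos) {
            this.info = info;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Samples all live threads twice, windowMillis apart, and returns the topN consumers. */
    static List<Usage> sample(long windowMillis, int topN) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            return new ArrayList<>();            // this JVM cannot measure per-thread CPU
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }

        // First reading: CPU time per thread ID (-1 means the thread is gone).
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }

        Thread.sleep(windowMillis);

        // Second reading: keep threads that were alive at both points in time.
        List<Usage> result = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long now = mx.getThreadCpuTime(id);
            Long then = before.get(id);
            if (now < 0 || then == null || then < 0) {
                continue;
            }
            ThreadInfo info = mx.getThreadInfo(id, 20);   // up to 20 stack frames
            if (info != null) {
                result.add(new Usage(info, now - then));
            }
        }

        // Highest CPU consumer first.
        result.sort((a, b) -> Long.compare(b.cpuNanos, a.cpuNanos));
        return new ArrayList<>(result.subList(0, Math.min(topN, result.size())));
    }

    public static void main(String[] args) throws InterruptedException {
        for (Usage u : sample(1000, 5)) {
            System.out.printf("%d ms  %s%n", u.cpuNanos / 1_000_000, u.info.getThreadName());
            for (StackTraceElement frame : u.info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Run standalone, `main` only sees its own JVM's threads; to be useful for this problem the `sample` call would have to execute inside the JBoss process or be replaced by the same MXBean calls over a JMX connection.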

A quick way to gather data next time is to send the JBoss process a SIGQUIT with kill -3 in order to get a thread dump to analyze. Despite the command's name, this does not stop the process; the JVM prints the stack of every thread to its standard output and keeps running. The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also just run dstat to see what the process is doing during the lockup. But again, it is probably safer to set up a load-testing environment (via EC2 or so) to reproduce this.
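Once a dump exists, the busy threads still have to be located in it. On Linux, `top -H -p <pid>` lists per-thread CPU usage with decimal thread IDs, while HotSpot thread dumps of that era tag each thread header with the same ID in hex as `nid=0x...`. The following is a minimal sketch of that lookup, assuming this hex `nid` format (newer JDKs may print it differently); the class name `HotThreadFinder` and the sample thread names are made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds one thread's stack in a HotSpot thread dump by native thread ID.
final class HotThreadFinder {

    // Assumed header format: ... tid=0x... nid=0x1a2b runnable [...]
    private static final Pattern NID = Pattern.compile("\\bnid=0x([0-9a-fA-F]+)\\b");

    /** Converts a decimal thread ID (as shown by top -H) to the nid=0x... form. */
    static String toNid(long nativeThreadId) {
        return "nid=0x" + Long.toHexString(nativeThreadId);
    }

    /** Returns the header and stack lines of the thread with the given native ID, or an empty list. */
    static List<String> stackFor(List<String> dumpLines, long nativeThreadId) {
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dumpLines) {
            Matcher m = NID.matcher(line);
            if (m.find()) {
                // A thread header: start collecting only if it is the thread we want.
                inBlock = Long.parseLong(m.group(1), 16) == nativeThreadId;
            } else if (line.trim().isEmpty()) {
                // A blank line ends the current thread's block.
                inBlock = false;
            }
            if (inBlock) {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: HotThreadFinder <decimal-thread-id> <dump-file>");
            return;
        }
        long tid = Long.parseLong(args[0]);
        List<String> dump = Files.readAllLines(Paths.get(args[1]));
        System.out.println("Looking for " + toNid(tid));
        stackFor(dump, tid).forEach(System.out::println);
    }
}
```

Taking two or three dumps a few seconds apart and checking whether the same thread IDs stay on top with the same frames helps tell a tight loop from ordinary load.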
