Debugging JBoss 100% CPU usage

Question

Originally posted on Server Fault, where it was suggested this question might be better asked here.

We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

At our biggest client, about once a month the JBoss Java process takes 100% of all CPUs. The machine running JBoss has 8 CPUs. Our web app is still accessible during this time; however, pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

The database machine and all the other machines are fine; only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

I have set up a test environment as close as possible to the client's production environment, and I've done load testing with as much as 2x the number of concurrent users. So far I have not been able to reproduce the problem in my test environment.

Where do we go from here? How can we narrow down the problem?

Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred, to minimize downtime. Next time it happens they will get a developer to take a look. The question is, next time it happens, what can be done to determine the cause?

We could set up a separate JBoss instance on the same box and install the web app separately from the web service. This way, when the problem next occurs, we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much, though.

Should I enable JMX remote? This way, the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?

Is there another way to see which threads are eating the CPU and to get a stack trace to see what they are doing?

Any other ideas?

Thanks!

Recommended answer

I think you should definitely try to set up a test environment with some load testing in order to reproduce your issue. Profiling would definitely help to pinpoint the problem.

A quick fix would be, next time it happens, to send the JBoss process kill -3 in order to get a thread dump to analyze (the signal makes the JVM print the dump to its standard output; it does not stop the process). The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also just run dstat to see what the process is doing during the lockup. But again, it is probably safer to set up a load testing environment (via EC2 or so) to reproduce this.
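
A thread dump shows what each thread is doing, but it does not necessarily show which threads are burning the CPU. As a complement, here is a minimal sketch that samples per-thread CPU time through the standard java.lang.management.ThreadMXBean and prints the stack traces of the busiest threads. It is not part of the original answer: the class name HotThreads, the Sample type, and the window and top-N values are hypothetical, it assumes Java 8 or later, and it only sees the JVM it runs in, so it would have to be invoked from inside the JBoss process (for example from a small diagnostic page you deploy yourself).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks the threads of the current JVM by CPU used in a short window.
public class HotThreads {

    /** CPU time one thread consumed during the sampling window. */
    public static final class Sample {
        public final long threadId;
        public final long cpuNanos;

        Sample(long threadId, long cpuNanos) {
            this.threadId = threadId;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Measures CPU time per thread over windowMillis, busiest thread first. */
    public static List<Sample> sample(long windowMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // First reading: CPU time so far for every live thread (-1 if unavailable).
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }

        Thread.sleep(windowMillis);

        // Second reading: keep only threads that were measurable at both points.
        List<Sample> result = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long after = mx.getThreadCpuTime(id);
            Long start = before.get(id);
            if (after < 0 || start == null || start < 0) {
                continue; // thread is new, has died, or CPU timing is disabled
            }
            result.add(new Sample(id, after - start));
        }
        result.sort(Comparator.comparingLong((Sample s) -> s.cpuNanos).reversed());
        return result;
    }

    /** Formats the topN busiest threads with their current stack traces. */
    public static String report(int topN, long windowMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Sample> samples = sample(windowMillis);
        StringBuilder out = new StringBuilder();
        for (Sample s : samples.subList(0, Math.min(topN, samples.size()))) {
            ThreadInfo info = mx.getThreadInfo(s.threadId, 50); // at most 50 frames
            if (info == null) {
                continue; // thread ended after sampling
            }
            out.append(String.format("%s (id=%d) used %d ms CPU\n",
                    info.getThreadName(), s.threadId, s.cpuNanos / 1_000_000));
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Example: the five busiest threads over a one-second window.
        System.out.print(report(5, 1000));
    }
}
```

Sampling over a window rather than reading the cumulative CPU time once is a deliberate choice: long-lived threads can have a large total without being the ones spinning right now. The same ThreadMXBean is also what a JMX client such as VisualVM reads, so this gives roughly the same view without opening a remote JMX port.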
