Hung JVM consuming 100% CPU

Problem Description

We have a Java server running on Sun JRE 6u20 on Linux 32-bit (CentOS). We use the HotSpot Server VM with the CMS collector and the following options (I've only provided the relevant ones):

-Xmx896m -Xss128k -XX:NewSize=384M -XX:MaxPermSize=96m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC

Sometimes, after running for a while, the JVM seems to slip into a hung state: even though we make no requests to the application, the CPU continues to spin at 100% (we have 8 logical CPUs, so it looks like only one CPU is spinning). In this state the JVM doesn't respond to SIGQUIT (kill -3) and we can't connect to it normally with jstack. We CAN connect with "jstack -F", but the output is dodgy (we see lots of NullPointerExceptions from jstack, apparently because it wasn't able to 'walk' some stacks), so the "jstack -F" output seems to be useless.

We did get a stack dump via gdb, though, and we were able to match the thread ID spinning the CPU (found using top's per-thread view, the "H" option) with a thread stack in the gdb output. This is what it looks like:

Thread 443 (Thread 0x7e5b90 (LWP 26310)):
#0  0x0115ebd3 in CompactibleFreeListSpace::block_size(HeapWord const*) const () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#1  0x01160ff9 in CompactibleFreeListSpace::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#2  0x0123456c in Generation::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#3  0x01229b2c in GenCollectedHeap::prepare_for_compaction() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#4  0x0122a7fc in GenMarkSweep::invoke_at_safepoint(int, ReferenceProcessor*, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#5  0x01186024 in CMSCollector::do_compaction_work(bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#6  0x011859ee in CMSCollector::acquire_control_and_collect(bool, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#7  0x01185705 in ConcurrentMarkSweepGeneration::collect(bool, bool, unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#8  0x01227f53 in GenCollectedHeap::do_collection(bool, bool, unsigned int, bool, int) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#9  0x0115c7b5 in GenCollectorPolicy::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#10 0x0122859c in GenCollectedHeap::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#11 0x0158a8ce in VM_GenCollectForAllocation::doit() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#12 0x015987e6 in VM_Operation::evaluate() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#13 0x01597c93 in VMThread::evaluate_operation(VM_Operation*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#14 0x01597f0f in VMThread::loop() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#15 0x015979f0 in VMThread::run() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#16 0x0145c24e in java_start(Thread*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#17 0x00ccd46b in start_thread () from /lib/libpthread.so.0
#18 0x00bc2dbe in clone () from /lib/libc.so.6
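
For reference, a minimal sketch of the matching procedure described above; <pid> is the Java process ID (26310 is the LWP of the spinning thread from the trace, not the process):

# Show per-thread CPU usage; the PID column in thread mode is the LWP.
top -H -p <pid>

# When jstack works, convert the spinning thread's LWP to hex and look
# for the matching "nid" in the jstack output:
printf '0x%x\n' 26310        # -> 0x66c6
jstack <pid> | grep -i nid=0x66c6

# When jstack hangs (as it did here), attach gdb and dump all native stacks:
gdb -p <pid> -batch -ex 'thread apply all bt'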

It seems that a JVM thread is spinning while doing some CMS-related work. We have checked the memory usage on the box; there seems to be enough memory available and the system is not swapping. Has anyone come across such a situation? Does it look like a JVM bug?
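
For reference, the memory/swap check amounts to standard commands along these lines:

# Free/used memory and swap at a glance:
free -m
# Live swap activity; the si/so columns should stay at 0 when not swapping:
vmstat 1 5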

UPDATE

I've obtained some more information about this problem (it happened again on a server that had been running for more than 7 days). When the JVM entered the "hung" state it stayed like that for 2 hours, until the server was manually restarted. We have obtained a core dump of the process and the GC log. We tried to get a heap dump as well, but "jmap" failed. We tried "jmap -F", but only a 4 MB file was written before the program aborted with an exception (something about a memory location not being accessible). So far I think the most interesting information comes from the GC log. It seems that the GC logging stopped as well (possibly at the time when the VM thread went into the long loop):

657501.199: [Full GC (System) 657501.199: [CMS: 400352K->313412K(524288K), 2.4024120 secs] 660634K->313412K(878208K), [CMS Perm : 29455K->29320K(68568K)], 2.4026470 secs] [Times: user=2.39 sys=0.01, real=2.40 secs] 
657513.941: [GC 657513.941: [ParNew: 314624K->13999K(353920K), 0.0228180 secs] 628036K->327412K(878208K), 0.0230510 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 
657523.772: [GC 657523.772: [ParNew: 328623K->17110K(353920K), 0.0244910 secs] 642036K->330523K(878208K), 0.0247140 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 
657535.473: [GC 657535.473: [ParNew: 331734K->20282K(353920K), 0.0259480 secs] 645147K->333695K(878208K), 0.0261670 secs] [Times: user=0.11 sys=0.00, real=0.02 secs] 
....
....
688346.765: [GC [1 CMS-initial-mark: 485248K(524288K)] 515694K(878208K), 0.0343730 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 
688346.800: [CMS-concurrent-mark-start]
688347.964: [CMS-concurrent-mark: 1.083/1.164 secs] [Times: user=2.52 sys=0.09, real=1.16 secs] 
688347.964: [CMS-concurrent-preclean-start]
688347.969: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 
688347.969: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 688352.986: [CMS-concurrent-abortable-preclean: 2.351/5.017 secs] [Times: user=3.83 sys=0.38, real=5.01 secs] 
688352.987: [GC[YG occupancy: 297806 K (353920 K)]688352.987: [Rescan (parallel) , 0.1815250 secs]688353.169: [weak refs processing, 0.0312660 secs] [1 CMS-remark: 485248K(524288K)] 783055K(878208K), 0.2131580 secs] [Times: user=1.13 sys=0.00, real=0.22 secs] 
688353.201: [CMS-concurrent-sweep-start]
688353.903: [CMS-concurrent-sweep: 0.660/0.702 secs] [Times: user=0.91 sys=0.07, real=0.70 secs] 
688353.903: [CMS-concurrent-reset-start]
688353.912: [CMS-concurrent-reset: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
688354.243: [GC 688354.243: [ParNew: 344928K->30151K(353920K), 0.0305020 secs] 681955K->368044K(878208K), 0.0308880 secs] [Times: user=0.15 sys=0.00, real=0.03 secs]
....
....
688943.029: [GC 688943.029: [ParNew: 336531K->17143K(353920K), 0.0237360 secs] 813250K->494327K(878208K), 0.0241260 secs] [Times: user=0.10 sys=0.00, real=0.03 secs] 
688950.620: [GC 688950.620: [ParNew: 331767K->22442K(353920K), 0.0344110 secs] 808951K->499996K(878208K), 0.0347690 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] 
688956.596: [GC 688956.596: [ParNew: 337064K->37809K(353920K), 0.0488170 secs] 814618K->515896K(878208K), 0.0491550 secs] [Times: user=0.18 sys=0.04, real=0.05 secs] 
688961.470: [GC 688961.471: [ParNew (promotion failed): 352433K->332183K(353920K), 0.1862520 secs]688961.657: [CMS

I suspect this problem has something to do with the last line in the log (I've added some "...." to skip lines that were not interesting). The fact that the server stayed in the hung state for 2 hours (probably trying to GC and compact the old generation) seems quite strange to me. Also, the GC log stops suddenly with that message and nothing else gets printed, probably because the VM thread got into some sort of infinite loop (or something that takes 2+ hours).
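
For reference, the heap dump attempts mentioned above would have been along these lines (the file names are illustrative):

# Normal heap dump; this hangs or fails while the VM is stuck in a safepoint:
jmap -dump:format=b,file=heap.hprof <pid>

# Forced mode reads process memory via the serviceability agent; in our
# case it aborted after writing only ~4 MB:
jmap -F -dump:format=b,file=heap-forced.hprof <pid>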

Recommended Answer

It looks like a GC tuning problem, possibly triggered by a memory leak. I suggest that you add the JVM parameters that turn on GC logging and see if that gives you any insights.
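
For a Sun JDK 6 HotSpot VM, the logging flags in question would be along these lines (the log path is illustrative):

# Timestamped, detailed GC logging written to a dedicated file:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/app/gc.log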


Does it look like a JVM bug?

Not to me.

When you run out of heap space (for example, as a result of a storage leak) it is common to see the GC running more and more frequently and using a larger and larger percentage of the CPU. There are some JVM parameters you can set to avoid this, but they have the effect of causing your JVM to keel over sooner with an OOME. (This may sound strange, but in many cases it is a good thing, because it reduces the time until your system is restarted and able to run normally again.)
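
The parameters alluded to are presumably HotSpot's GC overhead limit settings; a sketch for JDK 6 (the thresholds shown are the defaults):

# Throw OutOfMemoryError instead of spinning when GC stops being productive:
# give up once >98% of time goes to GC while recovering <2% of the heap.
-XX:+UseGCOverheadLimit -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2

# Optionally capture a heap dump for post-mortem analysis when that happens:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/app/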
