How to get Java stacks when JVM can't reach a safepoint


Question


We recently had a situation where one of our production JVMs would randomly freeze. The Java process was burning CPU, but all visible activity would cease: no log output, nothing written to the GC log, no response to any network request, etc. The process would persist in this state until restarted.

It turned out that the org.mozilla.javascript.DToA class, when invoked on certain inputs, will get confused and call BigInteger.pow with enormous values (e.g. 5^2147483647), which triggers the JVM freeze. My guess is that some large loop, perhaps in java.math.BigInteger.multiplyToLen, was JIT'ed without a safepoint check inside the loop. The next time the JVM needed to pause for garbage collection, it would freeze, because the thread running the BigInteger code wouldn't be reaching a safepoint for a very long time.

My question: in the future, how can I diagnose a safepoint problem like this? kill -3 didn't produce any output; I presume it relies on safepoints to generate accurate stacks. Is there any production-safe tool which can extract stacks from a running JVM without waiting for a safepoint? (In this case, I got lucky and managed to grab a set of stack traces just after BigInteger.pow was invoked, but before it worked its way up to a sufficiently large input to completely wedge the JVM. Without that stroke of luck, I'm not sure how we would ever have diagnosed the problem.)

Edit: the following code illustrates the problem.

import java.math.BigInteger;

// The class name matches the "java ... Tests" commands used in the answer.
public class Tests {
  public static void main(String[] args) {
    // Spawn a background thread to compute an enormous number.
    new Thread() { @Override public void run() {
      try {
        Thread.sleep(5000);
      } catch (InterruptedException ex) {
        // Ignored: the sleep only delays the start of the computation.
      }
      BigInteger.valueOf(5).pow(100000000);
    }}.start();

    // Loop, allocating memory and periodically logging progress, to illustrate GC pause times.
    byte[] b;
    for (int outer = 0; ; outer++) {
      long startMs = System.currentTimeMillis();
      for (int inner = 0; inner < 100000; inner++) {
        b = new byte[1000];
      }

      System.out.println("Iteration " + outer + " took " + (System.currentTimeMillis() - startMs) + " ms");
    }
  }
}

This launches a background thread that waits 5 seconds and then starts an enormous BigInteger computation. In the foreground, it repeatedly allocates a series of 100,000 1K blocks, logging the elapsed time for each 100MB series. During the initial 5-second period, each 100MB series runs in about 20 milliseconds on my MacBook Pro. Once the BigInteger computation begins, long pauses start to appear between iterations. In one test, the pauses were successively 175ms, 997ms, 2927ms, 4222ms, and 22617ms (at which point I aborted the test). This is consistent with BigInteger.pow() invoking a series of ever-larger multiply operations, each taking successively longer to reach a safepoint.
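The growth in pause length can be made concrete with a small sketch. It assumes that pow() works by repeated squaring, so each multiply handles operands roughly twice as long as the previous one. The class name SquaringGrowth and the helper bitLengthsAfterSquaring are made up for illustration and are not part of the original test.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class SquaringGrowth {
  // Returns the bit length of the value after each of `steps` successive squarings of `base`.
  static List<Integer> bitLengthsAfterSquaring(BigInteger base, int steps) {
    List<Integer> lengths = new ArrayList<>();
    BigInteger value = base;
    for (int i = 0; i < steps; i++) {
      value = value.multiply(value);  // one squaring step
      lengths.add(value.bitLength());
    }
    return lengths;
  }

  public static void main(String[] args) {
    // Time each squaring of 5 to show how quickly the individual multiplies grow.
    BigInteger value = BigInteger.valueOf(5);
    for (int i = 1; i <= 20; i++) {
      long startNs = System.nanoTime();
      value = value.multiply(value);
      long elapsedMs = (System.nanoTime() - startNs) / 1_000_000;
      System.out.println("Squaring " + i + ": " + value.bitLength() + " bits, " + elapsedMs + " ms");
    }
  }
}
```

The bit length roughly doubles with every step, so if a single multiply cannot be interrupted for a safepoint, the worst-case wait grows with each step as well.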

Solution

Your problem interested me very much. You were right about the JIT. First I tried switching GC types, but this had no effect. Then I tried disabling the JIT, and everything worked fine:

java -Djava.compiler=NONE Tests

Then printed out JIT compilations:

java -XX:+PrintCompilation Tests

Having noticed that the problem starts after some methods in the BigInteger class are compiled, I tried excluding methods from compilation one by one and finally found the cause:

java -XX:CompileCommand=exclude,java/math/BigInteger,multiplyToLen -XX:+PrintCompilation Tests

For large arrays this method can run for a long time, so the problem may really be with safepoints. For some reason no safepoint checks are inserted, although they should be present even in compiled code. This looks like a bug. The next step would be to analyze the generated assembly code, which I have not done yet.
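To at least notice stalls like this from inside the process, one option is a small watchdog thread that sleeps for a fixed interval and records how much longer than requested each sleep took. The sketch below assumes, as I understand HotSpot's behavior, that a thread waking from sleep cannot resume running Java code while a safepoint is pending, so a JVM-wide stall shows up as a large overshoot. The class name PauseMeter, its methods, and the thresholds are made up for illustration.

```java
import java.util.concurrent.TimeUnit;

class PauseMeter {
  // Sleeps for intervalMs and returns how many milliseconds longer than requested the sleep took.
  static long measureOvershootMs(long intervalMs) throws InterruptedException {
    long startNs = System.nanoTime();
    Thread.sleep(intervalMs);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
    return Math.max(0, elapsedMs - intervalMs);
  }

  // Starts a daemon thread that logs whenever a sleep overshoots by more than thresholdMs.
  static Thread start(long intervalMs, long thresholdMs) {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          long overshootMs = measureOvershootMs(intervalMs);
          if (overshootMs > thresholdMs) {
            System.err.println("Possible JVM-wide stall: sleep overshot by " + overshootMs + " ms");
          }
        }
      } catch (InterruptedException ex) {
        Thread.currentThread().interrupt();  // stop quietly when interrupted
      }
    }, "pause-meter");
    t.setDaemon(true);
    t.start();
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    start(10, 100);      // check every 10 ms, report overshoots above 100 ms
    Thread.sleep(1000);  // stand-in for the real application's work
  }
}
```

This only reports a stall after it ends, and it does not say which thread failed to reach the safepoint, so it complements a stack dump and does not replace one.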
