How to get Java stacks when JVM can't reach a safepoint

Problem description

We recently had a situation where one of our production JVMs would randomly freeze. The Java process was burning CPU, but all visible activity would cease: no log output, nothing written to the GC log, no response to any network request, etc. The process would persist in this state until restarted.

It turned out that the org.mozilla.javascript.DToA class, when invoked on certain inputs, will get confused and call BigInteger.pow with enormous values (e.g. 5^2147483647), which triggers the JVM freeze. My guess is that some large loop, perhaps in java.math.BigInteger.multiplyToLen, was JIT'ed without a safepoint check inside the loop. The next time the JVM needed to pause for garbage collection, it would freeze, because the thread running the BigInteger code wouldn't be reaching a safepoint for a very long time.

My question: in the future, how can I diagnose a safepoint problem like this? kill -3 didn't produce any output; I presume it relies on safepoints to generate accurate stacks. Is there any production-safe tool which can extract stacks from a running JVM without waiting for a safepoint? (In this case, I got lucky and managed to grab a set of stack traces just after BigInteger.pow was invoked, but before it worked its way up to a sufficiently large input to completely wedge the JVM. Without that stroke of luck, I'm not sure how we would ever have diagnosed the problem.)

Edit: the following code illustrates the problem.

import java.math.BigInteger;

public class Tests {
  public static void main(String[] args) {
    // Spawn a background thread to compute an enormous number.
    new Thread() {
      @Override public void run() {
        try {
          Thread.sleep(5000);
        } catch (InterruptedException ex) {
          // Ignored: the sleep only delays the start of the computation.
        }
        BigInteger.valueOf(5).pow(100000000);
      }
    }.start();

    // Loop, allocating memory and periodically logging progress, to illustrate GC pause times.
    byte[] b;
    for (int outer = 0; ; outer++) {
      long startMs = System.currentTimeMillis();
      for (int inner = 0; inner < 100000; inner++) {
        b = new byte[1000];
      }

      System.out.println("Iteration " + outer + " took " + (System.currentTimeMillis() - startMs) + " ms");
    }
  }
}

This launches a background thread which waits 5 seconds and then starts an enormous BigInteger computation. The foreground thread then repeatedly allocates a series of 100,000 1K blocks, logging the elapsed time for each 100MB series. During the initial 5-second period, each 100MB series runs in about 20 milliseconds on my MacBook Pro. Once the BigInteger computation begins, long pauses start to appear, interleaved with normal iterations. In one test, the pauses were successively 175ms, 997ms, 2927ms, 4222ms, and 22617ms (at which point I aborted the test). This is consistent with BigInteger.pow() invoking a series of ever-larger multiply operations, each taking successively longer to reach a safepoint.
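The "ever-larger multiply operations" pattern can be seen in isolation. The sketch below assumes that pow works by repeated squaring, which is an assumption about the implementation rather than something verified here. The names SquaringGrowth and squaringBitLengths are made up for illustration. It squares 5 repeatedly and records the size of the result in bits, which roughly doubles at each step, so each multiply operates on operands about twice as large as the one before.

```java
import java.math.BigInteger;

// Illustrative sketch only: shows how operand size grows under repeated squaring.
final class SquaringGrowth {

  // Returns the bit length of the value after each of `steps` successive squarings of `base`.
  static int[] squaringBitLengths(long base, int steps) {
    int[] bits = new int[steps];
    BigInteger value = BigInteger.valueOf(base);
    for (int i = 0; i < steps; i++) {
      value = value.multiply(value); // operand size roughly doubles here
      bits[i] = value.bitLength();
    }
    return bits;
  }

  public static void main(String[] args) {
    int[] bits = squaringBitLengths(5, 20);
    for (int i = 0; i < bits.length; i++) {
      System.out.println("after squaring " + (i + 1) + ": " + bits[i] + " bits");
    }
  }
}
```

If a single multiply cannot be interrupted for a safepoint, the time the rest of the JVM has to wait grows with each step, which would match the increasing pause times reported above.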

Solution

Your problem interested me very much, and you were right about the JIT. First I tried switching GC types, but this had no effect. Then I tried disabling the JIT, and everything worked fine:

java -Djava.compiler=NONE Tests

Then I printed out the JIT compilations:

java -XX:+PrintCompilation Tests

I noticed that the problem starts after some methods in the BigInteger class are compiled, so I tried excluding methods from compilation one by one and finally found the cause:

java -XX:CompileCommand=exclude,java/math/BigInteger,multiplyToLen -XX:+PrintCompilation Tests

For large arrays this method can run for a long time, and the problem may really be related to safepoints. For some reason safepoint checks are not inserted, although they should be present even in compiled code. This looks like a bug. The next step would be to analyze the generated assembly code, which I have not done yet.
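One possible explanation, not verified against this particular JVM build, is that HotSpot's C2 compiler is widely documented to omit safepoint polls from "counted" loops (an int index, a fixed stride and a loop-invariant bound), on the expectation that such loops finish quickly. The inner loops of multiplyToLen appear to have that shape. The sketch below only contrasts the two loop shapes. The class and method names are hypothetical, and whether a poll is actually emitted depends on the JVM version and flags (some HotSpot builds offer -XX:+UseCountedLoopSafepoints to force polls into counted loops).

```java
import java.util.Arrays;

// Illustrative sketch only: same result, different loop shapes as seen by the JIT.
final class LoopShapes {

  // Counted loop: int index, stride of 1, loop-invariant bound.
  // Assumption: on older HotSpot C2 builds this shape may be compiled without a safepoint poll.
  static long sumCounted(int[] data) {
    long sum = 0;
    for (int i = 0; i < data.length; i++) {
      sum += data[i];
    }
    return sum;
  }

  // Same work with a long index. Assumption: older C2 builds did not treat this
  // as a counted loop, so a safepoint poll was kept on the loop back-edge.
  static long sumWithLongIndex(int[] data) {
    long sum = 0;
    for (long i = 0; i < data.length; i++) {
      sum += data[(int) i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] data = new int[1_000_000];
    Arrays.fill(data, 1);
    System.out.println(sumCounted(data) + " " + sumWithLongIndex(data));
  }
}
```

Both methods return the same sum. The difference, if any, would show up only in the generated assembly, which is why inspecting the compiled code for multiplyToLen is the natural next step.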
