Method for finding memory leak in large Java heap dumps


Question


I have to find a memory leak in a Java application. I have some experience with this but would like advice on a methodology/strategy for this. Any reference and advice is welcome.

About our situation:

  1. Heap dumps are larger than 1 GB
  2. We have heap dumps from 5 occasions.
  3. We don't have any test case to provoke this. It only happens in the (massive) system test environment after at least a week's usage.
  4. The system is built on an internally developed legacy framework with so many design flaws that it is impossible to count them all.
  5. Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.
  6. We have done snapshot heap dumps over time and concluded that there is not a single component increasing over time. It is everything that grows slowly.
  7. The above points us in the direction that it is the framework's homegrown ORM system that increases its usage without limits. (This system maps objects to files?! So not really an ORM)

Question: What is the methodology that helped you succeed with hunting down leaks in an enterprise-scale application?

Solution

It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better sort the wheat from the chaff among the zillion bits of information you are getting in your heap dumps.

Also, you can't know whether something is a leak without knowing why the class is there in the first place.

I just spent the past couple of weeks doing exactly this, and I used an iterative process.

First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.

Rather, I relied almost solely on jmap histograms.

I imagine you're familiar with these, but for those not:

jmap -histo:live <pid> > dump.out

creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.
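To work with these files programmatically, the histogram text first has to be turned into per-class counts. The following is a minimal Python sketch, not the script used here. It assumes the usual jmap layout of a header row, a separator, numbered data rows and a closing "Total" row. The function name and the sample numbers are made up for illustration.

```python
import re

# Matches a data row such as "   2:    15000    360000  java.lang.String".
# Newer JDKs append a module column after the class name; it is ignored here.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Return {class_name: (instances, bytes)} from jmap -histo output."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header, separator and "Total" lines do not match
            instances, size, name = match.groups()
            counts[name] = (int(instances), int(size))
    return counts


# Illustrative sample only; real histograms have thousands of rows.
SAMPLE = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:         52000        4160000  [C
   2:         15000         360000  java.lang.String
   3:           300           9600  java.util.HashMap$Node
Total         67300        4529600
"""

histogram = parse_histogram(SAMPLE)
print(histogram["java.lang.String"])  # (instances, bytes) for String
```

Keeping both the instance count and the byte count lets later analysis rank classes by either measure.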

I was dumping out the heap histogram regularly, every 5 minutes, 24 hours a day. That may well be too granular for you, but the gist is the same.
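One way to automate the periodic capture is a small wrapper around the jmap command quoted above. This is only a sketch: the function names and the file naming scheme are assumptions, jmap must be on the PATH and run as a user allowed to attach to the JVM, and a cron job would do the same thing.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def jmap_command(pid):
    """Build the jmap invocation quoted above for one JVM process id."""
    return ["jmap", "-histo:live", str(pid)]


def snapshot_path(out_dir, when):
    """Name each file by capture time so the files sort chronologically."""
    return Path(out_dir) / f"histo-{when:%Y%m%d-%H%M%S}.txt"


def capture_once(pid, out_dir):
    """Run jmap once and save its output to a timestamped file."""
    result = subprocess.run(
        jmap_command(pid), capture_output=True, text=True, check=True
    )
    path = snapshot_path(out_dir, datetime.now())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(result.stdout)
    return path


def capture_forever(pid, out_dir, interval_seconds=300):
    """Take a histogram every interval_seconds (300 s = 5 minutes) until interrupted."""
    while True:
        capture_once(pid, out_dir)
        time.sleep(interval_seconds)
```

Calling capture_forever(12345, "histos") for a hypothetical pid 12345 would write one file every five minutes into a histos directory until the script is stopped.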

I ran several different analyses on this data.

I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.
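The original script isn't shown, so here is a minimal Python sketch of the same idea. It assumes each histogram has already been reduced to a mapping from class name to instance count. The function name is an assumption and com.app.Session is a hypothetical class.

```python
def diff_histograms(before, after):
    """Return {class_name: after_count - before_count}, omitting unchanged classes."""
    deltas = {}
    for name in before.keys() | after.keys():
        # A class missing from one histogram counts as zero instances there.
        delta = after.get(name, 0) - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas


first = {"java.lang.String": 10, "java.util.ArrayList": 7, "com.app.Session": 3}
second = {"java.lang.String": 15, "java.util.ArrayList": 4, "com.app.Session": 3}

# Largest growth first, printed in the "count class" form described above.
for name, delta in sorted(diff_histograms(first, second).items(), key=lambda kv: -kv[1]):
    print(delta, name)
```

With these sample counts the script reports java.lang.String up by 5 and java.util.ArrayList down by 3, and it omits the unchanged class.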

I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
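This filtering step can be sketched in Python as follows. It is one possible reading of the description: a class stays a candidate only if its count never dropped between consecutive snapshots and it ended higher than it started. The function name and the com.app class names are hypothetical.

```python
def persistent_growers(snapshots):
    """Classes that never shrank between consecutive snapshots and grew overall.

    snapshots is a chronological list of {class_name: instance_count} dicts.
    """
    if len(snapshots) < 2:
        return set()
    candidates = set().union(*snapshots)  # every class seen in any snapshot
    for before, after in zip(snapshots, snapshots[1:]):
        # Strip out any class whose count went down in this interval.
        candidates = {n for n in candidates if after.get(n, 0) >= before.get(n, 0)}
    first, last = snapshots[0], snapshots[-1]
    # Flat classes never went down either, so also require overall growth.
    return {n for n in candidates if last.get(n, 0) > first.get(n, 0)}


runs = [
    {"com.app.Order": 10, "java.util.Date": 5, "com.app.Config": 1},
    {"com.app.Order": 12, "java.util.Date": 4, "com.app.Config": 1},
    {"com.app.Order": 15, "java.util.Date": 9, "com.app.Config": 1},
]
print(persistent_growers(runs))  # only com.app.Order rose in every interval
```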

However, some classes have some instances retained while others are GC'd. The counts for these classes could easily go up and down overall, yet the class could still leak. So, they could fall out of the "always rising" category of classes.

To find these, I converted the data into a time series and loaded it into a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data, and find classes that trend up, even if they aren't always on top of the charts. I used the regr_slope function, looking for classes with a positive slope.
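For readers without a database at hand, the same trend test can be approximated in plain Python. The sketch below computes the ordinary least-squares slope of count against time, which is the quantity regr_slope(Y, X) returns in Postgres. It stands in for the SQL and is not the query used here; the function names, class names and sample counts are all made up.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs, or None if xs has no spread."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # a single sample or identical timestamps: slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def upward_trends(series):
    """Return (class_name, slope) pairs with a positive slope, steepest first.

    series maps class name to a list of (timestamp_seconds, instance_count).
    """
    trends = []
    for name, points in series.items():
        value = slope([t for t, _ in points], [c for _, c in points])
        if value is not None and value > 0:
            trends.append((name, value))
    return sorted(trends, key=lambda item: -item[1])


# Both classes rise and fall between samples, but only the first trends upward.
series = {
    "com.app.CacheEntry": [(0, 100), (300, 90), (600, 130), (900, 120), (1200, 160)],
    "java.util.Date": [(0, 50), (300, 70), (600, 40), (900, 60), (1200, 45)],
}
for name, value in upward_trends(series):
    print(name, value)
```

The slope is in instances per second, so multiplying by 86400 gives an approximate growth per day, which is easier to compare across classes.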

I found this process very successful, and really efficient. The histogram files aren't insanely large, and it was easy to download them from the hosts. They weren't super expensive to run on the production system (they do force a large GC, and may block the VM for a bit). I was running this on a system with a 2G Java heap.

Now, all this can do is identify potentially leaking classes.

This is where understanding how the classes are used, and whether they should or should not be there, comes into play.

For example, you may find that you have a lot of Map.Entry instances, or instances of some other system class.

Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.

Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at its source code. If your ORM is caching stuff, it's likely caching ORM classes wrapping your application classes.

Finally, another thing you can do, once you know the classes, is start up a local instance of the server, with a much smaller heap and a smaller dataset, and use one of the profilers against that.

In this case, you can do a unit test that affects only 1 (or a small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. Your leaking class should have increased by 1 (or whatever your unit of work is).

A profiler may be able to help you track the owners of that "now leaked" class.

But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.
