Java Garbage Collector - Not running normally at regular intervals

Question

I have a program that is constantly running. Normally, it seems to garbage collect, and remain under about 8MB of memory usage. However, every weekend, it refuses to garbage collect unless I make an explicit call to it. However, if it nears the maximum heap size, it will still garbage collect. However the only reason this issue was noticed, is because it actually crashed from running out of memory on one weekend i.e. it must have reached the maximum heap size, and not run the garbage collector.

The following image (click to see) is a graph of the program's memory usage over a day. On the sides of the graph, you can see the normal behaviour of the program's memory usage, but the first large peak is what seems to start over the weekend. This particular graph is a strange example, because after I made an explicit call to the garbage collector, it ran successfully, but then it went and climbed back to the maximum heap size and successfully garbage collected on its own twice.

What is going on here?

Edit:

Ok, from the comments, it seems I haven't provided enough information. The program simply receives a stream of UDP packets, which are placed in a queue (set to have a maximum size of 1000 objects), which are then processed to have their data stored in a database. On average, it gets about 80 packets per second, but can peak to 150. It's running under Windows Server 2008.
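
For illustration, a minimal sketch of this kind of bounded producer/consumer pipeline might look like the following, assuming a java.util.concurrent.ArrayBlockingQueue with a capacity of 1000. All class, method and field names here are hypothetical, not taken from the actual program:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PacketPipeline {
        // Bounded queue: at most 1000 payloads are buffered between receiver and processor.
        private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<byte[]>(1000);

        // Receiver thread: reads UDP packets and hands copies of their payloads to the queue.
        void receiveLoop(DatagramSocket socket) throws Exception {
            byte[] buffer = new byte[2048];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);
                byte[] payload = Arrays.copyOf(packet.getData(), packet.getLength());
                queue.put(payload);       // blocks if the queue already holds 1000 entries
            }
        }

        // Processor thread: takes payloads off the queue and stores them in the database.
        void processLoop() throws Exception {
            while (true) {
                byte[] payload = queue.take();
                storeInDatabase(payload); // hypothetical persistence call
            }
        }

        private void storeInDatabase(byte[] payload) {
            // ... JDBC insert would go here ...
        }
    }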

The thing is, this activity is fairly consistent, and if anything, at the time that the memory usage starts its steady climb, the activity should be lower, not higher. Mind you, the graph I posted above is the only one I have that extends back that far, since I only changed the Java Visual VM wrapper to keep graph data back far enough to see it this week, so I have no idea if it's exactly the same time every week, because I can't watch it over the weekend, as it's on a private network, and I'm not at work on the weekend.

Here is the graph for the following day:

This is pretty much what the memory usage looks like every other day of the week. The program is never restarted, and we only tell it to garbage collect on a Monday morning because of this issue. One week we tried restarting it on a Friday afternoon, and it still started climbing sometime over the weekend, so the time that we restart it doesn't seem to have anything to do with the memory usage next week.

The fact that it successfully garbage collects all those objects when we tell it to implies to me that the objects are collectable; it's just not doing it until it reaches the maximum heap size, or until we explicitly call the garbage collector. A heap dump doesn't tell us anything, because when we try to perform one, it suddenly runs the garbage collector and then outputs a heap dump, which of course looks perfectly normal at that point.
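
One way to work around that, for illustration, is to request a dump of all objects rather than only live ones; a sketch using the com.sun.management.HotSpotDiagnosticMXBean is below (the output path is just an example). Passing false for the live parameter includes unreachable objects in the dump, which should avoid the collection that a live-objects-only dump performs first:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class HeapDumper {
        public static void dumpAllObjects(String path) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // live = false: dump every object on the heap, reachable or not, so the
            // dump shows what has piled up rather than what survives a collection.
            bean.dumpHeap(path, false);
        }

        public static void main(String[] args) throws Exception {
            dumpAllObjects("C:\\temp\\all-objects.hprof");  // example path only
        }
    }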

So I suppose I have two questions: Why is it suddenly not garbage collecting the way it does the rest of the week, and why is it that on one occasion, the garbage collection that occurs when it reaches the maximum heap size was unable to collect all those objects (i.e. why would there be references to so many objects that one time, when every other time there must not be)?

Update:

This morning has been an interesting one. As I mentioned in the comments, the program is running on a client's system. Our contact in the client organisation reports that at 1am, this program failed, and he had to restart it manually when he got into work this morning, and that once again, the server time was incorrect. This is an issue we've had with them in the past, but until now, the issue never seemed to be related.

Looking through the logs that our program produces, we can deduce the following information:

  1. At 01:00, the server has somehow resynced its time, setting it to 00:28.
  2. At 00:45 (according to the new, incorrect server time), one of the message processing threads in the program has thrown an out of memory error.
  3. However, the other message processing thread (there are two types of messages we receive; they are processed slightly differently, but both are constantly coming in) continues to run, and as usual, the memory usage continues to climb with no garbage collection (as seen from the graphs we have been recording, once again).
  4. At 00:56, the logs stop, until about 7am when the program was restarted by our client. However, the memory usage graph, for this time, was still steadily increasing.

Unfortunately, the change in server time makes the times on our memory usage graph unreliable. However, it seems that it tried to garbage collect, failed, increased the heap space to the maximum available size, and killed that thread all at once. Now that the maximum heap space has increased, it's happy to use all of it without performing a major garbage collection.

So now I ask this: if the server time changes suddenly like it did, can that cause a problem with the garbage collection process?

Answer

However the only reason this issue was noticed, is because it actually crashed from running out of memory on one weekend i.e. it must have reached the maximum heap size, and not run the garbage collector.

I think your diagnosis is incorrect. Unless there is something seriously broken about your JVM, then the application will only throw an OOME after it has just run a full garbage collect, and discovered that it still doesn't have enough free heap to proceed*.

I suspect that what is going on here is one or more of the following:

  • Your application has a slow memory leak. Each time you restart the application, the leaked memory gets reclaimed. So, if you restart the application regularly during the week, this could explain why it only crashes on the weekend. (A hypothetical sketch of this kind of leak follows this list.)

  • Your application is doing computations that require varying amounts of memory to complete. On that weekend, someone sent it a request that required more memory than was available.
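
To illustrate the first possibility, a slow leak is often something as mundane as a cache or statistics map that only ever grows. This is a purely hypothetical sketch, not something known to exist in the program in question:

    import java.util.HashMap;
    import java.util.Map;

    public class PacketStats {
        // A map keyed on something that never repeats (a sequence number, a timestamp)
        // grows forever: every entry stays strongly reachable, so no garbage collection
        // can reclaim it, and the heap climbs slowly over days.
        private static final Map<Long, byte[]> seen = new HashMap<Long, byte[]>();

        public static void record(long sequenceNumber, byte[] payload) {
            seen.put(sequenceNumber, payload);  // added, but never removed
        }
    }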

Running the GC by hand is not actually going to solve the problem in either case. What you need to do is to investigate the possibility of memory leaks, and also look at the application memory size to see if it is large enough for the tasks that are being performed.

If you can capture heap stats over a long period, a memory leak will show up as a downwards trend over time in the amount of memory available after full garbage collections. (That is the height of the longest "teeth" of the sawtooth pattern.) A workload-related memory shortage will probably show up as an occasional sharp downwards trend in the same measure over a relatively short period of time, followed by a recovery. If you see both, then you could have both things happening.
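
If Java VisualVM cannot retain data for long enough, one way to capture such statistics is to log heap usage and GC activity from inside the JVM itself, using the standard java.lang.management beans. A minimal sketch (the one-minute interval and the log format are arbitrary choices):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapLogger implements Runnable {
        public void run() {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                long collections = 0;
                long gcMillis = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    collections += gc.getCollectionCount();  // cumulative number of collections
                    gcMillis += gc.getCollectionTime();      // cumulative time spent in GC
                }
                // One line per minute: heap occupancy plus cumulative GC activity.
                System.out.println(String.format(
                        "heap used=%dKB committed=%dKB max=%dKB gcCount=%d gcTime=%dms",
                        heap.getUsed() / 1024, heap.getCommitted() / 1024, heap.getMax() / 1024,
                        collections, gcMillis));
                try {
                    Thread.sleep(60 * 1000L);
                } catch (InterruptedException e) {
                    return;  // stop logging if the thread is interrupted
                }
            }
        }

        public static void main(String[] args) {
            new Thread(new HeapLogger(), "heap-logger").start();
        }
    }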

* Actually, the criteria for deciding when to give up and throw an OOME are a bit more complicated than this. They depend on certain JVM tuning options, and can include the percentage of time spent running the GC.

FOLLOWUP

@Ogre - I'd need a lot more information about your application to be able to answer that question (about memory leaks) with any specificity.

With your new evidence, there are two further possibilities:

  • Your application may be getting stuck in a loop that leaks memory as a result of the clock time-warping.

  • The clock time-warping may cause the GC to think that it is taking too large a percentage of run time and trigger an OOME as a result. This behaviour depends on your JVM settings.

Either way, you should lean hard on your client to get them to stop adjusting the system clock like that. (A 32 minute timewarp is way too much!) Get them to install a system service to keep the clock in sync with network time hour by hour (or more frequently). Critically, get them to use a service with an option to adjust the clock in small increments.

(Re the 2nd bullet: there is a GC monitoring mechanism in the JVM that measures the percentage of overall time that the JVM is spending running the GC, relative to doing useful work. This is designed to prevent the JVM from grinding to a halt when your application is really running out of memory.

This mechanism would be implemented by sampling the wall-clock time at various points. But if the wall-clock time is timewarped at a critical point, it is easy to see how the JVM may think that a particular GC run took much longer than it actually did ... and trigger the OOME.)
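
As an aside, this is the same reason application code usually measures intervals with System.nanoTime() rather than System.currentTimeMillis(): nanoTime() is monotonic and is not affected by the system clock being set backwards or forwards. A small sketch of the difference (whether the GC-overhead accounting in this particular JVM version samples the wall clock is, as noted above, an assumption):

    public class TimingExample {
        public static void main(String[] args) throws InterruptedException {
            // Wall-clock time: jumps if the operating system clock is reset (e.g. by 32 minutes).
            long wallStart = System.currentTimeMillis();
            // Monotonic time: only ever moves forward, regardless of clock adjustments.
            long monoStart = System.nanoTime();

            Thread.sleep(1000);

            long wallElapsed = System.currentTimeMillis() - wallStart;     // wrong if the clock warped
            long monoElapsed = (System.nanoTime() - monoStart) / 1000000L; // unaffected by clock changes

            System.out.println("wall clock elapsed: " + wallElapsed + " ms");
            System.out.println("monotonic elapsed:  " + monoElapsed + " ms");
        }
    }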
