WorkingSet Spike just before OutOfMemoryException

Problem Description

I am investigating an incident where an OutOfMemoryException has been thrown in production, for a "traditional" .NET server application. My purpose is to interpret a specific portion of the data gathered through Performance Monitor and seek some advice on how to move on. Let me start with a list of facts:

  1. The process had been running for over 20 days until the crash.
  2. It crashed because an exception of type System.OutOfMemoryException was thrown.
  3. There have been similar incidents in the past; in those cases, too, it took a long time before the application crashed.
  4. The process had been monitored through Performance Monitor with the following counters: # Bytes in all Heaps, % Processor Time, Private Bytes, Working Set (a sketch of reading these counters from code follows this list).
  5. We cannot capture any memory dumps in the production environment, and we haven't been able to reproduce the issue.
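
For reference, here is a minimal C# sketch of reading the same counters from code on the production box. The instance name "MyServerApp" is a placeholder; use the instance name your process shows under these categories in Performance Monitor:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class CounterProbe
    {
        static void Main()
        {
            // Placeholder instance name -- substitute the real one from Perfmon.
            const string instance = "MyServerApp";

            var heapBytes    = new PerformanceCounter(".NET CLR Memory", "# Bytes in all Heaps", instance);
            var privateBytes = new PerformanceCounter("Process", "Private Bytes", instance);
            var workingSet   = new PerformanceCounter("Process", "Working Set", instance);

            while (true)
            {
                Console.WriteLine("{0:HH:mm:ss}  heaps={1:F1} MB  private={2:F1} MB  ws={3:F1} MB",
                    DateTime.Now,
                    heapBytes.NextValue() / (1024f * 1024f),
                    privateBytes.NextValue() / (1024f * 1024f),
                    workingSet.NextValue() / (1024f * 1024f));
                Thread.Sleep(1000);
            }
        }
    }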

In the first screenshot, you can see the overall behavior of the counters over a span of 7 days. Things are pretty much stable. The second screenshot shows the behavior during the last minute, around the crash. The OutOfMemoryException was logged at 3:13:49 PM.

My questions are:

  1. Any ideas what the sudden increase of the Working Set means? It was overall stable at around 650 MB, and in 10 seconds it climbed up to 1.3 GB.
  2. Should I focus on finding something that triggered the OOM just before the crash, or could it be a cumulative factor? As you can see, Private Bytes and # Bytes in all Heaps are pretty much stable.

Recommended Answer

These kinds of problems are exceedingly difficult to diagnose. It is quite possible that what is happening is not the result of a single condition that triggers the behaviour, but a set of simultaneous conditions.

Here is what we know:

No cumulative problem indicated: If the problem were cumulative, we would expect to see some sign of it over the 20-day period leading up to the event. This does not mean that the preceding operation can be ignored: it is possible that some of the conditions that trigger the behaviour are staged and start earlier on. This is something we cannot know with the information available.

Heaps are stable: The Private Bytes measure tells us how much memory has been reserved (not touched, as stephbu suggested). Bytes-in-all-Heaps tells us how much of that reserved memory is currently allocated according to the memory manager (GC). Since both of these are stable, it would seem that the problem isn't necessarily a memory leak. The danger is that we only have 10 seconds of interesting data, and since the GC is usually fairly passive, it isn't clear how accurate those statistics are (particularly with the wonky Working Set).
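
If you can add a little in-process logging to the server, the relationship between these three numbers can be sampled directly. This is only an illustrative sketch using the standard Process and GC APIs, not something the application necessarily already does:

    using System;
    using System.Diagnostics;

    class MemorySnapshot
    {
        static void Main()
        {
            Process p = Process.GetCurrentProcess();

            long privateBytes = p.PrivateMemorySize64;    // committed private memory ("Private Bytes")
            long workingSet   = p.WorkingSet64;           // resident physical memory ("Working Set")
            long gcHeap       = GC.GetTotalMemory(false); // what the GC thinks is allocated on its heaps
                                                          // (roughly "# Bytes in all Heaps");
                                                          // false = do not force a collection

            Console.WriteLine("Private Bytes: {0} MB", privateBytes / 1024 / 1024);
            Console.WriteLine("Working Set:   {0} MB", workingSet / 1024 / 1024);
            Console.WriteLine("GC heap:       {0} MB", gcHeap / 1024 / 1024);
        }
    }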

Working set indicates thrashing: The Working Set tells us how much physical memory the OS wants to keep paged-in to ensure reasonable performance. A growing working set indicates thrashing, and is normally associated with two things:

  • increased allocation rate
  • increased object longevity (often temporary)

Increased object longevity is not indicated, because the heaps are not showing growth. Increased allocation rate is possible, but the objects are still short-lived (since a leak is not indicated).

These observations suggest to me that some rare event (or set of events) is triggering a condition (sketched in code after the list below) in which there is:

  • a high allocation rate
  • of medium-sized objects
  • that are not very long-lived
  • GC is thrashing as a result
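
A purely synthetic illustration of that pattern, with made-up sizes and counts, would look something like the following. The GC heap measured before and after stays roughly flat because the objects die young, even though the process can touch a lot of memory during the burst:

    using System;
    using System.Collections.Generic;

    class AllocationBurst
    {
        static void Main()
        {
            Console.WriteLine("Before: GC heap = {0} MB", GC.GetTotalMemory(false) / 1024 / 1024);

            for (int round = 0; round < 100; round++)
            {
                // A burst of ~64 KB objects: big enough to matter, but below the
                // 85,000-byte large-object-heap threshold. Sizes and counts are
                // invented purely for illustration.
                var batch = new List<byte[]>();
                for (int i = 0; i < 1000; i++)
                {
                    batch.Add(new byte[64 * 1024]);
                }
                // batch goes out of scope here, so the objects die young and the
                // GC heap returns to its previous level once they are collected.
            }

            Console.WriteLine("After:  GC heap = {0} MB", GC.GetTotalMemory(false) / 1024 / 1024);
        }
    }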

There are other reports of these conditions causing OutOfMemoryExceptions (see http://stackoverflow.com/questions/11974694/avoiding-outofmemoryexception-during-large-fast-and-frequent-memory-allocations). I'm not all that certain why it happens. If you are running in a 32-bit environment, then a possible reason is fragmentation of the address space. This can happen if the GC cannot obtain contiguous pages from the OS.
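
Checking the bitness of the process is a quick first step here; a minimal sketch (Environment.Is64BitProcess requires .NET 4.0 or later, otherwise IntPtr.Size alone answers the question):

    using System;

    class BitnessCheck
    {
        static void Main()
        {
            // A 32-bit process has roughly 2 GB of usable address space
            // (up to ~4 GB if it is large-address-aware on 64-bit Windows), so
            // address-space fragmentation can trigger OutOfMemoryException long
            // before physical memory is exhausted.
            Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
            Console.WriteLine("64-bit OS:      {0}", Environment.Is64BitOperatingSystem);
            Console.WriteLine("Pointer size:   {0} bits", IntPtr.Size * 8);
        }
    }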

Another possibility (which I cannot verify) is that the GC requests the OS not to page out parts of the heap it is working on. If the number of locked pages gets high, an out-of-memory condition might result. This idea is almost total speculation, as I do not know enough about the internals of Microsoft's GC implementation.

I don't have any better explanations right now, but I'd definitely like a better explanation if anyone can provide one.

Finally, you might want to verify that a reasonable GC latency mode is enabled. If this were the problem, I think we would have seen an escalation of Bytes-in-all-Heaps -- so it's probably OK.
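
If you want to rule it out explicitly, the current settings can be logged from inside the process; a minimal sketch (GCSettings.IsServerGC requires .NET 4.5 or later):

    using System;
    using System.Runtime;

    class GcModeCheck
    {
        static void Main()
        {
            // Current GC latency mode; modes such as LowLatency or
            // SustainedLowLatency restrict blocking generation-2 collections.
            Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);

            // Whether server GC is in use (normally configured with
            // <gcServer enabled="true"/> in the runtime section of the config file).
            Console.WriteLine("Server GC:    {0}", GCSettings.IsServerGC);
        }
    }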

PS

Can you check which variable is indicated by the dashed line in the second chart? If it is processor use, then it is consistent with thrashing. As content needs to be paged in more and more frequently, disk IO should increase, and (at some point) percentage processor use should decline, because everything is waiting for the disk. This is just an extra detail -- if the processor use doesn't decline excessively, thrashing is still a possibility. This is because parts of the software might still exhibit good locality and be able to make progress.
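
If you can still add counters to the collection, the paging-related ones below make the thrashing hypothesis easier to confirm or reject; the counter and instance names used here are the standard Windows ones, but it is worth confirming them in your Perfmon:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class PagingProbe
    {
        static void Main()
        {
            // "Memory\Pages/sec" counts hard page faults resolved from disk;
            // sustained high values alongside a growing working set point
            // towards thrashing.
            var pagesPerSec = new PerformanceCounter("Memory", "Pages/sec");

            // "_Total" aggregates all physical disks.
            var diskQueue = new PerformanceCounter("PhysicalDisk", "Avg. Disk Queue Length", "_Total");

            // These counters are computed over an interval, so discard the
            // first sample and read again after a delay.
            pagesPerSec.NextValue();
            diskQueue.NextValue();
            Thread.Sleep(1000);

            Console.WriteLine("Pages/sec:              {0:F1}", pagesPerSec.NextValue());
            Console.WriteLine("Avg. Disk Queue Length: {0:F2}", diskQueue.NextValue());
        }
    }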
