How to enable GC logging for Apache Storm workers, while preventing log file overwrites and capping disk space usage

Problem description

We recently decided to enable GC logging for Apache Storm workers on a number of clusters (exact version varies) as an aid to looking into topology-related memory and garbage collection problems. We want to do that for workers, but we also want to avoid two problems we know might happen:

  • overwriting of the log file when a worker restarts for any reason
  • the logs using too much disk space, leading to disks getting filled (if you keep the cluster running long enough, log files will fill up disk unless managed)

When Java GC logging starts for a process it seems to replace the content of any file that has the same name. This means that unless you are careful, you will lose the GC logging, perhaps when you are most likely to need it.

Solution

You can set JVM options for Storm workers via the worker.childopts property in storm.yaml (if you are managing Storm through Apache Ambari, look under Storm service > configs > advanced storm-site > worker.childopts). You will be adding additional JVM properties to that.

To enable GC logging to a file, you will need to add -verbose:gc -Xloggc:<log-file-location>.

You need to give the log file name special consideration to prevent overwrites. It seems like you need to have a unique name for every invocation. To achieve this, take advantage of some of the special "%" string replacements mentioned in the Storm code documentation. For uniqueness, %WORKER-ID% is sufficient; it is (quite likely) unique for each worker process. You may also want to be able to easily tell what topology the GC log is for. In that case add in %TOPOLOGY-ID% (you may need to say %ID% in some older versions of Storm); it may be long but will provide the name of the topology.

So far the JVM options are -verbose:gc -Xloggc:/var/log/storm/storm-worker-%TOPOLOGY-ID%-%WORKER-ID%-gc.log (the -%TOPOLOGY-ID% is optional, the path should match your Storm logging directory, and you can name the log file differently if you prefer).
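To make the effect of the placeholders concrete, here is a minimal Python sketch that mimics the substitution. It is only an illustration, not Storm's actual code: the function name and both IDs are made up, and it assumes a restarted worker is given a new worker ID.

```python
def expand_gc_log_path(template: str, topology_id: str, worker_id: str) -> str:
    """Mimic, for illustration only, the %...% replacement applied to worker.childopts."""
    return (template
            .replace("%TOPOLOGY-ID%", topology_id)
            .replace("%WORKER-ID%", worker_id))


TEMPLATE = "/var/log/storm/storm-worker-%TOPOLOGY-ID%-%WORKER-ID%-gc.log"

# Hypothetical IDs; the real ones are generated by Storm.
first = expand_gc_log_path(TEMPLATE, "wordcount-1-1450000000", "worker-aaa")
restarted = expand_gc_log_path(TEMPLATE, "wordcount-1-1450000000", "worker-bbb")

print(first)
print(restarted)
```

Because the two paths differ, the restarted worker's JVM writes to a new file instead of truncating the earlier log.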

Now onto managing use of disk space. I'll be happy if there is a simpler way than what I have.

First, take advantage of Java's built-in GC log file rotation. -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M is an example of enabling this rotation, having up to 10 GC log files from the JVM, each of which is no more than 10MB in size. 10 x 10MB is 100MB max usage. Note that this is per worker instance.

With the GC log file rotation in place with up to 10 files, '.0', '.1', ... '.9' will be added to the file name you gave in Xloggc. .0 will be first and after it reaches .9 it will replace .0 and continue on in a round robin manner. In some versions of Java '.current' will be additionally put on the end of the name of the log file currently being written to.
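The round-robin naming can be pictured with a short Python simulation. This is only a model of the behaviour described above, not anything the JVM exposes; the base file name is a placeholder and the '.current' suffix is ignored.

```python
from itertools import islice


def rotation_suffixes(number_of_files: int):
    """Yield suffixes in the order described: .0, .1, ..., then wrap back to .0."""
    index = 0
    while True:
        yield f".{index}"
        index = (index + 1) % number_of_files


base = "storm-worker-gc.log"  # placeholder for the name given to -Xloggc
names = [base + suffix for suffix in islice(rotation_suffixes(10), 12)]
print(names)
```

The eleventh name in the list is the '.0' file again, so no more than 10 files exist per JVM at any time.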

Due to the unique file naming we apparently have to add to avoid overwrites, this means you can have 100MB per worker process invocation, so this is not a total solution to managing disk space used by storm worker child GC logs. You will end up with a set of up to 10 GC log files for each process -- this can add up. The best solution (under *nix) to that would seem to be to use the logrotate utility to periodically clean up worker GC logs that have not been modified in the last N days.
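If logrotate is not available, the same "not modified in the last N days" clean-up could be approximated with a small script run from cron. The following is a rough Python sketch, not a logrotate configuration; the function names and the `*-gc.log*` glob are assumptions based on the file naming used above, and it defaults to a dry run.

```python
import time
from pathlib import Path


def stale_gc_logs(log_dir, max_age_days, now=None):
    """Return GC log files in log_dir whose last modification is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        path for path in Path(log_dir).glob("*-gc.log*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def remove_stale_gc_logs(log_dir, max_age_days, dry_run=True):
    """Delete stale GC logs (or only list them when dry_run is True)."""
    removed = []
    for path in stale_gc_logs(log_dir, max_age_days):
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

For example, `remove_stale_gc_logs("/var/log/storm", 7)` would list GC logs untouched for a week, and passing `dry_run=False` would delete them. Logs of live workers keep being written to, so their modification time stays recent and they are left alone.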

Be sure to do the math and make sure you will have enough disk space.
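As a rough way to do that math, here is a small Python sketch. The worker count and the number of old worker invocations kept around before clean-up are made-up figures; substitute your own.

```python
def gc_log_budget_mb(files_per_jvm, file_size_mb, workers_per_host, old_invocations_kept):
    """Worst-case GC log usage on one host, in MB."""
    per_invocation = files_per_jvm * file_size_mb  # e.g. 10 x 10MB = 100MB
    # Each worker slot holds the current invocation plus any old ones not yet cleaned up.
    return per_invocation * workers_per_host * (1 + old_invocations_kept)


# Hypothetical host: 4 worker slots, up to 5 earlier invocations per slot still on disk.
print(gc_log_budget_mb(10, 10, 4, 5))  # 2400
```

With those assumed numbers the GC logs alone could take about 2.4GB per host, which is why the clean-up interval matters as much as the rotation settings.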

People frequently want more details and context in their GC logs than the default, so consider adding in -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps.

All together, you will be adding something like the following to worker.childopts: -verbose:gc -Xloggc:/var/log/storm/storm-worker-%TOPOLOGY-ID%-%WORKER-ID%-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps plus configure logrotate.
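Since the options are appended to whatever worker.childopts already contains, it can help to assemble the line programmatically and paste the result. A minimal Python sketch follows; the existing `-Xmx768m` value is only an assumed example of what your setting might currently hold.

```python
GC_LOG_PATH = "/var/log/storm/storm-worker-%TOPOLOGY-ID%-%WORKER-ID%-gc.log"

gc_options = [
    "-verbose:gc",
    f"-Xloggc:{GC_LOG_PATH}",
    "-XX:+UseGCLogFileRotation",
    "-XX:NumberOfGCLogFiles=10",
    "-XX:GCLogFileSize=10M",
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCTimeStamps",
    "-XX:+PrintGCDateStamps",
]

existing = "-Xmx768m"  # assumed current value of worker.childopts; use your own
childopts = " ".join([existing] + gc_options)

# Print the line in the form it would take in storm.yaml.
print(f'worker.childopts: "{childopts}"')
```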

Finally, I should mention a couple other options for naming log files, though I don't see the advantage, at least for my use case:

  • In some versions of Java you can put %t in GC log file naming and Java will replace that with the current timestamp formatted as <YYYY>-<MM>-<DD>_<HH>-<MM>-<SS>. You can also put %p to get the current process ID.
  • Somebody told me that in some cases you can put backticked expressions such as `date +'%Y%m%d%H%M'` in some combinations of Storm and Java, at least if you use Ambari. He reported that that worked with Storm 0.10.0 and Java 1.7.0_95, but I was unable to get that behavior with Storm 0.9.3.2.2.0.0-2041 and Java 1.7.0_75.
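To show what the %t timestamp format described above looks like, here is a small Python sketch. It only reproduces the <YYYY>-<MM>-<DD>_<HH>-<MM>-<SS> layout with a fixed example date; it is not how the JVM performs the substitution, and the file name around it is hypothetical.

```python
from datetime import datetime

# Fixed example moment so the result is reproducible.
started = datetime(2016, 1, 2, 3, 4, 5)
stamp = started.strftime("%Y-%m-%d_%H-%M-%S")

# Hypothetical file name with the timestamp substituted where %t would appear.
print(f"storm-worker-{stamp}-gc.log")
```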
