Managing a Large Number of Log Files Distributed Over Many Machines

Question

We have started using a third party platform (GigaSpaces) that helps us with distributed computing. One of the major problems we are trying to solve now is how to manage our log files in this distributed environment. We have the following setup currently.

Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications that use log4j and log to separate files. We also redirect stdout to a separate file to catch thread dumps and similar.
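
For context, a minimal sketch of the kind of per-process java.util.logging setup described here; the path and names are illustrative, not taken from the actual platform:

    import java.util.logging.FileHandler;
    import java.util.logging.Logger;
    import java.util.logging.SimpleFormatter;

    public class ProcessLogSetup {
        public static void main(String[] args) throws Exception {
            // Each process gets its own file; %u is j.u.l.'s unique-number
            // placeholder, which avoids clashes between processes on one machine
            FileHandler handler = new FileHandler("/var/log/platform/worker-%u.log", true);
            handler.setFormatter(new SimpleFormatter());

            // Attach to the root logger so all loggers in the process inherit it
            Logger.getLogger("").addHandler(handler);
            Logger.getLogger(ProcessLogSetup.class.getName()).info("process started");
        }
    }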

This results in about 200 different log files.

As of now we have no tooling to assist in managing these files. In the following cases this causes us serious headaches.

  • Troubleshooting when we do not know in advance in which process the problem occurred. In this case we currently ssh in to every machine and start grepping.
  • Trying to be proactive by regularly checking the logs for anything out of the ordinary. In this case we also currently log in to all machines and look at different logs using less and tail.
  • Setting up alerts. We are looking to set up alerts on events over a threshold. This is looking to be a pain with 200 log files to check.

Today we have only about 5 log events per second, but that will increase as we migrate more and more code to the new platform.

I would like to ask the community the following questions:

  • How do you handle similar cases with many log files, distributed over several machines, logged through different frameworks?
  • Why did you choose that particular solution?
  • How did your solution work out? What did you find good, and what did you find bad?

Many thanks.

Update

We ended up evaluating a trial version of Splunk. We are very happy with how it works and have decided to purchase it. Easy to set up, fast searches and a ton of features for the technically inclined. I can recommend that anyone in a similar situation check it out.

Answer

I would recommend piping all your Java logging to the Simple Logging Facade for Java (SLF4J) and then redirecting all logs from SLF4J to LogBack. SLF4J has special support for handling all popular legacy APIs (log4j, commons-logging, java.util.logging, etc.); see here.
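
For the java.util.logging side, here is a minimal sketch of installing the bridge, assuming the jul-to-slf4j artifact is on the classpath (log4j and commons-logging need no code at all: replacing their jars with log4j-over-slf4j and jcl-over-slf4j is enough):

    import org.slf4j.bridge.SLF4JBridgeHandler;

    public class LegacyLoggingBridge {
        public static void main(String[] args) {
            // Remove j.u.l.'s default handlers, then route everything that
            // goes through java.util.logging into SLF4J (and on to LogBack)
            SLF4JBridgeHandler.removeHandlersForRootLogger();
            SLF4JBridgeHandler.install();

            // From now on, java.util.logging calls end up in LogBack via SLF4J
            java.util.logging.Logger.getLogger("legacy").info("now routed through SLF4J");
        }
    }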

Once you have your logs in LogBack you can use one of its many appenders to aggregate logs over several machines; for details, see the manual section about appenders. Socket, JMS and SMTP seem to be the most obvious candidates.
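
As a rough illustration, a programmatic sketch of attaching a SocketAppender to the root logger; in practice this would normally live in logback.xml, and the collector host and port here are assumptions:

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.net.SocketAppender;
    import org.slf4j.LoggerFactory;

    public class CentralizedLogging {
        public static void main(String[] args) {
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

            // Serialize every event and ship it to a central collector,
            // e.g. LogBack's own SimpleSocketServer running on one machine
            SocketAppender socket = new SocketAppender();
            socket.setContext(context);
            socket.setRemoteHost("log-collector.example.com"); // assumed host
            socket.setPort(4560);                              // assumed port
            socket.start();

            Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
            root.addAppender(socket);
            root.info("this event also goes to the central collector");
        }
    }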

LogBack also has built-in support for monitoring for special conditions in log files and for filtering the events sent to a particular appender. So you could set up an SMTP appender to send you an e-mail every time there is an ERROR level event in the logs.
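
A hedged sketch of such an SMTP setup, configured programmatically for brevity; the host and addresses are placeholders, and the appender's default evaluator already triggers on ERROR-level events, so no extra filter is needed:

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.PatternLayout;
    import ch.qos.logback.classic.net.SMTPAppender;
    import org.slf4j.LoggerFactory;

    public class ErrorMailAlerts {
        public static void main(String[] args) {
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

            // Mails one message per ERROR event by default
            SMTPAppender mail = new SMTPAppender();
            mail.setContext(context);
            mail.setSmtpHost("smtp.example.com");    // hypothetical relay
            mail.setFrom("platform@example.com");    // hypothetical addresses
            mail.addTo("oncall@example.com");
            mail.setSubject("Platform ERROR: %logger{20}");

            // Layout for the mail body
            PatternLayout layout = new PatternLayout();
            layout.setContext(context);
            layout.setPattern("%date %level [%thread] %logger - %msg%n");
            layout.start();
            mail.setLayout(layout);
            mail.start();

            context.getLogger(Logger.ROOT_LOGGER_NAME).addAppender(mail);
        }
    }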

Finally, to ease troubleshooting, be sure to add some sort of requestID to all your incoming "requests", see my answer to this question for details.
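
One common way to carry such a requestID with SLF4J/LogBack is the Mapped Diagnostic Context (MDC); a small sketch, assuming the layout pattern includes %X{requestId} (the handler and key names are illustrative):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class RequestHandler {
        private static final Logger log = LoggerFactory.getLogger(RequestHandler.class);

        // Every log line emitted while handling this request carries its id
        public void handle(String requestId) {
            MDC.put("requestId", requestId);
            try {
                log.info("processing started");
                // ... actual work ...
                log.info("processing finished");
            } finally {
                MDC.remove("requestId"); // don't leak the id into the next request
            }
        }
    }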

EDIT: you could also implement your own custom LogBack appender and redirect all logs to Scribe.
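
A skeleton of what such a custom appender could look like; the ScribeClient type here is a hypothetical stand-in for a real Thrift-generated Scribe client, not an actual library class:

    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.AppenderBase;

    public class ScribeAppender extends AppenderBase<ILoggingEvent> {

        // Hypothetical minimal client interface; a real deployment would use
        // a Thrift-generated Scribe client here instead
        public interface ScribeClient {
            void log(String category, String message);
        }

        private final ScribeClient client;

        public ScribeAppender(ScribeClient client) {
            this.client = client;
        }

        @Override
        protected void append(ILoggingEvent event) {
            // Forward the formatted event to the central Scribe service
            client.log("platform", event.getFormattedMessage());
        }
    }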
