Performance issue with parsing large log files (~5gb) using awk, grep, sed

Problem description

I am currently dealing with log files of approximately 5 GB in size. I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible. While searching through log files, I do the following: provide the request number to look for, then optionally provide the action as a secondary filter. A typical command looks like this:

fgrep '2064351200' example.log | fgrep 'action: example'

This is fine for dealing with smaller files, but with a log file that is 5 GB, it's unbearably slow. I've read online that it's better to use sed or awk to improve performance (or possibly even a combination of both), but I'm not sure how this is accomplished. For example, using awk, I have a typical command:

awk '/2064351200/ {print}' example.log

Basically, my ultimate goal is to be able to efficiently print/return the records (or line numbers) in a log file that contain the strings to match (there could be up to 4-5 strings, and I've read that piping is bad).

On a side note, in bash shell, if I want to use awk and do some processing, how is that achieved? For example:

BEGIN { print "File\tOwner" }
{ print $8, "\t", $3 }
END { print " - DONE -" }

That is a pretty simple awk script, and I would assume there's a way to put it into a one-liner bash command? But I'm not sure what the structure would be.
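For illustration, the same script collapsed into a single command would look something like the sketch below (input.txt is just a placeholder for whatever file or pipe feeds the script):

awk 'BEGIN { print "File\tOwner" } { print $8, "\t", $3 } END { print " - DONE -" }' input.txt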

Thanks in advance for the help. Cheers.

Recommended answer

You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:

time fgrep '2064351200' example.log >/dev/null
time egrep '2064351200' example.log >/dev/null
time sed -e '/2064351200/!d' example.log >/dev/null
time awk '/2064351200/ {print}' example.log >/dev/null

Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.
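For instance, a minimal timing harness along these lines (the three-run count is arbitrary) lets you compare the tools once the disk cache has warmed up:

# Time each candidate three times so disk caching and background load
# don't skew a single measurement; repeat for egrep, sed, etc.
for run in 1 2 3; do
    time fgrep '2064351200' example.log >/dev/null
done
for run in 1 2 3; do
    time awk '/2064351200/ {print}' example.log >/dev/null
done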

As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but then requires more CPU time to process because it has to be decompressed first. Try something like this:

compress -c example2.log >example2.log.Z
time zgrep '2064351200' example2.log.Z >/dev/null
gzip -c example2.log >example2.log.gz
time zgrep '2064351200' example2.log.gz >/dev/null
bzip2 -k example.log
time bzgrep '2064351200' example.log.bz2 >/dev/null

I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
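For example, a rough sketch for comparing a few gzip levels (the levels and file names here are just illustrative) could be:

# Compare on-disk size vs. search time for a few gzip compression levels.
for level in 1 6 9; do
    gzip -c -$level example.log > example.log.$level.gz
    ls -l example.log.$level.gz
    time zgrep '2064351200' example.log.$level.gz >/dev/null
done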

Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.
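For example, a minimal sketch of that idea (the subset file name is just illustrative):

# One-time preprocessing: keep only the lines for the request of interest...
fgrep '2064351200' example.log > subset-2064351200.log

# ...then run later, narrower searches against the much smaller subset:
fgrep 'action: example' subset-2064351200.log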

A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave, fgrep '2064351200' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.
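If you do want to avoid the pipe altogether, one option (a sketch, not necessarily faster than the piped fgreps) is to have a single awk invocation require all of the patterns at once:

# Print only lines that contain both strings, in one pass with no pipe:
awk '/2064351200/ && /action: example/' example.log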

tl;dr TEST ALL THE THINGS!

EDIT: If the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocess approach: compress (and maybe prescan) the log, then when scanning use the compressed (and/or prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:

# Precompress a snapshot of the log:
gzip -v -c example.log >example.log.gz
# Record how many bytes of the original the snapshot covers
# (line 2, column 2 of `gzip -l` output is the uncompressed size):
compressedsize=$(gzip -l example.log.gz | awk '{if(NR==2) print $2}')

# Search the compressed snapshot + whatever has been appended since:
{ gzip -cdfq example.log.gz; tail -c +$((compressedsize + 1)) example.log; } | egrep '2064351200'

If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:

# Prescan for a particular request (repeat for each request you'll be working with):
gzip -cdfq example.log.gz | egrep '2064351200' > prescan-2064351200.log

# Search the prescanned file + recent additions:
{ cat prescan-2064351200.log; tail -c +$((compressedsize + 1)) example.log | egrep '2064351200'; } | egrep 'action: example'
