Grep multiple strings on large files


Problem description

I have a large number of large log files (each log file is around 200 MB, and I have 200 GB of data in total).

Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 to a new file.

First I used grep with 1 parameter, then setting LC_ALL=C made it a little faster, and switching to fgrep made it slightly faster again. Then I used parallel:

parallel -j 2 --pipe --block 20M

and finally, for every 200MB, I was able to extract 1 parameter in 5 seconds.

But when I put multiple parameters into one grep:

parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt

then the time for the grep operation increases linearly (it now takes several minutes to grep one file). (Note that I had to use egrep for the multiple pipes; somehow grep didn't like them.)

Is there a faster/better way to solve this problem?

Note that I don't need regular expressions, because the patterns I am looking for are fixed. I just want to extract the lines that include a particular string.

Solution

In response to the above comments I did another test. I took a file generated by the md5deep -rZ command (size: 319 MB) and randomly selected 100 md5 checksums (each 32 characters long).

The command

time egrep '100|fixed|strings' md5 >/dev/null

took

real    0m16.888s
user    0m16.714s
sys     0m0.172s

For the command

time fgrep -f 100_lines_patt_file md5 >/dev/null

the time is

real    0m1.379s
user    0m1.220s
sys     0m0.158s

That is roughly 12 times faster than egrep (16.9 s versus 1.4 s).
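The pattern-file approach measured above can be sketched roughly as follows. This is a minimal illustration, not the exact setup from the test: the file names (log.txt, patterns.txt) and the sample log lines are hypothetical stand-ins, and grep -F is used as the equivalent of fgrep.

```shell
#!/bin/sh
# Hypothetical file names and contents; substitute your own log and pattern list.
workdir=$(mktemp -d)
log="$workdir/log.txt"
patterns="$workdir/patterns.txt"

# Tiny stand-in for a real log: "timestamp name value" on each line.
printf '%s\n' \
  '2014-01-01T00:00 param1 10' \
  '2014-01-01T00:00 param2 20' \
  '2014-01-01T00:00 other9 30' > "$log"

# One fixed string per line; with -F no regex escaping is needed.
printf '%s\n' 'param1' 'param2' > "$patterns"

# -F: treat patterns as fixed strings, -f: read them from a file,
# -n: prefix each match with its line number.
# LC_ALL=C avoids multibyte locale handling.
matches=$(LC_ALL=C grep -n -F -f "$patterns" "$log")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

With a real log, the pattern file would hold the 100 parameter names, one per line, instead of one long 'param1|param2|...' alternation on the command line.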

So if you see only a 0.3 second improvement between egrep and fgrep, IMHO that means:

  • your IO is too slow

In that case egrep is limited not by processor or memory but by IO, and (IMHO) that is why you don't get any speed improvement with fgrep.
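One rough way to check whether IO is the limit is to compare how long it takes just to read the file with how long the fixed-string grep takes on the same file. The sketch below assumes GNU date (for the %N nanosecond format) and uses a small synthetic log with hypothetical names; elapsed_ms is a helper introduced here for illustration only.

```shell
#!/bin/sh
# Assumes GNU date (%N gives nanoseconds); file names are hypothetical.
workdir=$(mktemp -d)
log="$workdir/log.txt"
patterns="$workdir/patterns.txt"

# Small synthetic log so the sketch runs anywhere; use a real log in practice.
seq 1 200000 | sed 's/^/2014-01-01T00:00 param/' > "$log"
printf '%s\n' 'param1000' 'param2000' 'param3000' > "$patterns"

# Print the wall-clock milliseconds a command takes, discarding its output.
elapsed_ms() {
  start=$(date +%s%N)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

read_ms=$(elapsed_ms cat "$log")
grep_ms=$(elapsed_ms env LC_ALL=C grep -F -f "$patterns" "$log")

echo "plain read: ${read_ms} ms, fixed-string grep: ${grep_ms} ms"

rm -rf "$workdir"
```

If the two numbers are close on a real log, the disk is probably the bottleneck and a faster matcher will not help much. Note that a file read a second time is usually served from the page cache, so the comparison is only meaningful on files that have not just been read.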
