Grep multiple strings on large files
Problem description
I have a large number of large log files (each log file is around 200 MB, and I have 200 GB of data in total).
Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 into a new file.
First I used grep with a single parameter. Setting LC_ALL=C
made it a little faster, and switching to fgrep made it slightly faster again. Then I used parallel:
parallel -j 2 --pipe --block 20M
With that, for every 200 MB, I was able to extract one parameter in 5 seconds.
But when I put multiple parameters into one grep:
parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt
the time for the grep operation increases linearly (it now takes several minutes to grep one file). (Note that I had to use egrep for the multiple | alternatives; somehow plain grep didn't like them.)
Is there a faster/better way to solve this problem?
Note that I don't need regular expressions, because the patterns I am looking for are fixed. I just want to extract the lines that include a particular string.
Solution
Following up on the comments above, I ran another test. I took a file produced by the md5deep -rZ
command (size: 319 MB) and randomly selected 100 MD5 checksums from it (each 32 characters long).
For
time egrep '100|fixed|strings' md5 >/dev/null
the time is
real 0m16.888s
user 0m16.714s
sys 0m0.172s
while for
time fgrep -f 100_lines_patt_file md5 >/dev/null
the time is
real 0m1.379s
user 0m1.220s
sys 0m0.158s
That is roughly 12 times faster than egrep (16.9 s versus 1.4 s of real time).
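The two invocations timed above can be tried side by side on a small sample before running them on real logs. The sketch below is only an illustration: the file names, the sample lines, and the two patterns are made up, and it assumes a POSIX shell with grep and paste available. It builds the alternation from the same pattern file, so both variants search for exactly the same strings, and checks that they return the same lines.

```shell
#!/bin/sh
# Hypothetical sample data; replace with a real log file and pattern list.
workdir=$(mktemp -d)
log="$workdir/log.txt"
patterns="$workdir/patterns.txt"

printf '%s\n' \
  '00:00 param1=10' \
  '00:00 param2=20' \
  '00:00 param3=30' \
  '00:10 param1=11' \
  '00:10 param2=21' \
  '00:10 param3=31' > "$log"

# One fixed string per line (the role of 100_lines_patt_file above).
printf '%s\n' 'param1=' 'param3=' > "$patterns"

# Variant 1: a single regex alternation, as in the egrep command.
alternation=$(paste -s -d '|' "$patterns")
regex_out=$(LC_ALL=C grep -E "$alternation" "$log")

# Variant 2: fixed strings read from a file (grep -F is the standard spelling of fgrep).
fixed_out=$(LC_ALL=C grep -F -f "$patterns" "$log")

if [ "$regex_out" = "$fixed_out" ]; then
  echo "both variants selected the same lines"
fi
printf '%s\n' "$fixed_out"

rm -rf "$workdir"
```

Prefixing each grep call with time on a real log gives numbers comparable to the ones above. The sample patterns end in = on purpose: a bare fixed string such as param1 would also match param10 or param100.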
So, if you get only a 0.3 s improvement between egrep and fgrep, IMHO that means:
- your IO is too slow
The egrep run is then limited not by processor or memory but by IO, and (IMHO) that is why you don't get any speed improvement from fgrep.
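One rough way to check whether IO is the limiting factor is to time a plain read of the file and compare it with the time of the search. The sketch below is an assumption-laden illustration, not a benchmark: the sample file and patterns are generated on the spot, the variable names are made up, and it assumes that date supports +%s (one-second resolution, so it only says something useful on files as large as the real logs).

```shell
#!/bin/sh
# Hypothetical sample; point log and patterns at real files for a meaningful result.
workdir=$(mktemp -d)
log="$workdir/log.txt"
patterns="$workdir/patterns.txt"

# Generate a small log so the sketch runs anywhere.
i=0
while [ "$i" -lt 1000 ]; do
  printf 'ts=%d param%d=value\n' "$i" "$i"
  i=$((i + 1))
done > "$log"
printf '%s\n' 'param7=' 'param42=' > "$patterns"

# Time a raw read of the file (IO only, no searching).
start=$(date +%s)
cat "$log" > /dev/null
read_secs=$(( $(date +%s) - start ))

# Time the fixed-string search over the same file.
start=$(date +%s)
LC_ALL=C grep -F -f "$patterns" "$log" > /dev/null
grep_secs=$(( $(date +%s) - start ))

echo "raw read: ${read_secs}s, grep -F: ${grep_secs}s"

rm -rf "$workdir"
```

If the raw read takes nearly as long as the search, the disk is the bottleneck and a faster matcher will not help much. Note that a file read a second time may be served from the page cache, so the first run on a given file is the more telling one.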