Grep multiple strings on large files

Views: 82. Published: 2018/5/28 19:46:03. Tags: bash, unix, logging, grep, large-files


Problem description

I have a large number of large log files (each is around 200 MB, about 200 GB of data in total).

Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 to a new file.

First I used grep with 1 parameter; setting LC_ALL=C made it a little faster, and switching to fgrep was also slightly faster. Then I used parallel:

  parallel -j 2 --pipe --block 20M

Finally, for every 200 MB, I was able to extract 1 parameter in 5 seconds.

But when I put multiple parameters into one grep:

  parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt

the time for the grep operation increased linearly (it now takes several minutes to grep 1 file). (Note that I had to use egrep for the multiple pipes; somehow grep didn't like them.)

Is there a faster or better way to solve this problem?

Note that I don't need regular expressions, because the patterns I am looking for are fixed. I just want to extract the lines that include a particular string.

Solution

In response to the comments above, I ran another test. I took a file produced by the md5deep -rZ command (size: 319 MB) and randomly selected 100 MD5 checksums (each 32 characters long).

With the patterns given as an egrep alternation:

  time egrep '100|fixed|strings' md5 >/dev/null

the time is

  real 0m16.888s
  user 0m16.714s
  sys  0m0.172s

With the patterns read from a file by fgrep:

  time fgrep -f 100_lines_patt_file md5 >/dev/null

the time is

  real 0m1.379s
  user 0m1.220s
  sys  0m0.158s

That is roughly 12 times faster than egrep in wall-clock time.

So if you see only a 0.3 s improvement between egrep and fgrep, IMHO that means your I/O is too slow. The computing time for egrep is then limited not by processor or memory but by I/O, and (IMHO) that is why you get no speed improvement from fgrep.
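The pattern-file approach from the answer can be sketched on a tiny sample. This is only an illustration under stated assumptions: the file names, timestamps and parameter names below are invented, and the sample is far too small to show any timing difference. It uses grep -F and grep -E, the spellings that POSIX specifies and that fgrep and egrep are older aliases for, and it keeps the LC_ALL=C setting mentioned in the question.

```shell
#!/bin/sh
# Hypothetical sample data: file names and parameter names are invented.
workdir=$(mktemp -d)

cat > "$workdir/log.txt" <<'EOF'
2018-05-28T19:40:00 param1=10
2018-05-28T19:40:00 param2=20
2018-05-28T19:40:00 param3=30
2018-05-28T19:50:00 param1=11
2018-05-28T19:50:00 param4=40
EOF

# One fixed string per line. The trailing "=" is included because a
# fixed-string match is a substring match: "param1" alone would also
# match a name such as "param10".
printf '%s\n' 'param1=' 'param3=' > "$workdir/patterns.txt"

# -F treats every pattern as a fixed string, -f reads the patterns from a file.
fixed_out=$(LC_ALL=C grep -F -f "$workdir/patterns.txt" "$workdir/log.txt")

# The same selection written as an extended-regex alternation, for comparison.
regex_out=$(LC_ALL=C grep -E 'param1=|param3=' "$workdir/log.txt")

# Number of matching lines, counted by grep itself.
match_count=$(LC_ALL=C grep -c -F -f "$workdir/patterns.txt" "$workdir/log.txt")

printf '%s\n' "$fixed_out"
rm -rf "$workdir"
```

On this sample both forms select the same three lines. With 100 patterns, the pattern file also avoids building one very long quoted alternation on the command line.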