性能问题与grep -f [英] Performance issue with grep -f

查看:126
本文介绍了性能问题与grep -f的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用grep在单个文件中搜索多个正则表达式。
特别是,我正在考虑包含英文字幕的100 MB文件并运行以下存储在文件 patterns.txt中的正则表达式:



  Cas。* eharden 
acr。* aotic
syn。* thesizing
sub。* abbot
iss。* acharite
bot。* onne
dis。* similatory
ove。* rmantel
isa。* tin
ado。* nijah
sol。* ution
zei。* st
fam。* ousness
inq 。* uisitress
aor。* tography
via。*风管
ama。* sa
der。* $
pie。* tas
kit。* chenette

在这样做的过程中,我发现grep所需的时间并不随着常规的数量而线性增长表达式。事实上,它似乎成倍增长



实验



系统:Intel(R)Core(TM)i5-5200U CPU @ 2.20GHz; 4个核心; 8 GB RAM

案例1:20个正则表达式



命令 grep -c -f patterns.txt subtitles.txt 计数2214个事件,并且需要

2,19s用户0,00s系统99%cpu 2,192 total。



案例2:30个正则表达式



如果我将以下表达式添加到 patterns.txt 中,

  ort。* hros 
ove。* ridentify
mis。* tiest
pay。* ne
int。*交换
jej。* uneness
sta。* lactiform
und。* ertrain
cob。* bles
分类*类别

命令 grep -c -f patterns.txt subtitles.txt 计数2894次并发生71.35s用户0,06s系统99%CPU 1:11,42总数。

案例3:35个正则表达式



添加五个表达式:

  dis。* embosom 
imp。* ortunateness
ema。 * thion
rho。* mb
haz。* elwood

命令 grep -c -f patterns.txt subtitles.txt 计数290 4个事件,需要211,18个用户0,22个系统99%cpu 3:31,58个总共

为什么grep -f表现出这种行为?它在内部做什么?



我一直在使用的整套regexp可以在 here grep 源代码,看起来你的文件中的正则表达式不会一次执行一次。相反,它们被一次全部读入一个大的正则表达式中:

$ p codease'f':
fp = STREQ (optarg, - )? stdin:fopen(optarg,O_TEXT?rt:r);
if(!fp)
错误(EXIT_TROUBLE,errno,%s,optarg); (keyalloc = 1; keyalloc< = keycc + 1; keyalloc * = 2)
;
keys = xrealloc(keys,keyalloc);
oldcc = keycc;
while((cc = fread(keys + keycc,1,keyalloc - 1 - keycc,fp))!= 0)
{
keycc + = cc;
if(keycc == keyalloc - 1)
keys = x2nrealloc(keys,& keyalloc,sizeof * keys);
}

通过观看 strace

  open(testreg,O_RDONLY)= 3 
fstat( 3,{st_mode = S_IFREG | 0664,st_size = 124,...})= 0
mmap(NULL,4096,PROT_READ | PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS,-1,0)= 0x7fd8912fe000
read (3,ort。* hros\
ove。* ridentify \ nmis。* ti...,4096)= 124

回溯正则表达式实现(允许反向引用),不要在O(n)时间运行,而是运行O(2 ^ m),这可能导致 catastrophic runtimes。 您的假设 grep 只是循环遍历每个正则表达式,将每个正则表达式编译成一个DFA然后执行它是非常合理的。然而,看起来 grep 作者认为通过一次运行您的正则表达式,他们可能在某些情况下更有效地执行。结果是,通过在你的文件中添加正则表达式,你落入了O(2 ^ m)运行时,导致你的运行时间呈指数增长。



事实证明,简单地循环遍历每个执行它们的正则表达式可能会更有效,从而迫使您的grep线性运行。在我的笔记本电脑上运行grep版本2.20,我只使用您提供的文件中最后29个正则表达式得到以下结果:

  [下载] $ wc -l patterns.txt 
29 patterns.txt

[下载] $ time grep -c -f〜/ Downloads / patterns.txt / usr / share / dict / linux.words
117

real 0m3.092s
user 0m3.027s
sys 0m0.047s

[csd @ alcazar下载]在`cat〜/ Downloads / patterns.txt`中$正则表达式的时间; grep -c $ regex /usr/share/dict/linux.words>的/ dev / null的; done
real 0m0.474s
user 0m0.312s
sys 0m0.158s


I'm using grep to search for several regular expressions in a single file. In particular, I'm considering a 100 MB file with English subtitles and running the following regular expressions stored in file patterns.txt:

Cas.*eharden
acr.*otic
syn.*thesizing
sub.*abbot
iss.*acharite
bot.*onne
dis.*similatory
ove.*rmantel
isa.*tin
ado.*nijah
sol.*ution
zei.*st
fam.*ousness
inq.*uisitress
aor.*tography
via.*duct
ama.*sa
der.*ive
pie.*tas
kit.*chenette

While doing so I observed that the time required by grep does not grow linearly with the number of regular expressions. Indeed, it seems to grow exponentially.

Experiments

System: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz; 4 Cores; 8 GB RAM

Case 1: 20 regexps

Command grep -c -f patterns.txt subtitles.txt counts 2214 occurrences and takes
2,19s user 0,00s system 99% cpu 2,192 total.

Case 2: 30 regexps

If I add the following expressions to patterns.txt

ort.*hros
ove.*ridentify
mis.*tiest
pay.*ne
int.*erchasing
jej.*uneness
sta.*lactiform
und.*ertrain
cob.*bles
Sub.*category

Command grep -c -f patterns.txt subtitles.txt counts 2894 occurrences and takes 71,35s user 0,06s system 99% cpu 1:11,42 total.

Case 3: 35 regexps

Adding five more expressions:

dis.*embosom
imp.*ortunateness
ema.*thion
rho.*mb
haz.*elwood

Command grep -c -f patterns.txt subtitles.txt counts 2904 occurrences and takes 211,18s user 0,22s system 99% cpu 3:31,58 total

Why grep -f exhibit such behavior? What is it doing internally?

The whole set of regexp I've been using can be found here

解决方案

From reading the grep source code, it appears that the regex's in your file are not executed one at a time. Rather, they're read all at once into one big regex:

case 'f':
  fp = STREQ (optarg, "-") ? stdin : fopen (optarg, O_TEXT ? "rt" : "r");
  if (!fp)
    error (EXIT_TROUBLE, errno, "%s", optarg);
  for (keyalloc = 1; keyalloc <= keycc + 1; keyalloc *= 2)
    ;
  keys = xrealloc (keys, keyalloc);
  oldcc = keycc;
  while ((cc = fread (keys + keycc, 1, keyalloc - 1 - keycc, fp)) != 0)
    {
      keycc += cc;
      if (keycc == keyalloc - 1)
        keys = x2nrealloc (keys, &keyalloc, sizeof *keys);
    }

This is confirmed watching an strace of grep run on your command:

open("testreg", O_RDONLY)               = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=124, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd8912fe000
read(3, "ort.*hros\nove.*ridentify\nmis.*ti"..., 4096) = 124

Backtracking regex implementations (which allow backreferences), do not run in O(n) time, but rather O(2^m), which can lead to catastrophic runtimes.

Your assumption that grep is simply looping over each regex in turn, compiling each into a DFA and then executing it is perfectly reasonable. However, it appears that the grep authors have assumed that by running your regex's all at once, they can probably do so more efficiently in some cases. The result is that by adding regex's to your file, you're falling into the O(2^m) runtime, resulting in an exponential increase in your runtime.

As it turns out, it may be more efficient to simply loop over each regex executing them one at a time, forcing your grep to run linearly. On my laptop, running grep version 2.20, I get the following results using only the last 29 regex's in the file you provided:

[Downloads]$ wc -l patterns.txt 
29 patterns.txt

[Downloads]$ time grep -c -f ~/Downloads/patterns.txt /usr/share/dict/linux.words 
117

real    0m3.092s
user    0m3.027s
sys     0m0.047s

[csd@alcazar Downloads]$ time for regex in `cat ~/Downloads/patterns.txt`; do grep -c $regex /usr/share/dict/linux.words > /dev/null; done
real    0m0.474s
user    0m0.312s
sys     0m0.158s

这篇关于性能问题与grep -f的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆