Performance issue with grep -f
Problem description

I'm using grep to search for several regular expressions in a single file. In particular, I'm considering a 100 MB file with English subtitles and running the following regular expressions, stored in the file patterns.txt:
Cas.*eharden
acr.*otic
syn.*thesizing
sub.*abbot
iss.*acharite
bot.*onne
dis.*similatory
ove.*rmantel
isa.*tin
ado.*nijah
sol.*ution
zei.*st
fam.*ousness
inq.*uisitress
aor.*tography
via.*duct
ama.*sa
der.*ive
pie.*tas
kit.*chenette
While doing so, I observed that the time required by grep does not grow linearly with the number of regular expressions. Indeed, it seems to grow exponentially.

Experiments

System: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz; 4 cores; 8 GB RAM

Case 1: 20 regexps

grep -c -f patterns.txt subtitles.txt

counts 2214 occurrences and takes 2,19s user 0,00s system 99% cpu 2,192 total.

Case 2: 30 regexps

If I add the following expressions to patterns.txt:
ort.*hros
ove.*ridentify
mis.*tiest
pay.*ne
int.*erchasing
jej.*uneness
sta.*lactiform
und.*ertrain
cob.*bles
Sub.*category
grep -c -f patterns.txt subtitles.txt
counts 2894 occurrences and takes 71,35s user 0,06s system 99% cpu 1:11,42 total.

Case 3: 35 regexps

Adding five more expressions:
dis.*embosom
imp.*ortunateness
ema.*thion
rho.*mb
haz.*elwood
grep -c -f patterns.txt subtitles.txt
counts 2904 occurrences and takes 211,18s user 0,22s system 99% cpu 3:31,58 total.

Why does grep -f exhibit such behavior? What is it doing internally? The whole set of regexps I've been using can be found here.

Answer

From reading the grep source code, it appears that the regexps in your file are not executed one at a time. Rather, they're read all at once into one big regex:

case 'f':
  fp = STREQ (optarg, "-") ? stdin : fopen (optarg, O_TEXT ? "rt" : "r");
  if (!fp)
    error (EXIT_TROUBLE, errno, "%s", optarg);
  for (keyalloc = 1; keyalloc <= keycc + 1; keyalloc *= 2)
    ;
  keys = xrealloc (keys, keyalloc);
  oldcc = keycc;
  while ((cc = fread (keys + keycc, 1, keyalloc - 1 - keycc, fp)) != 0)
    {
      keycc += cc;
      if (keycc == keyalloc - 1)
        keys = x2nrealloc (keys, &keyalloc, sizeof *keys);
    }
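In other words, the newline-separated patterns read from the file behave like one combined regex: a line is counted if any pattern matches it, just as with an explicit alternation. A quick sketch of that equivalence, using made-up stand-in files (pats.txt and input.txt here, not the real patterns.txt and subtitles.txt):

```shell
# Toy inputs standing in for the real pattern and subtitle files.
printf 'sol.*ution\nkit.*chenette\n' > pats.txt
printf 'a solution\nmy kitchenette\nno match here\n' > input.txt

# -f feeds both patterns at once; a line counts if any pattern matches it...
grep -c -f pats.txt input.txt                      # prints 2
# ...which gives the same count as one explicit alternation:
grep -c -E 'sol.*ution|kit.*chenette' input.txt    # prints 2
```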
This is confirmed by watching an strace of grep run on your command:

open("testreg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=124, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd8912fe000
read(3, "ort.*hros\nove.*ridentify\nmis.*ti"..., 4096) = 124
Backtracking regex implementations (which allow backreferences) do not run in O(n) time, but rather in O(2^m), which can lead to catastrophic runtimes.

Your assumption that grep is simply looping over each regex in turn, compiling each into a DFA and then executing it, is perfectly reasonable. However, it appears that the grep authors have assumed that by running your regexps all at once, they can probably do so more efficiently in some cases. The result is that by adding regexps to your file, you're falling into the O(2^m) runtime, resulting in an exponential increase in your running time.

As it turns out, it may be more efficient to simply loop over the regexps, executing them one at a time, forcing your grep to run linearly. On my laptop, running grep version 2.20, I get the following results using only the last 29 regexps in the file you provided:

[Downloads]$ wc -l patterns.txt
29 patterns.txt
[Downloads]$ time grep -c -f ~/Downloads/patterns.txt /usr/share/dict/linux.words
117
real 0m3.092s
user 0m3.027s
sys 0m0.047s
[csd@alcazar Downloads]$ time for regex in `cat ~/Downloads/patterns.txt`; do grep -c $regex /usr/share/dict/linux.words > /dev/null; done
real 0m0.474s
user 0m0.312s
sys 0m0.158s
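For comparison, the one-pattern-at-a-time approach from the transcript can also be written as a while-read loop that quotes each pattern (the unquoted $regex in the for loop above happens to work for these patterns, but would break on patterns containing spaces or glob characters). A minimal sketch with toy stand-in files rather than the real patterns.txt and dictionary:

```shell
# Toy stand-ins for the real patterns.txt and /usr/share/dict/linux.words.
printf 'sol.*ution\nzei.*st\n' > pats.txt
printf 'solution\nzeitgeist\nunrelated\n' > words.txt

# One grep invocation per pattern: total work now grows linearly with the
# number of patterns, at the cost of re-scanning the input for each one.
total=0
while IFS= read -r pattern; do
    count=$(grep -c -- "$pattern" words.txt)
    total=$((total + count))
done < pats.txt
echo "$total"    # prints 2 (one matching line per pattern)
```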