Performance issue with grep -f
Problem description

I'm using grep to search for several regular expressions in a single file. In particular, I'm considering a 100 MB file with English subtitles and running the following regular expressions, stored in the file patterns.txt:
Cas.*eharden
acr.*otic
syn.*thesizing
sub.*abbot
iss.*acharite
bot.*onne
dis.*similatory
ove.*rmantel
isa.*tin
ado.*nijah
sol.*ution
zei.*st
fam.*ousness
inq.*uisitress
aor.*tography
via.*duct
ama.*sa
der.*ive
pie.*tas
kit.*chenette
While doing so, I observed that the time required by grep does not grow linearly with the number of regular expressions. Indeed, it seems to grow exponentially.

Experiments

System: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz; 4 cores; 8 GB RAM

Case 1: 20 regexps

grep -c -f patterns.txt subtitles.txt

counts 2214 occurrences and takes 2,19s user 0,00s system 99% cpu 2,192 total.

Case 2: 30 regexps

If I add the following expressions to patterns.txt:
ort.*hros
ove.*ridentify
mis.*tiest
pay.*ne
int.*erchasing
jej.*uneness
sta.*lactiform
und.*ertrain
cob.*bles
Sub.*category
grep -c -f patterns.txt subtitles.txt
counts 2894 occurrences and takes 71,35s user 0,06s system 99% cpu 1:11,42 total.

Case 3: 35 regexps

Adding five more expressions:
dis.*embosom
imp.*ortunateness
ema.*thion
rho.*mb
haz.*elwood
grep -c -f patterns.txt subtitles.txt
counts 2904 occurrences and takes 211,18s user 0,22s system 99% cpu 3:31,58 total.

Why does grep -f exhibit such behavior? What is it doing internally? The whole set of regexps I've been using can be found here.

Answer

From reading the grep source code, it appears that the regexps in your file are not executed one at a time. Rather, they're read all at once into one big regex:

case 'f':
  fp = STREQ (optarg, "-") ? stdin : fopen (optarg, O_TEXT ? "rt" : "r");
  if (!fp)
    error (EXIT_TROUBLE, errno, "%s", optarg);
  for (keyalloc = 1; keyalloc <= keycc + 1; keyalloc *= 2)
    ;
  keys = xrealloc (keys, keyalloc);
  oldcc = keycc;
  while ((cc = fread (keys + keycc, 1, keyalloc - 1 - keycc, fp)) != 0)
    {
      keycc += cc;
      if (keycc == keyalloc - 1)
        keys = x2nrealloc (keys, &keyalloc, sizeof *keys);
    }
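In other words, the newline-separated patterns read from the file behave like one combined regex: a line is counted if any pattern matches it, just as with an explicit alternation. A quick sketch of that equivalence, using made-up stand-in files (pats.txt and input.txt here, not the real patterns.txt and subtitles.txt):

```shell
# Toy inputs standing in for the real pattern and subtitle files.
printf 'sol.*ution\nkit.*chenette\n' > pats.txt
printf 'a solution\nmy kitchenette\nno match here\n' > input.txt

# -f feeds both patterns at once; a line counts if any pattern matches it...
grep -c -f pats.txt input.txt                      # prints 2
# ...which gives the same count as one explicit alternation:
grep -c -E 'sol.*ution|kit.*chenette' input.txt    # prints 2
```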
This is confirmed by watching an strace of grep run on your command:

open("testreg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=124, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd8912fe000
read(3, "ort.*hros\nove.*ridentify\nmis.*ti"..., 4096) = 124
Backtracking regex implementations (which allow backreferences) do not run in O(n) time, but rather in O(2^m), which can lead to catastrophic runtimes.

Your assumption that grep is simply looping over each regex in turn, compiling each into a DFA and then executing it, is perfectly reasonable. However, it appears that the grep authors have assumed that by running your regexps all at once, they can probably do so more efficiently in some cases. The result is that by adding regexps to your file, you're falling into the O(2^m) runtime, resulting in an exponential increase in your running time.

As it turns out, it may be more efficient to simply loop over the regexps, executing them one at a time, forcing your grep to run linearly. On my laptop, running grep version 2.20, I get the following results using only the last 29 regexps in the file you provided:

[Downloads]$ wc -l patterns.txt
29 patterns.txt
[Downloads]$ time grep -c -f ~/Downloads/patterns.txt /usr/share/dict/linux.words
117
real 0m3.092s
user 0m3.027s
sys 0m0.047s
[csd@alcazar Downloads]$ time for regex in `cat ~/Downloads/patterns.txt`; do grep -c $regex /usr/share/dict/linux.words > /dev/null; done
real 0m0.474s
user 0m0.312s
sys 0m0.158s
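For comparison, the one-pattern-at-a-time approach from the transcript can also be written as a while-read loop that quotes each pattern (the unquoted $regex in the for loop above happens to work for these patterns, but would break on patterns containing spaces or glob characters). A minimal sketch with toy stand-in files rather than the real patterns.txt and dictionary:

```shell
# Toy stand-ins for the real patterns.txt and /usr/share/dict/linux.words.
printf 'sol.*ution\nzei.*st\n' > pats.txt
printf 'solution\nzeitgeist\nunrelated\n' > words.txt

# One grep invocation per pattern: total work now grows linearly with the
# number of patterns, at the cost of re-scanning the input for each one.
total=0
while IFS= read -r pattern; do
    count=$(grep -c -- "$pattern" words.txt)
    total=$((total + count))
done < pats.txt
echo "$total"    # prints 2 (one matching line per pattern)
```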