Why pcre regex is much faster than c++11 regex
Question
Here is some sample code. This is the C++11 part, using std::cregex_iterator:
// Needs <chrono> and <regex>; input is a const char* of length input_length.
std::chrono::steady_clock::time_point begin0 = std::chrono::steady_clock::now();
std::regex re("<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>", std::regex::icase);
int found = 0;
for (std::cregex_iterator i = std::cregex_iterator(input, input + input_length, re);
     i != std::cregex_iterator();
     ++i)
{
    found++;
    if (found >= 10000) break; // stop after the first 10000 matches
}
std::chrono::steady_clock::time_point end0 = std::chrono::steady_clock::now();
This is the PCRE part. The regular expression is the same.
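// Needs <pcre.h> (the original PCRE 8.x API).
std::chrono::steady_clock::time_point begin4 = std::chrono::steady_clock::now();
const char* pError = NULL;
int errOffset;
int options = PCRE_MULTILINE | PCRE_CASELESS;
const char* regexp = "<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>";
pcre* pPcre = pcre_compile(regexp, options, &pError, &errOffset, 0);
int offset = 0;
int matches = -1;
int pMatches[6]; // output vector: size must be a multiple of 3
found = 0;       // reset the counter if it is shared with the C++11 part
while (pPcre != NULL && offset < input_length)
{
    matches = pcre_exec(pPcre, NULL, input, input_length, offset, 0, pMatches, 6);
    if (matches >= 1)
    {
        found++;
        offset = pMatches[1]; // resume at the end of the whole match
        if (found >= 10000) break; // stop after the first 10000 matches
    }
    else
        offset = input_length; // no further match: leave the loop
}
std::chrono::steady_clock::time_point end4 = std::chrono::steady_clock::now();
if (pPcre != NULL) pcre_free(pPcre); // released outside the timed region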
The result shows that PCRE is 100 times faster than C++11. I found some vector copying and resizing in the C++11 implementation. Are there other reasons?
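The snippets above record start and end time points but do not show how the elapsed time is derived. A minimal sketch of one way to do it, assuming a hypothetical helper named elapsed_ms that is not part of the original code:

```cpp
#include <chrono>

// Hypothetical helper (not in the original question): runs `work` once and
// returns the wall-clock time it took, in milliseconds.
template <typename F>
double elapsed_ms(F&& work)
{
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    // duration<double, std::milli> converts the tick count to fractional ms.
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

Each loop from the question could be wrapped in a lambda and passed to this helper, so that both engines are timed the same way and pattern compilation is either included in or excluded from both measurements.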
Recommended answer
PCRE benefits from a set of optimizations known as start-up optimizations, which are enabled by default. These optimizations include:
- A subject pre-scan for unanchored patterns (if no possible starting point is found, the engine doesn't even bother to go through the matching process.)
- Studying the pattern to determine the minimum length a match can have, so that a subject shorter than that is rejected without matching
- Auto-possessification
- Fast failure (if a specific required point is not found, the engine doesn't even bother to go through the matching process.)
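The pre-scan and minimum-length ideas from the list above can be imitated on top of std::regex. The following is only a sketch under stated assumptions, not how PCRE implements them: count_with_prescan is a hypothetical helper, the literal prefix and the minimum length are supplied by the caller rather than derived from the pattern, and the prefix search is case-sensitive, so it should not be combined with icase as written.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not part of PCRE or <regex>): counts matches of `re`,
// but only starts the regex engine where the literal `prefix` occurs and
// stops as soon as fewer than `min_len` characters remain.
std::size_t count_with_prescan(const std::string& subject,
                               const std::regex& re,
                               const std::string& prefix,
                               std::size_t min_len)
{
    std::size_t found = 0;
    std::size_t pos = 0;
    while (pos < subject.size() && subject.size() - pos >= min_len) {
        // Pre-scan: a cheap literal search replaces trying every offset.
        pos = subject.find(prefix, pos);
        if (pos == std::string::npos || subject.size() - pos < min_len)
            break;

        std::smatch m;
        const auto first = subject.cbegin() + static_cast<std::ptrdiff_t>(pos);
        // match_continuous forces the match to begin exactly at `first`.
        if (std::regex_search(first, subject.cend(), m, re,
                              std::regex_constants::match_continuous)) {
            ++found;
            const auto len = static_cast<std::size_t>(m.length(0));
            pos += len > 0 ? len : 1;  // resume after the match
        } else {
            ++pos;                     // no match here, find the next prefix
        }
    }
    return found;
}
```

For the pattern in the question, a caller would pass "<option" as the prefix and 17 as the minimum length. Whether this helps in practice depends on the input and on the standard library implementation.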
A superficial analysis of the pattern:
<option          # Subject pre-scan applied (unanchored pattern)
[\\s]
value
[\\s]*           # Auto-possessification applied (translates to \s*+)
=
[\\s]*           # Auto-possessification applied
\"([^\">]*)\"
[\\s]*           # Auto-possessification applied
[^>]*
>                # Min length (17 chars) check of subject string applied
Furthermore, if the input string doesn't contain a required character such as >, the match is supposed to fail fast. You should also be aware that performance can depend heavily on the input string.
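The fast-failure idea can be illustrated with a plain character search in front of std::regex. This is a minimal sketch, assuming a hypothetical helper named search_with_fast_fail and a required character chosen by the caller; it does not reproduce PCRE's internal logic.

```cpp
#include <cstddef>
#include <cstring>
#include <regex>

// Hypothetical helper: returns false immediately when the subject lacks a
// character that every match must contain; otherwise runs the regex as usual.
bool search_with_fast_fail(const char* input, std::size_t length,
                           const std::regex& re, char required)
{
    // memchr scans raw bytes, far cheaper than a backtracking match attempt.
    if (std::memchr(input, required, length) == nullptr)
        return false;
    return std::regex_search(input, input + length, re);
}
```

With the pattern from the question, the required character would be '>', so an input such as `<option value .` is rejected before the regex engine is started.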
Try running the pattern below:
(*NO_AUTO_POSSESS)(*NO_START_OPT)<option[\s]value[\s]*=[\s]*\"([^\">]*)\"[\s]*[^>]*>
against this input string (note the trailing period):
<option value .
and compare the results.