Why is PCRE regex much faster than C++11 regex?

Question

Some sample code. This is the C++11 part, using cregex_iterator:

std::chrono::steady_clock::time_point begin0 = std::chrono::steady_clock::now();
std::regex re("<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>", std::regex::icase);
int found = 0;
// Count matches, stopping once 10000 have been found.
for (std::cregex_iterator i = std::cregex_iterator(input, input + input_length, re);
     i != std::cregex_iterator();
     ++i)
{
    found++;
    if (found < 10000) continue;
    break;
}
std::chrono::steady_clock::time_point end0 = std::chrono::steady_clock::now();

This is the PCRE part. The regular expression is exactly the same.

std::chrono::steady_clock::time_point begin4 = std::chrono::steady_clock::now();
const char *pError = NULL;
int errOffset;
int options = PCRE_MULTILINE | PCRE_CASELESS;
const char* regexp = "<option[\\s]value[\\s]*=[\\s]*\"([^\">]*)\"[\\s]*[^>]*>";
pcre* pPcre = pcre_compile(regexp, options, &pError, &errOffset, 0);
int offset = 0;
int matches = -1;
int pMatches[6];  // output vector; its size must be a multiple of 3
found = 0;        // reset in case the counter is shared with the std::regex loop above
while (pPcre != NULL && offset < input_length)
{
    matches = pcre_exec(pPcre, NULL, input, input_length, offset, 0, pMatches, 6);
    if (matches >= 1)
    {
        found++;
        offset = pMatches[1];  // end offset of the whole match; resume from there
        if (found < 10000) continue;
        break;  // found enough matches
    }
    else
        offset = input_length;  // no further match (or an error): stop
}

std::chrono::steady_clock::time_point end4 = std::chrono::steady_clock::now();
pcre_free(pPcre);  // release the compiled pattern, outside the timed region

The results show that PCRE is 100 times faster than C++11. I found some vector copying and resizing in the C++11 implementation. Are there other reasons?

Recommended answer

PCRE benefits from a set of optimizations known as start-up optimizations, which are enabled by default. These optimizations include:

  1. A subject pre-scan for unanchored patterns (if no possible starting point is found, the engine doesn't even bother to go through the matching process.)
  2. Studying the pattern to determine the minimum length a matching subject must have, so that shorter subjects are rejected up front
  3. Auto-possessification
  4. Fast failure (if a specific required point is not found, the engine doesn't even bother to go through the matching process.)
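
To make the first item concrete, here is a minimal sketch of how a subject pre-scan could be emulated by hand in front of std::regex. It is not PCRE's actual implementation, and count_with_prescan is a hypothetical helper name, not part of any library. It assumes every match must begin with the literal "<option" (compared case-insensitively, since the pattern is compiled with icase), so the regex engine is only started at positions where that literal occurs.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <regex>

// Hypothetical helper (not part of PCRE or <regex>): counts matches, but only
// starts the regex engine at positions where the literal prefix "<option" occurs.
int count_with_prescan(const char* input, std::size_t input_length, const std::regex& re)
{
    static const char prefix[] = "<option";
    const char* const prefix_end = prefix + sizeof(prefix) - 1;  // exclude the '\0'
    const char* const end = input + input_length;

    // Case-insensitive comparison, because the pattern is compiled with icase.
    auto same_letter = [](char a, char b) {
        return std::tolower(static_cast<unsigned char>(a)) ==
               std::tolower(static_cast<unsigned char>(b));
    };

    int found = 0;
    const char* pos = input;
    std::cmatch m;
    while (true) {
        // Cheap linear scan for the next candidate starting point.
        pos = std::search(pos, end, prefix, prefix_end, same_letter);
        if (pos == end) break;  // no candidate left: the regex engine is never started

        // match_continuous restricts the attempt to matches that begin exactly at pos.
        if (std::regex_search(pos, end, m, re, std::regex_constants::match_continuous)) {
            ++found;
            pos = m[0].second;  // continue right after the match
        } else {
            ++pos;              // false candidate, keep scanning
        }
    }
    return found;
}
```

Whether this narrows the gap in practice depends on the input and on the standard library implementation, so it is worth timing on the real data.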

A surface-level analysis of the pattern:

<option             # Subject pre-scan applied (unanchored pattern)
    [\\s]
    value
    [\\s]*          # Auto-possessification applied (translates to \s*+)
    =
    [\\s]*          # //
    \"([^\">]*)\"   
    [\\s]*          # //
    [^>]*
>                   # Min length (17 chars) check of subject string applied

Furthermore, if the input string lacks a required character such as >, the match is supposed to fail fast. Keep in mind that performance can also depend heavily on the input string.
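
The minimum-length and fast-failure checks can be approximated by hand in the same spirit. The sketch below is an illustration under stated assumptions, not PCRE's code: may_match and search_with_fast_fail are hypothetical names, the 17-character minimum is taken from the analysis above, and the required character is assumed to be the closing >.

```cpp
#include <cstddef>
#include <cstring>
#include <regex>

// Hypothetical helper: returns false, without starting the regex engine, when
// the subject obviously cannot match the pattern discussed above.
bool may_match(const char* input, std::size_t input_length)
{
    // Shortest possible match is <option value=""> which is 17 characters long.
    constexpr std::size_t min_length = 17;
    if (input_length < min_length) return false;

    // Every match ends in '>', so a subject without one cannot match.
    return std::memchr(input, '>', input_length) != nullptr;
}

// Hypothetical wrapper: run the cheap checks first, the regex engine second.
bool search_with_fast_fail(const char* input, std::size_t input_length, const std::regex& re)
{
    if (!may_match(input, input_length)) return false;
    return std::regex_search(input, input + input_length, re);
}
```

Both checks are linear or constant time, so on inputs that cannot match they avoid whatever backtracking the regex engine would otherwise do.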

Try running this pattern, which disables those optimizations:

(*NO_AUTO_POSSESS)(*NO_START_OPT)<option[\s]value[\s]*=[\s]*\"([^\">]*)\"[\s]*[^>]*>

over this input string (note the period at the end):

<option value                                                                 .

and compare the results (live demo).
