C++ vs .NET regex performance

Question

Prompted by a comment from Konrad Rudolph on a related question, I wrote the following program to benchmark regular expression performance in F#:

open System.Text.RegularExpressions
let str = System.IO.File.ReadAllText "C:\\Users\\Jon\\Documents\\pg10.txt"
let re = System.IO.File.ReadAllText "C:\\Users\\Jon\\Documents\\re.txt"
for _ in 1..3 do
  let timer = System.Diagnostics.Stopwatch.StartNew()
  let re = Regex(re, RegexOptions.Compiled)
  let res = Array.Parallel.init 4 (fun _ -> re.Split str |> Seq.sumBy (fun m -> m.Length))
  printfn "%A %fs" res timer.Elapsed.TotalSeconds

and the equivalent in C++:

#include "stdafx.h"

#include <windows.h>
#include <regex>
#include <vector>
#include <string>
#include <fstream>
#include <cstdio>
#include <codecvt>

using namespace std;

wstring load(wstring filename) {
    const locale empty_locale = locale::empty();
    typedef codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const locale utf8_locale = locale(empty_locale, converter);
    wifstream in(filename);
    wstring contents;
    if (in)
    {
        in.seekg(0, ios::end);
        contents.resize(in.tellg());
        in.seekg(0, ios::beg);
        in.read(&contents[0], contents.size());
        in.close();
    }
    return(contents);
}

int count(const wstring &re, const wstring &s){
    static const wregex rsplit(re);
    auto rit = wsregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = wsregex_token_iterator();
    int count=0;
    for (auto it=rit; it!=rend; ++it)
        count += it->length();
    return count;
}

int _tmain(int argc, _TCHAR* argv[])
{
    wstring str = load(L"pg10.txt");
    wstring re = load(L"re.txt");

    __int64 freq, tStart, tStop;
    unsigned long TimeDiff;
    QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
    QueryPerformanceCounter((LARGE_INTEGER *)&tStart);

    vector<int> res(4);

#pragma omp parallel num_threads(4)
    for(auto i=0; i<res.size(); ++i)
        res[i] = count(re, str);

    QueryPerformanceCounter((LARGE_INTEGER *)&tStop);
    TimeDiff = (unsigned long)(((tStop - tStart) * 1000000) / freq);
    printf("(%d, %d, %d, %d) %fs\n", res[0], res[1], res[2], res[3], TimeDiff/1e6);
    return 0;
}

Both programs load two files as Unicode strings (I'm using a copy of the Bible), construct a non-trivial Unicode regex, \w?\w?\w?\w?\w?\w, and split the string four times in parallel using that regex, returning the sum of the lengths of the split strings (in order to avoid allocation).

Running both in Visual Studio (with MP and OpenMP enabled for the C++) as release builds targeting 64-bit, the C++ takes 43.5s and the F# takes 3.28s (over 13x faster). This does not surprise me, because I believe .NET JIT-compiles the regex to native code whereas the C++ stdlib interprets it, but I'd like some peer review.

Is there a perf bug in my C++ code or is this a consequence of compiled vs interpreted regular expressions?

EDIT: Billy ONeal has pointed out that .NET can have a different interpretation of \w so I have made it explicit in a new regex:

[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]

This actually makes the .NET code substantially faster (C++ is the same), reducing the time from 3.28s to 2.38s for F# (over 17x faster).

Solution

These benchmarks aren't really comparable: C++ and .NET implement completely different regular expression languages (ECMAScript vs. Perl, respectively), and are powered by completely different regular expression engines. .NET (to my understanding) is benefiting from the GRETA project (http://research.microsoft.com/en-us/downloads/bd99f343-4ff4-4041-8293-34c054efe749/default.aspx), which produced an absolutely fantastic regular expression engine that has been tuned for years. The C++ std::regex, in comparison, is a recent addition (at least on MSVC++, which I'm assuming you're using given the nonstandard types __int64 and friends).

You can see how GRETA did vs. a more mature std::regex implementation, boost::regex, here: http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/vc71-performance.html (though that test was done on Visual Studio 2003).

You should also keep in mind that regex performance is highly dependent on your source string and on your regex. Some regex engines spend lots of time parsing the regex in order to go faster through more source text, a tradeoff that makes sense only if you are parsing lots of text. Some regex engines scan quickly but are relatively expensive when they actually produce a match (so the number of matches would have an effect). There are huge numbers of tradeoffs here; a single pair of inputs really is going to cloud the story.

So to answer your question more explicitly: this kind of variation is normal across regex engines, be they compiled or interpreted. Looking at boost's tests above, the fastest and slowest implementations often differ by a factor of hundreds, so 17x isn't all that strange depending on your use case.
