C ++与.NET regex性能 [英] C++ vs .NET regex performance

查看:291
本文介绍了C ++与.NET regex性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Konrad Rudolph对相关问题发表的评论中提出,我编写了以下程序,以便对正则表达式的性能进行基准测试F#:

 打开System.Text.RegularExpressions 
let str = System.IO.File.ReadAllTextC: \\\Users\\Jon\\Documents\\pg10.txt
let re = System.IO.File.ReadAllTextC:\\Users\\Jon\\ \\\Documents\\re.txt
for _ in 1..3 do
let timer = System.Diagnostics.Stopwatch.StartNew()
let re = Regex(re ,RegexOptions.Compiled)
let res = Array.Parallel.init 4(fun _ - > re.Split str |> Seq.sumBy(fun m - > m.Length))
printfn %A%fsres timer.Elapsed.TotalSeconds

和C ++中的等效项:

  #includestdafx.h

#include< windows.h>
#include< regex>
#include< vector>
#include< string>
#include< fstream>
#include< cstdio>
#include< codecvt>

using namespace std;

wstring load(wstring filename){
const locale empty_locale = locale :: empty();
typedef codecvt_utf8< wchar_t> converter_type;
const converter_type * converter = new converter_type;
const locale utf8_locale = locale(empty_locale,converter);
wifstream in(filename);
wstring content;
if(in)
{
in.seekg(0,ios :: end);
contents.resize(in.tellg());
in.seekg(0,ios :: beg);
in.read(& contents [0],content.size());
in.close();
}
return(contents);
}

int count(const wstring& re,const wstring& s){
static const wregex rsplit(re);
auto rit = wsregex_token_iterator(s.begin(),s.end(),rsplit,-1);
auto rend = wsregex_token_iterator();
int count = 0;
for(auto it = rit; it!= rend; ++ it)
count + = it-> length();
return count;
}

int _tmain(int argc,_TCHAR * argv [])
{
wstring str = load(Lpg10.txt);
wstring re = load(Lre.txt);

__int64 freq,tStart,tStop;
unsigned long TimeDiff;
QueryPerformanceFrequency((LARGE_INTEGER *)& freq);
QueryPerformanceCounter((LARGE_INTEGER *)& tStart);

vector< int> res(4);

#pragma omp parallel num_threads(4)
for(auto i = 0; i res [i] = count re,str);

QueryPerformanceCounter((LARGE_INTEGER *)& tStop);
TimeDiff =(unsigned long)(((tStop-tStart)* 1000000)/ freq);
printf((%d,%d,%d,%d)%fs \\\
,res [0],res [1],res [2],res [3],TimeDiff / 1e6 );
return 0;
}

两个程序都将两个文件加载为unicode字符串圣经),构造一个非平凡的unicode regex \w?\w?\w?\w?\w?\w



使用正则表达式运行这两个在Visual Studio中(使用MP和OpenMP为C ++启用)在64位版本构建中,C ++需要43.5s和F#需要3.28s(超过13倍更快)。这并不让我感到惊讶,因为我相信.NET JIT将正则表达式编译为本地代码,而C ++ stdlib解释它,但我想要一些同行评审。



编辑:Billy ONeal已经指出.NET可以有不同的解释 \w 所以我已经使它在一个新的正则表达式:

 [0-9A-Za-z_]→[0-9A-Za-z_]→[0-9A-Za-z_]→[0-9A-Za-z-] 9A-Za-z _]?[0-9A-Za-z_] 

NET代码实际上更快(C ++是相同的),减少时间从3.28s到2.38s的F#(超过17倍更快)。

解决方案

这些基准测试并不真正可比 - C ++和.NET实现完全不同的正则表达式语言(ECMAScript vs. Perl),并且由完全不同的正则表达式引擎提供支持。 .NET(据我理解)受益于 GRETA项目在这里,这产生了绝对梦幻般的正则表达式引擎已经调整了多年。比较的C ++ std :: regex 是一个最近的添加(至少在MSVC ++,我假设你使用给定非标准类型 __int64 和朋友)。



您可以看到GRETA如何与更成熟的 std :: regex 实现, boost :: regex ,其中: http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/vc71-performance.html (虽然该测试是在Visual Studio 2003)。



您还应该记住,regex性能高度依赖于源字符串和正则表达式。一些正则表达式引擎花费大量的时间来解析正则表达式,以更快地通过更多的源文本;一个折衷,只有当你解析了很多文本才有意义。一些正则表达式引擎折衷扫描速度,因为比较昂贵的匹配(所以匹配的数量会有影响)。这里有大量的权衡;



因此,为了更明确地回答你的问题:这种变化在正则表达式引擎中是正常的,无论是编译还是解释。看看上面的boost的测试,通常最快和最慢的实现之间的差异是几百倍的不同 - 17x不是所有的奇怪,这取决于你的用例。


Prompted by a comment from Konrad Rudolph on a related question, I wrote the following program to benchmark regular expression performance in F#:

open System.Text.RegularExpressions
let str = System.IO.File.ReadAllText "C:\\Users\\Jon\\Documents\\pg10.txt"
let re = System.IO.File.ReadAllText "C:\\Users\\Jon\\Documents\\re.txt"
for _ in 1..3 do
  let timer = System.Diagnostics.Stopwatch.StartNew()
  let re = Regex(re, RegexOptions.Compiled)
  let res = Array.Parallel.init 4 (fun _ -> re.Split str |> Seq.sumBy (fun m -> m.Length))
  printfn "%A %fs" res timer.Elapsed.TotalSeconds

and the equivalent in C++:

#include "stdafx.h"

#include <windows.h>
#include <regex>
#include <vector>
#include <string>
#include <fstream>
#include <cstdio>
#include <codecvt>

using namespace std;

wstring load(wstring filename) {
    const locale empty_locale = locale::empty();
    typedef codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const locale utf8_locale = locale(empty_locale, converter);
    wifstream in(filename);
    wstring contents;
    if (in)
    {
        in.seekg(0, ios::end);
        contents.resize(in.tellg());
        in.seekg(0, ios::beg);
        in.read(&contents[0], contents.size());
        in.close();
    }
    return(contents);
}

int count(const wstring &re, const wstring &s){
    static const wregex rsplit(re);
    auto rit = wsregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = wsregex_token_iterator();
    int count=0;
    for (auto it=rit; it!=rend; ++it)
        count += it->length();
    return count;
}

int _tmain(int argc, _TCHAR* argv[])
{
    wstring str = load(L"pg10.txt");
    wstring re = load(L"re.txt");

    __int64 freq, tStart, tStop;
    unsigned long TimeDiff;
    QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
    QueryPerformanceCounter((LARGE_INTEGER *)&tStart);

    vector<int> res(4);

#pragma omp parallel num_threads(4)
    for(auto i=0; i<res.size(); ++i)
        res[i] = count(re, str);

    QueryPerformanceCounter((LARGE_INTEGER *)&tStop);
    TimeDiff = (unsigned long)(((tStop - tStart) * 1000000) / freq);
    printf("(%d, %d, %d, %d) %fs\n", res[0], res[1], res[2], res[3], TimeDiff/1e6);
    return 0;
}

Both programs load two file as unicode strings (I'm using a copy of the Bible), construct a non-trivial unicode regex \w?\w?\w?\w?\w?\w and split the string four times in parallel using the regex returning the sum of the lengths of the split strings (in order to avoid allocation).

Running both in Visual Studio (with MP and OpenMP enabled for the C++) in release build targeting 64-bit, the C++ takes 43.5s and the F# takes 3.28s (over 13x faster). This does not surprise me because I believe .NET JIT compiles the regex to native code whereas the C++ stdlib interprets it but I'd like some peer review.

Is there a perf bug in my C++ code or is this a consequence of compiled vs interpreted regular expressions?

EDIT: Billy ONeal has pointed out that .NET can have a different interpretation of \w so I have made it explicit in a new regex:

[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]

This actually makes the .NET code substantially faster (C++ is the same), reducing the time from 3.28s to 2.38s for F# (over 17x faster).

解决方案

These benchmarks aren't really comparable -- C++ and .NET implement completely different regular expression languages (ECMAScript vs. Perl), and are powered by completely different regular expression engines. .NET (to my understanding) is benefiting from the GRETA project here, which produced an absolutely fantastic regular expression engine which has been tuned for years. The C++ std::regex in comparison is a recent addition (at least on MSVC++, which I'm assuming you're using given the nonstandard types __int64 and friends).

You can see how GRETA did vs. a more mature std::regex implementation, boost::regex, here: http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/vc71-performance.html (though that test was done on Visual Studio 2003).

You also should keep in mind that regex performance is highly dependent on your source string and on your regex. Some regex engines spend lots of time parsing the regex to go faster through more source text; a tradeoff that makes sense only if you are parsing lots of text. Some regex engines trade off scanning speed for being relatively expensive to make matches (so number of matches would have an effect). There are huge numbers of tradeoffs here; one pair of inputs really is going to cloud the story.

So to answer your question more explicitly: this kind of variation is normal across regex engines, be they compiled or interpreted. Looking at boost's tests above, often the difference between the fastest and slowest implementations were hundreds of times different -- 17x isn't all that strange depending on your use case.

这篇关于C ++与.NET regex性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆