正则表达式导致堆栈溢出 [英] Regular Expression causing Stack Overflow

查看:340
本文介绍了正则表达式导致堆栈溢出的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

针对我上一个问题:用于多字符串的ECMAScript Regex ,我已经实现了以下加载过程:

  void Load(const std :: string& szFileName)
{
static const std :: regex regexObject(===([^ =] +)=== \\\\
((?:。| \\\\
)*)\\\\
== = END \\1 ===,std :: regex_constants :: ECMAScript | std :: regex_constants :: optimize);
static const std :: regex regexData(<([^>] +)>:([^<] *)\\\\
,std :: regex_constants :: ECMAScript | std :: regex_constants :: optimize);

std :: ifstream inFile(szFileName);
inFile.exceptions(std :: ifstream :: badbit);

std :: string szFileData((std :: istreambuf_iterator< char>(inFile)),(std :: istreambuf_iterator< char>()))

inFile.close();

std :: vector< std :: future< void>> vecFutures;

for(std :: sregex_iterator itObject(szFileData.cbegin(),szFileData.cend(),regexObject),end; itObject!= end; ++ itObject)
{
if((* itObject)[1] ==OBJECT1)
{
vecFutures.emplace_back(std :: async([(std :: string szDataString){
for std :: sregex_iterator itData(szDataString.cbegin(),szDataString.cend(),regexData){// Do Stuff}
},(* itObject)[2] .str }
else if((* itObject)[1] ==OBJECT2)
{
vecFutures.emplace_back(std :: async [[(std :: string szDataString){
for(std :: sregex_iterator itData(szDataString.cbegin(),szDataString.cend(),regexData){// Do Stuff}
},(* itObject)[2] .str ;
}
}

for(auto& future:vecFutures)
{
future.get();
} $ b $但是,加载此文件会导致Stack Overflow(参数:0x00000001,0x00332FE4) :

  === OBJECT2 === 
< Name>:Test Manufacturer
< Supplier> ;:测试供应商
<地址>:测试多行
联系人
地址
<电子邮件>:test@test.co.uk
<电话号码> :0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
< Number>:1
< Name>:Test
< ;位置>:这里
<制造商>:
<型号>:12345
<序列号>:54321
< Owner>:Me
< IP地址>:0.0.0.0
=== END OBJECT1 ===

我一直无法找到Stack Overflow的源,但它看起来像外部 std :: sregex_iterator loop负责。



提前感谢!

解决方案

p>

  ===([^ =] +)=== \\\
((?:(?! ===) ^ \\\
] + \\\
)+)=== END \1 ===

在你的C ++中它显然会写成:

  ===([^ =] +)=== \ \\\
((?:(?! ===)[^ \\\\
] + \\\\
)+)=== END \\1 ===

这是为了最小的回溯(至少在匹配时),虽然我现在有点Tired-Face先生,可能会错过很多方法来改善它。



它使两个假设,用于避免大量的回溯堆栈溢出,如其他人所说):


  1. 从来没有 ===
  2. C ++支持这些正则表达式特性 - 特别是使用负值预期( ?!)。应该考虑使用ECMAScript方言。

解释:

  ===([^ =] +)=== \\\

匹配并捕获对象开始标记。 [^ =] 是避免相对少量回溯的一种方法,与你的一样 - 我们不使用 [^] / code>,因为我不知道OBJECT id中是否有空格。

  

开始捕获数据组。在其中,一个非捕获组,因为我们要

 (?!===)

否定前瞻 - 我们不想在捕获行开始处 === >

  [^ \\\
] + \\\

单独匹配一行。

 )+)

在开始和结束标记之间至少匹配一行,然后捕获单个组中的所有行。

  === END \1 === 

匹配结束标记。



比较(使用RegexBuddy)

原始版本:




  • 第一场比赛:1277步

  • 失败的匹配:1个步骤(这是由于对象之间的换行符)

  • 第二个匹配:396个步骤



每个添加的对象将导致前面的步骤数量增加。例如,添加一个对象(对象2的副本,重命名为3)将导致:2203步,1322步,425步。






  • 第一场比赛:67步

  • 比赛失败:1步

  • 比赛失败:1步


  • 第三场比赛:67步


Further to my previous question: ECMAScript Regex for a multilined string, I have implemented the following loading procedure:

void Load( const std::string& szFileName )
{
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
     static const std::regex regexData( "<([^>]+)>:([^<]*)\\n", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::ifstream inFile( szFileName );
     inFile.exceptions( std::ifstream::badbit );

     std::string szFileData( (std::istreambuf_iterator<char>(inFile)), (std::istreambuf_iterator<char>()) );

     inFile.close();

     std::vector<std::future<void>> vecFutures;

     for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          if( (*itObject)[1] == "OBJECT1" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
               }, (*itObject)[2].str() ) );
          }
          else if( (*itObject)[1] == "OBJECT2" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
               }, (*itObject)[2].str() ) );
          }
     }

     for( auto& future : vecFutures )
     {
          future.get();
     }
}

However, loading it with this file results in a Stack Overflow (parameters: 0x00000001, 0x00332FE4):

=== OBJECT2 ===
<Name>:Test Manufacturer
<Supplier>:Test Supplier
<Address>:Test Multiline
Contact
Address
<Email>:test@test.co.uk
<Telephone Number>:0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
<Number>:1
<Name>:Test
<Location>:Here
<Manufacturer>:
<Model Number>:12345
<Serial Number>:54321
<Owner>:Me
<IP Address>:0.0.0.0
=== END OBJECT1 ===

I have been unable to find the source of the Stack Overflow but it looks like the outer std::sregex_iterator loop is responsible.

Thanks in advance!

解决方案

Here's another attempt:

=== ([^=]+) ===\n((?:(?!===)[^\n]+\n)+)=== END \1 ===

In your C++ it would obviously be written as:

=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===

It's made for minimal backtracking (at least when matching), although I'm a bit Mr. Tired-Face at the moment, so probably missed quite a few ways to improve it.

It makes two assumptions, which are used to avoid a lot of backtracking (that possibly causes the stack overflow, as others have said):

  1. That there's never a === at the start of a line, except for the start/end marker lines.
  2. That C++ supports these regex features - specifically the use of a negative lookahead (?!). It should, considering it's ECMAScript dialect.

Explained:

=== ([^=]+) ===\n

Match and capture the object start marker. The [^=] is one way to avoid a relatively small amount of backtracking here, same as yours - we're not using [^ ], because I do not know if there may be spaces in the OBJECT id.

((?:

Start capturing group for data. Inside it, a non-capturing group, because we're going to match each line individually.

   (?!===)

Negative lookahead - we don't want === at the start of our captured line.

   [^\n]+\n

Matches one line individually.

)+)

Match at least one line between start and end markers, then capture ALL the lines in a single group.

=== END \1 ===

Match the end marker.

Comparison (using RegexBuddy):

Original version:

  • First match: 1277 steps
  • Failed match: 1 step (this is due to the line break between the objects)
  • Second match: 396 steps

Every added object will cause the amount of steps to grow for the previous ones. E.g., adding one more object (copy of object 2, renamed to 3) will result in: 2203 steps, 1322 steps, 425 steps.

This version:

  • First match: 67 steps
  • Failed match: 1 step (once again due to the line break between the objects)
  • Second match: 72 steps
  • Failed match: 1 step
  • Third match: 67 steps

这篇关于正则表达式导致堆栈溢出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆