为什么这种不匹配的模式的性能随搜索空间的大小而缩放? [英] Why does the performance of this non-matching pattern scale with the size of the search space?

查看:91
本文介绍了为什么这种不匹配的模式的性能随搜索空间的大小而缩放?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的代码如下:

  #include< regex> 

int main()
{
char buf [35000] = {};
自动开始= std :: cbegin(buf),结束= std :: cend(buf);

std :: cmatch组;
std :: regex :: flag_type标志= std :: regex_constants :: ECMAScript;
std :: regex re(R REGEX(^ [\x02-\x7f] \0 .. [\x01-\x0c] \0..\0\0) REGEX,标志);
std :: regex_search(begin,end,groups,re);
}



调查中,我为该缓冲区大小插入了不同的值,发现



(小= 100,大= 35000,巨大= 100,000;未锚定是模式中省略了 ^



结果为



std :: regex 不在默认情况下处于多行模式下(对吗?),所以我以为锚点( ^ )可以防止这种恶作剧。在搜索字符串的开头找不到模式?不返回任何匹配项,工作已完成。



libc ++和libstdc ++似乎都发生了。



我缺少什么对这个?



明显的解决方法包括给 std :: regex_search 只是缓冲区的前缀(大的前缀)足以包含所有可能的匹配项,但不超过必需的范围),或仅以其他方式检查字符串。但是我很好奇。






FWIW,具有几乎等效的JavaScript测试用例,OSX上的Safari 12.0






为免生疑问,我使用了像 ^ what 和获得了相同的结果。如果可以提高动机,以后可能会更新上述示例以保持一致。 :)

解决方案

这仅仅是因为libstdc ++和libc ++没有实现这种优化。



以下是 libstdc ++实现的 regex_search 的主要部分

  template< typename _BiIter,typename _Alloc,typename _TraitsT,
bool __dfs_mode>
bool _Executor< _BiIter,_Alloc,_TraitsT,__ dfs_mode> ::
_M_search()
{
如果(_M_search_from_first())
返回true;
if(_M_flags& regex_constants :: match_continuous)
返回false;
_M_flags | = regex_constants :: match_prev_avail;
而(_M_begin!= _M_end)
{
++ _ M_begin;
如果(_M_search_from_first())
返回true;
}
返回false;
}

因此,即使正则表达式包含<$ c,它也会遍历整个范围$ c> ^



libc ++的实现


I have code like the following:

#include <regex>

int main()
{
   char buf[35000] = {};
   auto begin = std::cbegin(buf), end = std::cend(buf);

   std::cmatch groups;
   std::regex::flag_type flags = std::regex_constants::ECMAScript;
   std::regex re(R"REGEX(^[\x02-\x7f]\0..[\x01-\x0c]\0..\0\0)REGEX", flags);
   std::regex_search(begin, end, groups, re);
}

… and noticed that it performed suspiciously slowly.

Investigating, I plugged in different values for that buffer size, and found that the search gets slower as the buffer expands:

(small=100, large=35000, huge=100000; "unanchored" is with ^ omitted from the pattern)

The results are as I'd expect if I provide an input that actually matches the pattern:

std::regex is not in multiline mode by default (right?), so I'd have thought that anchor (^) would have prevented such shenanigans. Pattern not found at the start of the search string? Return no match, job done.

Seems to happen with both libc++ and libstdc++.

What am I missing about this? What's causing it?

Obvious workarounds include giving std::regex_search just a prefix of the buffer (a prefix large enough to encompass all possible matches but no more than necessary), or just examining the string in some other way. But I'm curious.


FWIW, with a near-equivalent JavaScript testcase, Safari 12.0 on OSX works as I'd expect, with only a very small variation between the three searches (which I'm attributing to random factors) and no obvious pattern:


For the avoidance of doubt, I retested with a regex as simple as ^what and got the same results. Might update the above examples later for coherence if I can work up the motivation. :)

解决方案

It's simply because libstdc++ and libc++ do not implement such optimization.

The following is the main part of libstdc++'s implementation of regex_search:

template<typename _BiIter, typename _Alloc, typename _TraitsT,
     bool __dfs_mode>
  bool _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>::
  _M_search()
  {
    if (_M_search_from_first())
      return true;
    if (_M_flags & regex_constants::match_continuous)
      return false;
    _M_flags |= regex_constants::match_prev_avail;
    while (_M_begin != _M_end)
    {
      ++_M_begin;
      if (_M_search_from_first())
        return true;
    }
    return false;
  }

So it does traverse the whole range even if the regular expression contains ^.

So is libc++'s implementation.

这篇关于为什么这种不匹配的模式的性能随搜索空间的大小而缩放?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆