Why does std::regex_iterator cause a stack overflow with this data?

Question

I've been using std::regex_iterator to parse log files. My program had been working well for some weeks and had parsed millions of log lines, until today, when I ran it against a log file and got a stack overflow. It turned out that a single log entry in the file was causing the problem. Does anyone know why my regex is causing such massive recursion? Here's a small self-contained program that shows the issue (my compiler is VC2012):

#include <string>
#include <regex>
#include <iostream>

using namespace std;

std::wstring test = L"L3  T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n"
                L"  Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n"
                L"  Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  Operation: 3\n"
                L"  Hive: HKEY_CURRENT_USER\n"
                L"  Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  ValueName: SortOrder\n"
                L"  ValueType: REG_DWORD\n"
                L"  ValueData: 0\n"
                L"L4  T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n";

int wmain(int argc, wchar_t* argv[])
{
    static wregex rgx_log_lines(
        L"^L(\\d+)\\s+"             // Level
        L"T(\\d+)\\s+"              // TID
        L"(\\d+)\\s+"               // Timestamp
        L"\\[((?:\\w|\\:)+)\\]"     // Function name
        L"((?:"                     // Complex pattern
          L"(?!"                    // Stop matching when...
            L"^L\\d"                // New log statement at the beginning of a line
          L")"                      
          L"[^]"                    // Matching all until then
        L")*)"                      // 
        );

    try
    {
        for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it)
        {
            wcout << (*it)[1] << endl;
            wcout << (*it)[2] << endl;
            wcout << (*it)[3] << endl;
            wcout << (*it)[4] << endl;
            wcout << (*it)[5] << endl;
        }
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
    }

    return 0;
}


Recommended answer

Negative lookahead patterns that are tested on every character seem like a bad idea to me, and what you're trying to do is not complicated. You want to match (1) the rest of the line and then (2) any number of following (3) lines which start with something other than L\d (there is a small bug; see below). Note that these are regexes; to write them as string literals, change each \ to \\. The pattern:

 .*\n(?:(?:[^L]|L\D).*\n)*
 |   |  |
 +-1 |  +---------------3
     +---------------------2

In ECMAScript mode, . should not match \n, but you could always replace the two .s in that expression with [^\n].
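As a rough illustration of the escaping note, here is the first expression written as a C++ string literal and applied to the text that follows a log header. The helper name match_body is hypothetical, and the use of match_continuous to pin the match to the start of the input is a choice made for this sketch, not something the answer prescribes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the prefix of `text` consumed by the
// line-oriented body pattern, or an empty string if it does not match.
std::string match_body(const std::string& text) {
    // The regex  .*\n(?:(?:[^L]|L\D).*\n)*  as a string literal:
    // every backslash in the regex is doubled.
    static const std::regex body(".*\\n(?:(?:[^L]|L\\D).*\\n)*");

    std::smatch m;
    // match_continuous requires the match to begin at text.begin(),
    // so no ^ anchor is needed.
    if (std::regex_search(text, m, body,
                          std::regex_constants::match_continuous)) {
        return m.str();
    }
    return std::string();
}
```

Called on a header remainder followed by indented detail lines and then a line beginning with L and a digit, the function should return everything up to, but not including, that last line. A continuation line that starts with L followed by a non-digit is still treated as part of the body.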

Edited to add: I realize that this may not work if there is a blank line just before the end of the log entry, but the following should cover that case. I also changed . to [^\n] for extra precision:

 [^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)*
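A minimal sketch of how the revised expression might replace the lookahead group in the questioner's pattern. LogEntry and parse_log are hypothetical names introduced here. The sketch also swaps std::wsregex_iterator and the leading ^ for a manual loop with match_continuous, on the assumption that each body ends exactly where the next entry begins; that choice avoids depending on whether ^ matches after a newline, which can vary between standard library implementations. Whether this removes the stack overflow on a particular compiler has not been verified here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for one parsed log entry.
struct LogEntry {
    std::wstring level, tid, timestamp, function, body;
};

// Hypothetical helper: parses back-to-back log entries from `text`.
std::vector<LogEntry> parse_log(const std::wstring& text) {
    static const std::wregex rgx(
        L"L(\\d+)\\s+"                // Level
        L"T(\\d+)\\s+"                // TID
        L"(\\d+)\\s+"                 // Timestamp
        L"\\[((?:\\w|:)+)\\]"         // Function name
        L"("                          // Body:
          L"[^\\n]*\\n"               //   rest of the header line
          L"(?:(?:(?:[^L\\n]|L\\D)[^\\n]*)?\\n)*"  // lines not starting with L<digit>
        L")");

    std::vector<LogEntry> entries;
    std::wsmatch m;
    std::wstring::const_iterator pos = text.cbegin();

    // Each match must start exactly at `pos` (match_continuous), and the
    // next search resumes where the previous entry's body ended.
    while (pos != text.cend() &&
           std::regex_search(pos, text.cend(), m, rgx,
                             std::regex_constants::match_continuous)) {
        entries.push_back(LogEntry{m[1].str(), m[2].str(), m[3].str(),
                                   m[4].str(), m[5].str()});
        pos = m[0].second;
    }
    return entries;
}
```

The loop stops at the first position where no entry header matches, so any trailing text that does not begin a log entry is left unparsed. A blank line inside an entry is consumed by the optional group matching nothing before the \n.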
