Using a `regex_iterator` on an istream


Problem description

I want to be able to solve problems like this one: "Getting std::ifstream to handle LF, CR, and CRLF?", where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:

1. Read the istream one character at a time
2. Collect the characters
3. When a delimiter is hit, return the collection as a token

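Regexes are very good at tokenizing strings with complex delimiters:

    // Needs <algorithm>, <iterator>, <regex>, <string>, <vector> and using namespace std.
    const regex delimited{ "(.*)(?:\n\r?|\r)" }; // named: sregex_iterator must not be given a temporary regex
    string foo{ "A\nB\rC\n\r" };
    vector<string> bar;

    // This puts {"A", "B", "C"} into bar
    transform(sregex_iterator(foo.cbegin(), foo.cend(), delimited), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });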
    

But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.

Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?

Solution

This question is about code appearance:

1. Since we know that a regex works through its input one character at a time, this question is asking for a library to parse the istream one character at a time, instead of code that itself reads and parses the istream one character at a time.
2. Since parsing an istream one character at a time still copies each character to a temporary variable (a buffer), the goal is to avoid doing all of that buffering in one's own code, depending on a library to abstract it instead.


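C++11's regexes default to the ECMA-262 (ECMAScript) grammar, which supports lookaheads but not lookbehinds: http://stackoverflow.com/a/14539500/2642059. With no lookbehind, matching proceeds forward through the input, so in principle a regex engine could work from an input_iterator_tag range, buffering internally whatever it needs to revisit. The C++11 library does not do this: std::regex_iterator requires bidirectional iterators.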
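The iterator categories involved can be checked at compile time. This is a minimal sketch using only standard headers; the aliases `string_category` and `stream_category` are names introduced here for illustration.

```cpp
#include <istream>
#include <iterator>
#include <string>
#include <type_traits>

// Category of the iterators that sregex_iterator walks over.
using string_category =
    std::iterator_traits<std::string::const_iterator>::iterator_category;
static_assert(std::is_base_of<std::bidirectional_iterator_tag, string_category>::value,
              "std::string iterators are at least bidirectional");

// Category of an istream_iterator: single-pass input only.
using stream_category =
    std::iterator_traits<std::istream_iterator<char>>::iterator_category;
static_assert(std::is_same<stream_category, std::input_iterator_tag>::value,
              "istream_iterator is an input iterator");
static_assert(!std::is_base_of<std::bidirectional_iterator_tag, stream_category>::value,
              "so it does not meet the bidirectional requirement");
```

If the assertions compile, the mismatch is confirmed: a string iterator can be handed to `std::regex_iterator`, while an `istream_iterator` does not satisfy its iterator requirement.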

boost::regex_iterator, on the other hand, supports the boost::match_partial flag (which has no counterpart among the C++11 regex flags). boost::match_partial lets the user slurp part of the file and run the regex over that. On a mismatch caused by reaching the end of the input, the regex "holds its finger" at that position in the regex and awaits more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html. In the average case, like "A\nB\rC\n\r", this can reduce the buffer size needed.
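The idea of running a regex over part of the stream and waiting for more input can be roughly approximated with the standard library alone. The sketch below is not Boost's match_partial and does not use Boost at all. It is a swapped-in technique that conservatively holds back any match touching the end of the buffer until more input has been read. `tokenize_chunked` and its `chunk` parameter are hypothetical names, and the regex is the one from the question.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, standard library only: reads `chunk` characters at a
// time and keeps whatever has not been safely matched yet in `buffer`.
std::vector<std::string> tokenize_chunked(std::istream& in, std::size_t chunk) {
    static const std::regex token("(.*)(?:\n\r?|\r)");
    if (chunk == 0) chunk = 1;  // a zero-sized read would never make progress
    std::vector<std::string> tokens;
    std::string buffer;
    std::string block(chunk, '\0');
    bool more = true;
    while (more) {
        in.read(&block[0], static_cast<std::streamsize>(block.size()));
        buffer.append(block, 0, static_cast<std::size_t>(in.gcount()));
        more = static_cast<bool>(in);  // a short read means the stream is exhausted

        std::size_t consumed = 0;
        for (std::sregex_iterator it(buffer.cbegin(), buffer.cend(), token), end;
             it != end; ++it) {
            const std::size_t match_end =
                static_cast<std::size_t>(it->position(0) + it->length(0));
            // A match touching the end of the buffer might still grow (a '\r'
            // could follow a trailing '\n'), so hold it back until more arrives.
            if (more && match_end == buffer.size()) break;
            tokens.push_back((*it)[1].str());
            consumed = match_end;
        }
        buffer.erase(0, consumed);  // keep only the unmatched tail
    }
    return tokens;
}
```

The buffer never needs to hold more than one pending token plus one chunk, which is the saving the partial-match approach aims for. When no delimiter arrives, the buffer still grows to the size of the whole stream.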

boost::match_partial has 4 drawbacks:

1. In the worst case, like "ABC\n", this saves the user no buffer space, and the whole istream must still be slurped
2. If the programmer guesses a buffer size that is too large, that is, one that holds the delimiter and a significant amount more, the benefit of the smaller buffer is squandered
3. Any time the selected buffer size is too small, additional computation is required compared to slurping the entire file, so this method excels only on a delimiter-dense string
4. The inclusion of Boost always causes bloat

Circling back to answer the question: a standard library regex_iterator cannot operate on an input_iterator_tag, so slurping the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance, though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.

For the best code appearance, slurping the whole file and running a standard regex_iterator over it is your best bet.
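A minimal sketch of that recommendation, assuming the regex from the question; `tokenize_slurped` is a hypothetical helper name introduced here:

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: slurp the whole stream, then tokenize it with a regex.
std::vector<std::string> tokenize_slurped(std::istream& in) {
    // istreambuf_iterator copies raw characters and does not skip whitespace.
    const std::string text((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    // Named regex object: it must outlive the iterators that refer to it.
    static const std::regex token("(.*)(?:\n\r?|\r)");

    std::vector<std::string> tokens;
    std::transform(std::sregex_iterator(text.cbegin(), text.cend(), token),
                   std::sregex_iterator(),
                   std::back_inserter(tokens),
                   [](const std::smatch& m) { return m[1].str(); });
    return tokens;
}
```

Called on a `std::ifstream` or a `std::istringstream` containing `"A\nB\rC\n\r"`, this returns the tokens `A`, `B` and `C`.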
