Using a regex_iterator on an istream


Question

I want to be able to solve problems like this one: "Getting std::ifstream to handle LF, CR, and CRLF?", where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:


  1. Read the istream a character at a time
  2. Collect the characters
  3. When a delimiter is hit, return the collection as a token

Regexes are very good at tokenizing strings with complex delimiters:

string foo{ "A\nB\rC\n\r" };
vector<string> bar;

// Named regex: C++14 and later reject a temporary regex passed to sregex_iterator
const regex re("(.*)(?:\n\r?|\r)");

// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), re), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });

But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.

Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?

Recommended answer

This question is about code appearance:


  1. Since we know that a regex works 1 character at a time, this question is asking for a library to parse the istream 1 character at a time, rather than reading and parsing the istream 1 character at a time in the user's own code
  2. Since parsing an istream 1 character at a time still copies each character to a temp variable (buffer), this code seeks to avoid doing all of that buffering itself, depending on a library to abstract it instead






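C++11's regexes default to a modified ECMA-262 grammar, which does not support lookbehinds (lookaheads are supported): https://stackoverflow.com/a/14539500/2642059 Without lookbehinds a matcher never has to look at characters before the current position, so in principle a regex could be matched using only an input_iterator_tag, but std::regex_iterator requires bidirectional iterators, so an input iterator such as istream_iterator cannot be used with it.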
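
The iterator-category mismatch can be checked at compile time. The following is a small sketch using only standard type traits; the alias names stream_category and string_category are made up for illustration.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// Category of an iterator that reads straight from a stream buffer.
using stream_category =
    std::iterator_traits<std::istreambuf_iterator<char>>::iterator_category;
// Category of the iterator type that std::sregex_iterator is built on.
using string_category =
    std::iterator_traits<std::string::const_iterator>::iterator_category;

// Stream iterators are single-pass input iterators...
static_assert(std::is_same<stream_category, std::input_iterator_tag>::value,
              "istreambuf_iterator is an input iterator");
// ...so they are not bidirectional, which std::regex_iterator requires,
static_assert(!std::is_base_of<std::bidirectional_iterator_tag, stream_category>::value,
              "istreambuf_iterator is not bidirectional");
// while string iterators are (random access derives from bidirectional).
static_assert(std::is_base_of<std::bidirectional_iterator_tag, string_category>::value,
              "string iterators are bidirectional");
```

If all three assertions compile, the stream iterator falls short of what regex_iterator asks for, which is why the characters have to land in a buffer such as a std::string first.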

boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available among the C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input, the regex will "hold its finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can reduce the buffer size needed.

boost::match_partial has 4 drawbacks:


  1. In the worst case, like "ABC\n", this saves the user no buffer space and they must slurp the whole istream
  2. If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
  3. Any time the buffer size selected is too small, additional computations will be required compared to slurping the entire file, therefore this method excels in a delimiter-dense string
  4. The inclusion of Boost always causes bloat
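
The buffering trade-off behind these drawbacks can be illustrated without Boost. The sketch below is a hand-rolled approximation using only the standard library, not Boost's partial-match mechanism: tokenize_chunked and chunk_size are hypothetical names, and the delimiter regex is assumed from the question. It reads the stream in chunks, keeps only the unconsumed tail in the buffer, and defers any match that touches the end of the buffer, since a "\n" there might be followed by a "\r" in the next chunk.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not a Boost API: tokenizes `in` while buffering only the
// unconsumed tail of the stream plus the most recently read chunk.
std::vector<std::string> tokenize_chunked(std::istream& in, std::size_t chunk_size) {
    static const std::regex re("(.*)(?:\n\r?|\r)");
    if (chunk_size == 0) chunk_size = 1;      // a zero-sized read would never make progress
    std::vector<std::string> tokens;
    std::string buffer;
    std::string chunk(chunk_size, '\0');
    bool more = true;
    while (more) {
        in.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
        buffer.append(chunk, 0, static_cast<std::size_t>(in.gcount()));
        more = static_cast<bool>(in);         // false once a read comes up short
        std::size_t consumed = 0;
        for (std::sregex_iterator it(buffer.cbegin(), buffer.cend(), re), end; it != end; ++it) {
            const std::smatch& m = *it;
            const std::size_t match_end =
                static_cast<std::size_t>(m.position(0) + m.length(0));
            // The delimiter may continue in the next chunk, so wait for more input.
            if (more && match_end == buffer.size()) break;
            tokens.push_back(m[1].str());
            consumed = match_end;
        }
        buffer.erase(0, consumed);            // keep only the unmatched tail
    }
    return tokens;
}
```

With a delimiter-dense input the buffer stays near chunk_size, but when no delimiter appears until the end the buffer grows to the whole stream and the regex is re-run over it after every chunk, which mirrors drawbacks 1 and 3 above.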

Circling back to answer the question: a standard library regex_iterator cannot operate on an input_iterator_tag, so slurping the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.

For the best code appearance, slurping the whole file and running a standard regex_iterator over it is your best bet.
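
A minimal sketch of that slurp-then-match approach, assuming the delimiter regex from the question; tokenize_slurped is a hypothetical name chosen for illustration.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: slurps `in` into a string, then tokenizes that string
// with the delimiter regex from the question.
std::vector<std::string> tokenize_slurped(std::istream& in) {
    // Slurp: copy every remaining character of the stream into one string.
    const std::string text{ std::istreambuf_iterator<char>(in),
                            std::istreambuf_iterator<char>() };

    // The regex is a named object so that it outlives the iterators built on it.
    const std::regex re("(.*)(?:\n\r?|\r)");

    std::vector<std::string> tokens;
    std::transform(std::sregex_iterator(text.cbegin(), text.cend(), re),
                   std::sregex_iterator(),
                   std::back_inserter(tokens),
                   [](const std::smatch& m) { return m[1].str(); });
    return tokens;
}
```

The whole stream sits in memory at once, but the tokenizing logic stays a single transform call over the regex matches.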
