Using a regex_iterator on an istream
Question
I want to be able to solve problems like this one: "Getting std::ifstream to handle LF, CR, and CRLF?", where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
- Read the istream a character at a time
- Collect the characters
- When a delimiter is hit, return the collection as a token
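For reference, the three manual steps above could look roughly like the sketch below. The function name tokenize_lines is hypothetical, and the delimiter set ("\n", "\r", or "\n\r") is an assumption taken from the example string used in this question, not a general line-ending rule.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads `in` one character
// at a time and splits on "\n", "\r", or "\n\r" (assumed delimiter set).
std::vector<std::string> tokenize_lines(std::istream& in) {
    std::vector<std::string> tokens;
    std::string current;
    char c;
    while (in.get(c)) {                      // read a character at a time
        if (c == '\n' || c == '\r') {        // a delimiter was hit
            if (c == '\n' && in.peek() == '\r') {
                in.get();                    // treat "\n\r" as one delimiter
            }
            tokens.push_back(current);       // return the collection as a token
            current.clear();
        } else {
            current += c;                    // collect the characters
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);           // trailing token with no delimiter
    }
    return tokens;
}
```

Feeding it a std::istringstream holding "A\nB\rC\n\r" should yield the tokens A, B, and C. Every new delimiter rule means another branch in the loop, which is the bookkeeping a regex would otherwise hide.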
Regexes are very good at tokenizing strings with complex delimiters:
string foo{ "A\nB\rC\n\r" };
vector<string> bar;
// Named regex: regex_iterator cannot be constructed from a temporary regex
const regex re{ "(.*)(?:\n\r?|\r)" };
// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), re), sregex_iterator(),
          back_inserter(bar), [](const smatch& i){ return i[1].str(); });
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.
Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?
Recommended answer
This question is about code appearance:
- Since we know that a regex will work 1 character at a time, this question is asking to use a library to parse the istream 1 character at a time rather than internally reading and parsing the istream 1 character at a time
- Since parsing an istream 1 character at a time will still copy that one character to a temp variable (buffer), this question seeks to avoid writing all of that buffering code internally, depending on a library to abstract it instead
C++11's regexes use ECMA-262, which does not support lookbehinds (lookaheads are supported): https://stackoverflow.com/a/14539500/2642059 This means that a regex without lookaheads or backreferences could, in principle, match using only an input_iterator_tag, but clearly those implemented in C++11 do not: regex_iterator requires bidirectional iterators.
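To make the iterator-category point concrete: std::regex_iterator is specified for bidirectional iterators, while an iterator that reads straight from a stream is only an input iterator. A small compile-time check can show the mismatch. The alias and constant names here are mine, chosen for illustration:

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// Category of an iterator that reads directly from a stream buffer.
using stream_category =
    std::iterator_traits<std::istreambuf_iterator<char>>::iterator_category;

// Category of the string iterator that sregex_iterator is built on.
using string_category =
    std::iterator_traits<std::string::const_iterator>::iterator_category;

// True when the stream iterator is a plain, single-pass input iterator.
constexpr bool stream_is_input_only =
    std::is_same<stream_category, std::input_iterator_tag>::value;

// True when the string iterator is at least bidirectional.
constexpr bool string_is_bidirectional =
    std::is_base_of<std::bidirectional_iterator_tag, string_category>::value;

static_assert(stream_is_input_only,
              "istreambuf_iterator is single-pass");
static_assert(string_is_bidirectional,
              "string iterators satisfy regex_iterator's requirement");
```

Because the stream iterator cannot step backwards or be re-read, it cannot be handed to regex_iterator directly. This is why the characters have to land in some buffer first.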
boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available among the C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input, the regex will "hold its finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.
boost::match_partial has 4 drawbacks:
- In the worst case, like "ABC\n", this saves the user no size, and the whole istream must still be slurped
- If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
- Any time the buffer size selected is too small, additional computations will be required compared to slurping the entire file; therefore this method excels in a delimiter-dense string
- The inclusion of boost always causes bloat
Circling back to answer the question: a standard library regex_iterator cannot operate on an input_iterator_tag, so slurping of the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping of the whole file anyway, it is not a good answer to this question.
For the best code appearance, slurping the whole file and running a standard regex_iterator over it is your best bet.
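A minimal sketch of that slurp-then-iterate approach, assuming the stream fits in memory. The helper name slurp_and_tokenize is hypothetical, and it returns capture group 1 of every match, mirroring the example in the question:

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a standard facility): slurps `in` into a string,
// then returns capture group 1 of every match of `re`.
std::vector<std::string> slurp_and_tokenize(std::istream& in,
                                            const std::regex& re) {
    // Slurp: copy every remaining character of the stream into a string.
    const std::string text{ std::istreambuf_iterator<char>{in},
                            std::istreambuf_iterator<char>{} };

    std::vector<std::string> tokens;
    // `re` is a named object that outlives the iterators, as regex_iterator needs.
    for (std::sregex_iterator it(text.cbegin(), text.cend(), re), end;
         it != end; ++it) {
        tokens.push_back((*it)[1].str());
    }
    return tokens;
}
```

Called with a std::istringstream over "A\nB\rC\n\r" and the regex "(.*)(?:\n\r?|\r)", this should produce A, B, and C. The whole input is held in memory at once, which is the trade-off the answer accepts in exchange for simpler code.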