What are the guidelines regarding parsing with iostreams?


Question

I have found myself writing a lot of parsing code lately (mostly for custom formats, but that isn't really relevant).

To enhance reusability, I chose to base my parsing functions on I/O streams so that I can use them with things like boost::lexical_cast<>.

However, I realized that I have never read anything about how to do this properly.

To illustrate my question, let's say I have three classes, Foo, Bar and FooBar:

A Foo is represented by data in the following format: string(<number>, <number>).

A Bar is represented by data in the following format: string[<number>].

A FooBar is a kind of variant type that can hold either a Foo or a Bar.

Now let's say I wrote an operator>>() for my Foo type:

istream& operator>>(istream& is, Foo& foo)
{
    char c1, c2, c3;
    is >> foo.m_string >> c1 >> foo.m_x >> c2 >> std::ws >> foo.m_y >> c3;

    if ((c1 != '(') || (c2 != ',') || (c3 != ')'))
    {
      is.setstate(std::ios_base::failbit);
    }

    return is;
}

The parsing goes fine for valid data. But if the data is invalid:

  • foo might be partially modified;
  • some data in the input stream was read and is thus no longer available to further calls on is.

Also, I wrote another operator>>() for my FooBar type:

istream& operator>>(istream& is, FooBar& foobar)
{
  Foo foo;

  if (is >> foo)
  {
    foobar = foo;
  }
  else
  {
    is.clear();

    Bar bar;

    if (is >> bar)
    {
      foobar = bar;
    }
  }

  return is; 
}

But obviously it doesn't work, because if is >> foo fails, some data has already been read and is no longer available for the call to is >> bar.

So here are my questions:

  • Where is my mistake here?
  • Should one write one's operator>> so that the initial data is still available after a failure? If so, how can I do that efficiently?
  • If not, is there a way to "store" (and restore) the complete status of an input stream: state and data?
  • What are the differences between failbit and badbit? When should we use one or the other?
  • Is there any online reference (or a book) that explains in depth how to deal with iostreams? Not just the basic stuff: the complete error handling.

Thank you very much.

Recommended answer

Personally, I think these are reasonable questions, and I remember very well that I struggled with them myself. So here we go:

Where is my mistake here?

I wouldn't call it a mistake, but you probably want to make sure you don't have to back off from what you have read. That is, I would implement three versions of the input functions. Depending on how complex the decoding of a specific type is, I might not even share the code, because it might be just a small piece anyway. If it is more than a line or two, I probably would share the code. That is, in your example I would have an extractor for FooBar which essentially reads the Foo or the Bar members and initializes the object correspondingly. Alternatively, I would read the leading part and then call a shared implementation that extracts the common data.

Let's do this exercise, because there are a few things which may be a complication. From your description of the format it isn't clear to me whether the "string" and what follows it are delimited, e.g. by whitespace (space, tab, etc.). If not, you can't just read a std::string: the default behavior is to read until the next whitespace. There are ways to tweak the stream into considering other characters as whitespace (using std::ctype<char>), but I'll just assume that there is a space. In this case, the extractor for Foo could look like this (note, all code is entirely untested):

std::istream& read_data(std::istream& is, Foo& foo, std::string& s) {
    Foo tmp(s);
    if (is >> get_char<'('> >> tmp.m_x >> get_char<','> >> tmp.m_y >> get_char<')'>)
        std::swap(tmp, foo);
    return is;
}
std::istream& operator>>(std::istream& is, Foo& foo)
{
    std::string s;
    return read_data(is >> s, foo, s);
}

The idea is that read_data() reads the part of a Foo that differs from a Bar when reading a FooBar. A similar approach would be used for Bar, but I omit this. The more interesting bit is the use of this funny get_char() function template. This is something called a manipulator, and it is just a function taking a stream reference as argument and returning a stream reference. Since we have different characters we want to read and compare against, I made it a template, but you could have one function per character as well. I'm just too lazy to type it out:

template <char Expect>
std::istream& get_char(std::istream& in) {
    char c;
    // Extract one non-whitespace character; flag a failure if it is not the expected one.
    if (in >> c && c != Expect) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

What looks a bit weird about my code is that there are few checks of whether things worked. That is because the stream just sets std::ios_base::failbit when reading a member fails, so I don't really have to bother myself. The only case where special logic is actually added is in get_char(), to deal with expecting a specific character. Similarly, there is no explicit skipping of whitespace characters (i.e. use of std::ws) going on: all the input functions used are formatted input functions, and these skip whitespace by default (you can turn this off using e.g. in >> std::noskipws, but then lots of things won't work).
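To make the manipulator idea concrete, here is a minimal self-contained sketch. The names expect and parse_pair are hypothetical helpers introduced only for illustration, and the format "(<x>, <y>)" with int fields is an assumption. It shows that one check at the end of the extraction chain is enough, and that whitespace between tokens is skipped by the formatted extractions themselves.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical manipulator: extracts one non-whitespace character and
// sets failbit unless it equals Expected.
template <char Expected>
std::istream& expect(std::istream& in) {
    char c;
    if (in >> c && c != Expected) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

// Parses "(<x>, <y>)" from text. The outputs are only written on success,
// so a failed parse leaves x and y untouched.
bool parse_pair(const std::string& text, int& x, int& y) {
    std::istringstream in(text);
    int tx = 0;
    int ty = 0;
    // Once any step fails, the remaining extractions are no-ops, so a single
    // check at the end covers the whole chain.
    if (in >> expect<'('> >> tx >> expect<','> >> ty >> expect<')'>) {
        x = tx;
        y = ty;
        return true;
    }
    return false;
}
```

Calling parse_pair(" ( 1 ,2 )", x, y) should succeed despite the irregular spacing, while "(1; 2)" and the truncated "(1, 2" should both be rejected.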

With a similar implementation for reading a Bar, reading a FooBar would look something like this:

std::istream& operator>> (std::istream& in, FooBar& foobar) {
    std::string s;
    if (in >> s) {
         switch ((in >> std::ws).peek()) {
         case '(': { Foo foo; read_data(in, foo, s); foobar = foo; break; }
         case '[': { Bar bar; read_data(in, bar, s); foobar = bar; break; }
         default: in.setstate(std::ios_base::failbit);
         }
    }
    return in;
}

This code uses an unformatted input function, peek(), which just looks at the next character. It either returns the next character or, if it fails, returns std::char_traits<char>::eof(). So, if there is either an opening parenthesis or an opening bracket, we have read_data() take over. Otherwise we always fail. That solves the immediate problem. On to distributing information...
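The peek-based dispatch can be tried out with the following sketch. It is not the code from the question: Foo, Bar and FooBar are hypothetical stand-ins (FooBar is assumed to be a std::variant, which needs C++17), expect is the same kind of manipulator as get_char(), and parse_foobar is a small helper added for convenience. As above, the name is assumed to be separated from the bracket by whitespace.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-ins for the types in the question.
struct Foo { std::string name; int x = 0; int y = 0; };
struct Bar { std::string name; int n = 0; };
using FooBar = std::variant<Foo, Bar>;

// Extracts one non-whitespace character; sets failbit on a mismatch.
template <char Expected>
std::istream& expect(std::istream& in) {
    char c;
    if (in >> c && c != Expected) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

// Reads the shared leading name, then peeks at the next character to decide
// which alternative to parse. Nothing has to be un-read.
std::istream& operator>>(std::istream& in, FooBar& out) {
    std::string name;
    if (in >> name) {
        switch ((in >> std::ws).peek()) {
        case '(': {
            Foo foo;
            foo.name = name;
            if (in >> expect<'('> >> foo.x >> expect<','> >> foo.y >> expect<')'>) {
                out = std::move(foo);
            }
            break;
        }
        case '[': {
            Bar bar;
            bar.name = name;
            if (in >> expect<'['> >> bar.n >> expect<']'>) {
                out = std::move(bar);
            }
            break;
        }
        default:
            in.setstate(std::ios_base::failbit);
        }
    }
    return in;
}

// Convenience wrapper: parses one FooBar from a string.
bool parse_foobar(const std::string& text, FooBar& out) {
    std::istringstream in(text);
    return static_cast<bool>(in >> out);
}
```

Unlike the snippet above, this variant assigns to the target only after the whole alternative was read successfully, so a failed parse leaves the previous value in place.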

Should one write one's operator>> so that the initial data is still available after a failure?

The general answer is: no. If you failed to read, something went wrong and you give up. This might mean that you need to work harder to avoid failing, though. If you really need to back off from the position you were at to parse your data, you might want to read the data into a std::string first, using std::getline(), and then analyze this string. Use of std::getline() assumes that there is a distinct character to stop at. The default is newline (hence the name), but you can use other characters as well:

std::getline(in, str, '!');

This would stop at the next exclamation mark and store all characters up to it in str. It would also extract the termination character, but it wouldn't store it. This sometimes makes things interesting when you read the last line of a file which may not end in a newline: std::getline() succeeds if it can read at least one character. If you need to know whether the last character in a file is a newline, you can test whether the stream reached end of file:

if (std::getline(in, str) && in.eof()) { std::cout << "file not ending in newline\n"; }
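Both points about std::getline() can be checked with a short sketch. The helper names split and last_line_unterminated are made up for this example; only std::getline() and eof() come from the text above.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits text at each occurrence of delim. The delimiter is extracted from
// the stream but not stored in the resulting fields.
std::vector<std::string> split(const std::string& text, char delim) {
    std::istringstream in(text);
    std::vector<std::string> fields;
    for (std::string field; std::getline(in, field, delim); ) {
        fields.push_back(field);
    }
    return fields;
}

// Reads all lines and reports whether the last successful std::getline()
// stopped at end of input instead of at a newline.
bool last_line_unterminated(std::istream& in) {
    std::string line;
    bool hit_eof = false;
    while (std::getline(in, line)) {
        hit_eof = in.eof();
    }
    return hit_eof;
}
```

For the input "one\ntwo" the second helper should report true, and for "one\ntwo\n" it should report false, because in the latter case the final std::getline() call extracts nothing and fails.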

If so, how can I do that efficiently?

Streams are by their very nature single pass: you receive each character just once, and if you skip over one you consume it. Thus, you typically want to structure your data in such a way that you don't have to backtrack. That said, this isn't always possible, and most streams actually have a buffer under the hood to which characters can be returned. Since streams can be implemented by a user, there is no guarantee that characters can be returned. Even for the standard streams there isn't really a guarantee.

If you want to return a character, you have to put back exactly the character you extracted:

char c;
if (in >> c && c != 'a')
    in.putback(c);
if (in >> c && c != 'b')
    in.unget();

The latter function has slightly better performance because it doesn't have to check that the character is indeed the one which was extracted. It also has fewer chances to fail. Theoretically, you can put back as many characters as you want, but most streams won't support more than a few in all cases: if there is a buffer, the standard library takes care of "ungetting" all characters until the start of the buffer is reached. If another character is returned, it calls the virtual function std::streambuf::pbackfail(), which may or may not make more buffer space available. In the stream buffers I have implemented it will typically just fail, i.e. I typically don't override this function.
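A minimal sketch of one-character lookahead built on unget() follows. The helper accept is hypothetical, and the example relies on a std::istringstream, where returning a single character works in practice; as noted above, this is not guaranteed for arbitrary stream buffers.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Consumes the next non-whitespace character if it equals expected.
// Otherwise the character is handed back with unget(), so the next read
// sees it again. If no character could be read at all, the stream is left
// in its failed state.
bool accept(std::istream& in, char expected) {
    char c;
    if (!(in >> c)) {
        return false;
    }
    if (c == expected) {
        return true;
    }
    in.unget();
    return false;
}
```

With the input "[42]", accept(in, '(') should return false without consuming anything, after which accept(in, '['), reading an int and accept(in, ']') should all succeed.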

If not, is there a way to "store" (and restore) the complete status of an input stream: state and data?

If you mean to entirely restore the state you were at, including the characters, the answer is: sure there is... but there is no easy way. For example, you could implement a filtering stream buffer and put back characters as described above to restore the sequence to be read (or support seeking or explicitly setting a mark in the stream). For some streams you can use seeking, but not all streams support this. For example, std::cin typically doesn't support seeking.
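For streams that do support seeking, such as std::istringstream, restoring the read position could look like the sketch below. The function try_extract is a hypothetical helper; it relies on tellg() and seekg(), so it would not work on something like std::cin, and it restores neither format flags nor locale.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>

// Tries to extract a T. On failure the error flags are cleared and the read
// position is moved back to where it was before the attempt.
// Assumption: the stream supports seeking.
template <typename T>
bool try_extract(std::istream& in, T& value) {
    const std::istream::pos_type start = in.tellg();
    if (start == std::istream::pos_type(-1)) {
        return false;  // already failed, or the stream cannot report a position
    }
    T tmp{};
    if (in >> tmp) {
        value = std::move(tmp);
        return true;
    }
    in.clear();       // seekg() does nothing while failbit is set
    in.seekg(start);  // rewind to the position saved above
    return false;
}
```

For the input "-x 5", trying to extract an int fails after the minus sign has already been consumed. After the rewind, a std::string extraction should still see the complete token "-x".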

Restoring the characters is only half the story, though. The other things you want to restore are the state flags and any formatting data. In fact, if the stream went into a failed or even bad state, you need to clear the state flags before the stream will do most operations (although I think the formatting stuff can be reset anyway):

std::istream fmt(nullptr); // doesn't have a default constructor: create an invalid stream
fmt.copyfmt(in);           // save the current format settings
// use in
in.copyfmt(fmt);     // restore the original format settings

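The function copyfmt() copies all fields associated with the stream which are related to formatting. These are:

  • the locale
  • the fmtflags
  • the information storage iword() and pword()
  • the stream's events
  • the exceptions
  • the rest of the stream's formatting state (for example width(), precision(), fill() and the tied stream), but not the state flags returned by rdstate() or the stream buffer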

If you don't know about most of them, don't worry: you probably won't care about most of this stuff. Well, until you need it, but by then you have hopefully acquired some documentation and read about it (or asked and got a good response).
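The save-and-restore pattern with copyfmt() can be wrapped in a small RAII helper. FormatGuard and hex_then_decimal are hypothetical names for this sketch, which assumes the stream keeps its default exceptions() mask (copyfmt() also copies that mask, and a mask containing badbit would make the copy into the buffer-less stream throw).

```cpp
#include <iomanip>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical RAII helper: saves a stream's format settings on construction
// and restores them on destruction.
class FormatGuard {
public:
    explicit FormatGuard(std::ios& stream) : stream_(stream), saved_(nullptr) {
        saved_.copyfmt(stream_);  // saved_ has no stream buffer; it only stores settings
    }
    ~FormatGuard() { stream_.copyfmt(saved_); }
    FormatGuard(const FormatGuard&) = delete;
    FormatGuard& operator=(const FormatGuard&) = delete;

private:
    std::ios& stream_;
    std::ios saved_;
};

// Writes value in hexadecimal inside a guarded scope and then in the
// stream's original (decimal) format.
std::string hex_then_decimal(int value) {
    std::ostringstream out;
    {
        FormatGuard guard(out);
        out << std::hex << std::showbase << value;
    }
    out << ' ' << value;  // std::hex and std::showbase are no longer in effect
    return out.str();
}
```

For example, hex_then_decimal(255) should yield "0xff 255".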

What are the differences between failbit and badbit? When should we use one or the other?

Finally, a short one:

  • failbit is set when formatting errors are detected, e.g. a number is expected but the character 'T' is found.
  • badbit is set when something goes wrong in the stream's infrastructure. For example, when the stream buffer isn't set (as in the stream fmt above), the stream has std::ios_base::badbit set. The other reason is if an exception is thrown (and caught by way of the exceptions() mask; by default all exceptions are caught).
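The two cases from the list can be observed directly. In this sketch, StreamFlags and the three functions are hypothetical helpers; the flags are read from rdstate() rather than fail(), because fail() also returns true when only badbit is set.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Snapshot of the three error flags of a stream.
struct StreamFlags {
    bool fail;
    bool bad;
    bool eof;
};

StreamFlags flags_of(const std::ios& s) {
    const std::ios_base::iostate st = s.rdstate();
    return { (st & std::ios_base::failbit) != 0,
             (st & std::ios_base::badbit) != 0,
             (st & std::ios_base::eofbit) != 0 };
}

// Formatting problem: tries to read an int from text and reports the flags.
StreamFlags flags_after_int_read(const std::string& text) {
    std::istringstream in(text);
    int n = 0;
    in >> n;
    return flags_of(in);
}

// Infrastructure problem: a stream constructed without a stream buffer.
StreamFlags flags_without_buffer() {
    std::istream in(nullptr);
    return flags_of(in);
}
```

Reading an int from "T" should set failbit but not badbit. Reading from "42" should succeed and leave only eofbit set, because the extraction ran into the end of the input. The buffer-less stream should report badbit right after construction.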

Is there any online reference (or a book) that explains in depth how to deal with iostreams? Not just the basic stuff: the complete error handling.

Ah, yes, glad you asked. You probably want to get Nicolai Josuttis's "The C++ Standard Library". I know that this book describes all the details because I contributed to writing it. If you really want to know everything about IOStreams and locales, you want Angelika Langer & Klaus Kreft's "IOStreams and Locales". In case you wonder where I got the information from originally: this was Steve Teale's "IOStreams". I don't know if this book is still in print, and it is lacking a lot of the stuff which was introduced during standardization. Since I implemented my own version of IOStreams (and locales), I know about the extensions as well, though.
