有限状态机解析器 [英] Finite State Machine parser

查看:182
本文介绍了有限状态机解析器的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用C ++中的类似FSM的解析器解析自己设计的文件格式(这是一个教学自我 - c ++ - 硬的方式-by-doing-something-big-and-difficult 项目类型:))。我有一个带有换行符的标记字符串,表示euh ...行的结尾。有关输入示例,请参阅此处。所有的评论将会被垃圾过滤掉,所以我有一个这样的std :: string:

  global \\\
{ \\\
SOURCE_DIRS src \\\
HEADER_DIRS include \\\
SOURCES bitwise.c framing.c \\\
HEADERS ogg / os_types.h ogg / ogg.h \\\
} \\\
...

语法说明:




  • {}是范围,大小写的单词表示选项/文件的列表将遵循。

  • \\\
    在选项/文件列表中只是重要的,表示列表的结尾。



所以我以为FSM可以简单/可扩展,足够满足我的需求/知识。据我所知(我想要我的文件设计),我不需要并发状态或任何类似的东西。一些设计/实现问题:


  1. 我应该使用枚举 code> class +我的州的衍生品第一个可能对于较小的语法来说可能更好,但是后来可能会变得丑陋,第二个恰恰相反。我倾向于第一个,因为它的简单性。 枚举示例类示例。编辑:此建议对于 goto ,我以为他们在C ++中是邪恶的?

  2. 在阅读列表时,我不需要忽略 \\\
    。我通过 stringstream 使用字符串的首选方式将忽略 \\\
    默认。所以我需要简单的方式告诉(同样!) stringstream 在启用某个状态时不会忽略换行符。

  3. 简单的枚举状态足以进行多级解析(范围内的范围 {... {...} ...} )还是需要hacky的实现?

  4. 这里是我想到的草案状态:


    • upper :读取全局,exe,lib +目标名称...

    • 正常在范围内,可以读取SOURCES ...,创建用户变量...

    • 列表:将项目添加到列表,直到


每个范围都会有一种条件(例如win32 :global {gcc:CFLAGS = ...}),并且需要以完全相同的方式处理eveywhere(即使在列表状态,每个项目) p>

感谢任何输入。

解决方案

如果您有嵌套范围,一个 有限状态机是正确的方式,你应该看一个上下文免费语法解析器。 LL(1)解析器可以写成一组递归函数,或者<一个href =http://en.wikipedia.org/wiki/LALR_parser =noreferrer> LALR(1)解析器可以使用诸如Bison等解析器生成器来编写。



如果您向FSM添加堆栈,那么您将进入下推自动机领土。非确定性下推自动机相当于上下文无关语法(尽管确定性下推自动机是严格的LALR(1)解析器生成器实际上在内部生成确定性下推自动机。一个好的编译器设计教科书将涵盖从语法构建下推自动机的确切算法。 (这样,添加堆栈不是hacky。​​)这个维基百科文章还描述如何从您的语法构建LR(1)下推自动机,但是IMO,文章不是那么清楚。



如果你的范围只有有限的深度(即您有上限正常列表级别但是您没有嵌套列表或嵌套正常 s),那么您可以使用没有堆栈的FSM 。


I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a euh... line. See here for an input example. All the comments will and junk is filtered out, so I have a std::string like this:

global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...

Syntax explanation:

  • { } are scopes, and capitalized words signify that a list of options/files is to follow.
  • \n are only important in a list of options/files, signifying the end of the list.

So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:

  1. Should I use an enum or an abstract class + derivatives for my states? The first is probably better for small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity. enum example and class example. EDIT: what about this suggestion for goto, I thought they were evil in C++?
  2. When reading a list, I need to NOT ignore \n. My preferred way of using the string via stringstream, will ignore \n by default. So I need simple way of telling (the same!) stringstream to not ignore newlines when a certain state is enabled.
  3. Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
  4. Here's the draft states I have in mind:
    • upper: reads global, exe, lib+ target names...
    • normal: inside a scope, can read SOURCES..., create user variables...
    • list: adds items to a list until a newline is encountered.

Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion eveywhere (even in the list state, per item).

Thanks for any input.

解决方案

If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive funcitons, or an LALR(1) parser can be written using a parser generator such as Bison.

If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.

If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.

这篇关于有限状态机解析器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆