Hand coding a parser


Question

For all you compiler gurus: I want to write a recursive descent parser, and I want to do it with just code. No generating lexers and parsers from some other grammar, and don't tell me to read the dragon book; I'll come around to that eventually.

I want to get into the gritty details of implementing a lexer and parser for a reasonably simple language, say CSS. And I want to do this right.

This will probably end up being a series of questions but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.

I find myself writing code like this (hopefully you can infer the rest from this snippet):

public CssToken ReadNext()
{
    int val;
    while ((val = _reader.Read()) != -1)
    {
        var c = (char)val;
        switch (_stack.Top)
        {
            case ParserState.Init:
                if (c == ' ')
                {
                    continue; // ignore
                }
                else if (c == '.')
                {
                    _stack.Transition(ParserState.SubIdent, ParserState.Init);
                }
                break;

            case ParserState.SubIdent:
                if (c == '-')
                {
                    _token.Append(c);
                }
                _stack.Transition(ParserState.SubNMBegin);
                break;

What is this called? And how far off am I from something reasonably well understood? I'm trying to balance efficiency against ease of use; using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.

What I have is an input stream, from which I can read one character at a time. I don't do any lookahead right now; I just read a character and then, depending on the current state, try to do something with it.

I'd really like to get into the mindset of writing reusable snippets of code. This Transition method is currently meant to do that: it pops the current state off the stack and then pushes the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init), it "calls" a subroutine SubIdent which, when complete, returns to the Init state.
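The Transition mechanism described above might be sketched like this (a minimal illustration, not the questioner's actual code; the StateStack class and its members are hypothetical names):

```csharp
using System;
using System.Collections.Generic;

enum ParserState { Init, SubIdent, SubNMBegin }

// Hypothetical sketch of the state stack described above: Transition pops
// the current state and pushes its arguments in reverse order, so the
// first argument becomes the new top (the "called" state) and the
// remaining arguments act as return addresses.
class StateStack
{
    private readonly Stack<ParserState> _states = new Stack<ParserState>();

    public StateStack(ParserState initial) { _states.Push(initial); }

    public ParserState Top => _states.Peek();

    public void Transition(params ParserState[] states)
    {
        _states.Pop();
        for (int i = states.Length - 1; i >= 0; i--)
            _states.Push(states[i]);
    }

    // When a sub-state finishes, popping it resumes the "return" state
    // that was pushed after it.
    public void Pop() { _states.Pop(); }
}
```

With this, Transition(ParserState.SubIdent, ParserState.Init) leaves SubIdent on top; once SubIdent is popped, Init is the current state again.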

The parser will be implemented in much the same way. Currently, having everything in a single big method like this lets me easily return a token when I find one, but it also forces me to keep everything in that one big method. Is there a nice way to split these tokenization rules into separate methods?

Any input/advice on the matter would be greatly appreciated!

Answer

What you're writing is called a pushdown automaton. That is usually more power than you need to write a lexer, and it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and understand. Most parser generators generate pushdown automata.

Lexers are almost always written as finite state machines, i.e., like your code except without the "stack" object. Finite state machines are closely related to regular expressions (in fact, they're provably equivalent to one another). When designing such a lexer, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
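As a concrete illustration of that idea (a toy DFA, not the CSS tokenization rules; the states and token kinds are invented for this sketch), the machine steps through states and records the start and end of each token as it goes:

```csharp
using System;

// Toy DFA-style lexer sketch: states and token kinds are illustrative.
// The machine records the start index when a token begins and returns
// its (kind, start, length) when a transition leaves the accepting state.
static class DfaSketch
{
    enum State { Start, InIdent, InNumber }

    public static (string Kind, int Start, int Length) NextToken(string input, int pos)
    {
        var state = State.Start;
        int start = pos;
        for (int i = pos; i <= input.Length; i++)
        {
            // '\0' stands in for end-of-input so final tokens terminate.
            char c = i < input.Length ? input[i] : '\0';
            switch (state)
            {
                case State.Start:
                    if (char.IsWhiteSpace(c)) start = i + 1;  // skip, move token start
                    else if (char.IsLetter(c)) state = State.InIdent;
                    else if (char.IsDigit(c)) state = State.InNumber;
                    else return ("eof", i, 0);
                    break;
                case State.InIdent:
                    if (!char.IsLetterOrDigit(c)) return ("ident", start, i - start);
                    break;
                case State.InNumber:
                    if (!char.IsDigit(c)) return ("number", start, i - start);
                    break;
            }
        }
        return ("eof", input.Length, 0);
    }
}
```

A real lexer generated from regular expressions works the same way, just with a much larger transition table.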

There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.

If you do want to write your own lexer by hand, good ones often look something like this:

function readToken() // note: returns only one token each time
    while !eof
        c = peekChar()
        if c in A-Za-z
            return readIdentifier()
        else if c in 0-9
            return readInteger()
        else if c in ' \n\r\t\v\f'
            nextChar()
        ...
    return EOF

function readIdentifier()
    ident = ""
    while !eof
        c = nextChar()
        if c in A-Za-z0-9
            ident.append(c)
        else
            return Token(Identifier, ident)
            // or maybe...
            return Identifier(ident)
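The pseudocode above might translate to C# roughly as follows (a sketch under assumptions: the HandLexer class, its tuple-based token shape, and the TextReader plumbing are all illustrative choices, not part of the original answer):

```csharp
using System;
using System.IO;
using System.Text;

// Sketch of the pseudocode above: one dispatch method peeks a character
// and delegates to a per-token-kind method, which is exactly the
// "separate methods" structure the question asks about.
class HandLexer
{
    private readonly TextReader _reader;
    public HandLexer(TextReader reader) { _reader = reader; }

    public (string Kind, string Text) ReadToken() // returns one token per call
    {
        int val;
        while ((val = _reader.Peek()) != -1)
        {
            char c = (char)val;
            if (char.IsLetter(c)) return ReadIdentifier();
            if (char.IsDigit(c)) return ReadInteger();
            if (char.IsWhiteSpace(c)) _reader.Read();          // skip whitespace
            else { _reader.Read(); return ("punct", c.ToString()); }
        }
        return ("eof", "");
    }

    private (string, string) ReadIdentifier()
    {
        var sb = new StringBuilder();
        while (_reader.Peek() != -1 && char.IsLetterOrDigit((char)_reader.Peek()))
            sb.Append((char)_reader.Read());
        return ("ident", sb.ToString());
    }

    private (string, string) ReadInteger()
    {
        var sb = new StringBuilder();
        while (_reader.Peek() != -1 && char.IsDigit((char)_reader.Peek()))
            sb.Append((char)_reader.Read());
        return ("int", sb.ToString());
    }
}
```

Each ReadXxx method owns one tokenization rule, so rules can be added or tested independently instead of growing one big switch.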

Then you can write your parser as a recursive descent parser. Don't try to combine the lexer and parser stages into one; it leads to a total mess of code. (According to the Parsec author, it's slower, too.)
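A recursive descent parser over such a token stream has one method per grammar rule. As an illustration only (a toy CSS-like grammar and invented class names, not real CSS syntax):

```csharp
using System;
using System.Collections.Generic;

// Illustrative recursive descent parser over a pre-lexed token list for a
// toy grammar:  rule := IDENT '{' decl* '}'   decl := IDENT ':' IDENT ';'
// Each Parse method consumes exactly the tokens its grammar rule covers.
class TinyCssParser
{
    private readonly List<string> _tokens;
    private int _pos;
    public TinyCssParser(List<string> tokens) { _tokens = tokens; }

    private string Next() => _tokens[_pos++];
    private string Peek() => _pos < _tokens.Count ? _tokens[_pos] : "";

    private string Expect(string t)
    {
        var tok = Next();
        if (tok != t) throw new Exception($"expected '{t}', got '{tok}'");
        return tok;
    }

    // rule := IDENT '{' decl* '}'
    public (string Selector, List<(string, string)> Decls) ParseRule()
    {
        string selector = Next();
        Expect("{");
        var decls = new List<(string, string)>();
        while (Peek() != "}")
            decls.Add(ParseDecl());
        Expect("}");
        return (selector, decls);
    }

    // decl := IDENT ':' IDENT ';'
    private (string, string) ParseDecl()
    {
        string prop = Next();
        Expect(":");
        string value = Next();
        Expect(";");
        return (prop, value);
    }
}
```

Note the parser only ever asks the lexer for tokens; it never touches raw characters, which is what keeps the two stages cleanly separated.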
