Hand coding a parser



For all you compiler gurus: I want to write a recursive descent parser, and I want to do it with just code. No generating lexers and parsers from some other grammar, and don't tell me to read the dragon book; I'll come around to that eventually.

I want to get into the gritty details of implementing a lexer and parser for a reasonably simple language, say CSS. And I want to do this right.

This will probably end up being a series of questions, but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.

I find myself writing code like this (hopefully you can infer the rest from this snippet):

public CssToken ReadNext()
{
    int val;
    while ((val = _reader.Read()) != -1)
    {
        var c = (char)val;
        switch (_stack.Top)
        {
            case ParserState.Init:
                if (c == ' ')
                {
                    continue; // ignore
                }
                else if (c == '.')
                {
                    _stack.Transition(ParserState.SubIdent, ParserState.Init);
                }
                break;

            case ParserState.SubIdent:
                if (c == '-')
                {
                    _token.Append(c);
                }
                _stack.Transition(ParserState.SubNMBegin);
                break;

What is this called? And how far off am I from something reasonably well understood? I'm trying to balance efficiency against ease of use; using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.

What I have is an input stream, from which I can read one character at a time. I don't do any lookahead right now; I just read a character and then, depending on the current state, try to do something with it.

I'd really like to get into the mindset of writing reusable snippets of code. The Transition method is currently meant to do that: it pops the current state off the stack and then pushes the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init), it will "call" a subroutine SubIdent which will, when complete, return to the Init state.
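In sketch form, the pop-then-push-in-reverse behavior described here looks something like this (a minimal hypothetical version in Python, not the actual C# class):

```python
class StateStack:
    """Minimal sketch of the stack-of-states idea described above.

    transition() pops the current state and pushes its arguments in
    reverse order, so the first argument becomes the new top (the
    "called" state) and the remaining ones are "returned" to later.
    """

    def __init__(self, initial):
        self._states = [initial]

    @property
    def top(self):
        return self._states[-1]

    def transition(self, *states):
        self._states.pop()                     # leave the current state
        self._states.extend(reversed(states))  # first argument ends up on top

    def pop(self):
        return self._states.pop()


stack = StateStack("Init")
stack.transition("SubIdent", "Init")  # "call" SubIdent, then return to Init
assert stack.top == "SubIdent"
stack.pop()                           # SubIdent finishes
assert stack.top == "Init"
```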

The parser will be implemented in much the same way. Currently, having everything in a single big method like this lets me easily return a token when I find one, but it also forces me to keep everything in one big method. Is there a nice way to split these tokenization rules into separate methods?

Any input/advice on the matter would be greatly appreciated!

Solution

What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer, and it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automata.

Lexers are almost always written as finite state machines, i.e., like your code but without the "stack" object. Finite state machines are closely related to regular expressions (in fact, they're provably equivalent to one another). When designing such a lexer, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
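As a concrete illustration (a hand-rolled sketch, not generated code), a DFA derived from the regular expression `[A-Za-z][A-Za-z0-9]*` can be walked directly over the input, recording the start and end offset of each accepting run:

```python
def scan_identifiers(text):
    """Sketch: run a DFA for [A-Za-z][A-Za-z0-9]* over `text`,
    recording (start, end, lexeme) for every identifier token found."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isalpha():          # start state: a letter enters the DFA
            start = i                  # record where the token begins
            i += 1
            while i < n and (text[i].isalpha() or text[i].isdigit()):
                i += 1                 # stay in the accepting state
            tokens.append((start, i, text[start:i]))  # record where it ends
        else:
            i += 1                     # skip characters the DFA rejects
    return tokens
```

For example, `scan_identifiers("color: red2")` yields `[(0, 5, "color"), (7, 11, "red2")]`.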

There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.

If you do want to write your own lexer by hand, good ones often look something like this:

function readToken() // note: returns only one token each time
    while !eof
        c = peekChar()
        if c in A-Za-z
            return readIdentifier()
        else if c in 0-9
            return readInteger()
        else if c in ' \n\r\t\v\f'
            nextChar()
        ...
    return EOF

function readIdentifier()
    ident = ""
    while !eof
        c = nextChar()
        if c in A-Za-z0-9
            ident.append(c)
        else
            return Token(Identifier, ident)
            // or maybe...
            return Identifier(ident)
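A runnable version of that pseudocode might look like the following (a Python sketch; the token kinds, the `Lexer` class name, and the `("KIND", value)` tuple shape are all illustrative assumptions, not part of the original answer):

```python
import io

EOF = ("EOF", None)

class Lexer:
    """Runnable sketch of the readToken / readIdentifier pseudocode above."""

    def __init__(self, text):
        self._reader = io.StringIO(text)
        self._peeked = None

    def _peek(self):
        if self._peeked is None:
            self._peeked = self._reader.read(1)  # '' at end of input
        return self._peeked

    def _next(self):
        c = self._peek()
        self._peeked = None
        return c

    def read_token(self):  # note: returns only one token each time
        while self._peek():
            c = self._peek()
            if c.isalpha():
                return self._read_identifier()
            elif c.isdigit():
                return self._read_integer()
            elif c in " \n\r\t\v\f":
                self._next()                     # skip whitespace
            else:
                return ("PUNCT", self._next())   # single-char fallback
        return EOF

    def _read_identifier(self):
        ident = ""
        while self._peek() and self._peek().isalnum():
            ident += self._next()
        return ("IDENT", ident)

    def _read_integer(self):
        digits = ""
        while self._peek() and self._peek().isdigit():
            digits += self._next()
        return ("INT", int(digits))
```

Note how each token rule lives in its own method, which answers the asker's question about splitting the one big method apart.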

Then you can write your parser as a recursive descent parser. Don't try to combine the lexer and parser stages into one; it leads to a total mess of code. (According to the Parsec author, it's slower, too.)
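To make the recursive descent side concrete, here is a sketch over a stream of `("KIND", value)` token tuples for a toy grammar (a comma-separated identifier list, chosen for illustration; this is not the CSS grammar):

```python
def parse_ident_list(tokens):
    """Recursive-descent sketch for the toy grammar:
         list ::= IDENT (',' IDENT)*
    Each nonterminal becomes one function; terminals are consumed
    by checking the current token's kind and advancing past it."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", None)

    def expect(kind):
        nonlocal pos
        tok = peek()
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind}, got {tok}")
        pos += 1
        return tok

    idents = [expect("IDENT")[1]]          # list ::= IDENT ...
    while peek() == ("PUNCT", ","):        # ...  (',' IDENT)*
        expect("PUNCT")
        idents.append(expect("IDENT")[1])
    return idents
```

For example, `parse_ident_list([("IDENT", "a"), ("PUNCT", ","), ("IDENT", "b")])` returns `["a", "b"]`. The same one-function-per-nonterminal shape scales up to a real grammar.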
