Making a Lexical Analyzer


Problem Description


I'm working on a lexical analyzer program right now, in Java. I've been researching this problem but so far I haven't found an answer. Here's my problem:

Input:

System.out.println ("Hello World");

Desired output:

Lexeme                 Token

System                 [Key_Word]
.                      [Object_Accessor]
out                    [Key_Word]
.                      [Object_Accessor]
println                [Key_Word]
(                      [left_Parenthesis]
"Hello World"          [String_Literal]
)                      [right_Parenthesis]
;                      [statement_separator]


I'm still a beginner so I hope you guys can help me on this. Thanks.

Answer


You need neither ANTLR nor the Dragon book to write a simple lexical analyzer by hand. Even lexical analyzers for fuller languages (like Java) aren't terribly complicated to write by hand. Obviously if you have an industrial task you might want to consider industrial strength tools like ANTLR or some lex variant, but for the sake of learning how lexical analysis works, writing one by hand would likely prove to be a useful exercise. I'm assuming that this is the case, since you said you're still a beginner.


Here's a simple lexical analyzer, written in Java, for a subset of a Scheme-like language, that I wrote after seeing this question. I think the code is relatively easy to understand even if you've never seen a lexer before, simply because breaking a stream of characters (in this case a String) into a stream of tokens (in this case a List<Token>) isn't that hard. If you have questions I can try to explain in more depth.

import java.util.List;
import java.util.ArrayList;

/*
 * Lexical analyzer for Scheme-like minilanguage:
 * (define (foo x) (bar (baz x)))
 */
public class Lexer {
    public static enum Type {
        // This Scheme-like language has three token types:
        // open parens, close parens, and an "atom" type
        LPAREN, RPAREN, ATOM;
    }
    public static class Token {
        public final Type t;
        public final String c; // contents mainly for atom tokens
        // could have column and line number fields too, for reporting errors later
        public Token(Type t, String c) {
            this.t = t;
            this.c = c;
        }
        public String toString() {
            if(t == Type.ATOM) {
                return "ATOM<" + c + ">";
            }
            return t.toString();
        }
    }

    /*
     * Given a String and an index, return the atom (a maximal run of
     * letters) starting at that index.
     */
    public static String getAtom(String s, int i) {
        int j = i;
        while (j < s.length() && Character.isLetter(s.charAt(j))) {
            j++;
        }
        return s.substring(i, j);
    }

    public static List<Token> lex(String input) {
        List<Token> result = new ArrayList<Token>();
        for(int i = 0; i < input.length(); ) {
            switch(input.charAt(i)) {
            case '(':
                result.add(new Token(Type.LPAREN, "("));
                i++;
                break;
            case ')':
                result.add(new Token(Type.RPAREN, ")"));
                i++;
                break;
            default:
                if(Character.isWhitespace(input.charAt(i))) {
                    i++;
                } else {
                    String atom = getAtom(input, i);
                    i += atom.length();
                    result.add(new Token(Type.ATOM, atom));
                }
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        if(args.length < 1) {
            System.out.println("Usage: java Lexer \"((some Scheme) (code to) lex)\".");
            return;
        }
        List<Token> tokens = lex(args[0]);
        for(Token t : tokens) {
            System.out.println(t);
        }
    }
}

Example usage:

~/code/scratch $ java Lexer ""
~/code/scratch $ java Lexer "("
LPAREN
~/code/scratch $ java Lexer "()"
LPAREN
RPAREN
~/code/scratch $ java Lexer "(foo)"
LPAREN
ATOM<foo>
RPAREN
~/code/scratch $ java Lexer "(foo bar)"
LPAREN
ATOM<foo>
ATOM<bar>
RPAREN
~/code/scratch $ java Lexer "(foo (bar))"
LPAREN
ATOM<foo>
LPAREN
ATOM<bar>
RPAREN
RPAREN
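For what it's worth, the same hand-written approach adapts directly to the token categories in your example. This is only a rough sketch under my own assumptions: `MiniJavaLexer` and its classification rules are made up for illustration, and I've mirrored your token labels verbatim (in real Java, identifiers like `System` and `println` are not keywords, but I've kept your `Key_Word` labeling):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: classifies tokens using the (hypothetical)
// categories from the question, not Java's real lexical grammar.
public class MiniJavaLexer {
    public static List<String[]> lex(String input) {
        List<String[]> result = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '.') {
                result.add(new String[]{".", "Object_Accessor"}); i++;
            } else if (c == '(') {
                result.add(new String[]{"(", "left_Parenthesis"}); i++;
            } else if (c == ')') {
                result.add(new String[]{")", "right_Parenthesis"}); i++;
            } else if (c == ';') {
                result.add(new String[]{";", "statement_separator"}); i++;
            } else if (c == '"') {
                // String literal: scan to the closing quote, keeping both quotes
                int j = i + 1;
                while (j < input.length() && input.charAt(j) != '"') j++;
                j = Math.min(j + 1, input.length());
                result.add(new String[]{input.substring(i, j), "String_Literal"});
                i = j;
            } else if (Character.isLetter(c)) {
                // Identifier/keyword: a run of letters and digits
                int j = i;
                while (j < input.length() && Character.isLetterOrDigit(input.charAt(j))) j++;
                result.add(new String[]{input.substring(i, j), "Key_Word"});
                i = j;
            } else {
                i++; // skip characters this sketch doesn't handle
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] t : lex("System.out.println (\"Hello World\");")) {
            System.out.println(t[0] + " [" + t[1] + "]");
        }
    }
}
```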


Once you've written one or two simple lexers like this, you will get a pretty good idea of how this problem decomposes. Then it would be interesting to explore how to use automated tools like lex. The theory behind regular-expression-based matchers is not too difficult, but it does take a while to fully understand. I think writing lexers by hand motivates that study and helps you come to grips with the problem better than diving straight into the theory of converting regular expressions to finite automata (first NFAs, then NFAs to DFAs), etc. Wading into that theory can be a lot to take in at once, and it is easy to get overwhelmed.


Personally, while the Dragon book is good and very thorough, the coverage might not be the easiest to understand because it aims to be complete, not necessarily accessible. You might want to try some other compiler texts before opening up the Dragon book. Here are a few free books, which have pretty good introductory coverage, IMHO:

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

http://www.diku.dk/~torbenm/Basics/


Some articles on the implementation of regular expressions (automated lexical analysis usually uses regular expressions)

http://swtch.com/~rsc/regexp/
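To give a flavor of what those articles describe, here is a minimal sketch of regular-expression-driven matching, which is roughly what generated lexers automate. `RegexLexer` and its pattern are made up for illustration; each named group stands for one token class of the Scheme-like minilanguage above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: one alternative per token class, with named
// groups identifying which alternative matched.
public class RegexLexer {
    private static final Pattern TOKENS = Pattern.compile(
        "(?<LPAREN>\\()|(?<RPAREN>\\))|(?<ATOM>[A-Za-z]+)|(?<WS>\\s+)");

    public static List<String> lex(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        // Note: find() silently skips characters no alternative matches;
        // a real lexer would report those as errors.
        while (m.find()) {
            if (m.group("WS") != null) continue; // drop whitespace
            if (m.group("LPAREN") != null) result.add("LPAREN");
            else if (m.group("RPAREN") != null) result.add("RPAREN");
            else result.add("ATOM<" + m.group("ATOM") + ">");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String t : lex("(define (foo x))")) {
            System.out.println(t);
        }
    }
}
```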


I hope that helps. Good luck.

