How do I tokenize input using Java's Scanner class and regular expressions?


Question


Just for my own purposes, I'm trying to build a tokenizer in Java where I can define a regular grammar and have it tokenize input based on that. The StringTokenizer class is deprecated, and I've found a couple functions in Scanner that hint towards what I want to do, but no luck yet. Anyone know a good way of going about this?

Answer


The name "Scanner" is a bit misleading, because the word is often used to mean a lexical analyzer, and that's not what Scanner is for. All it is is a substitute for the scanf() function you find in C, Perl, et al. Like StringTokenizer and split(), it's designed to scan ahead until it finds a match for a given pattern, and whatever it skipped over on the way is returned as a token.
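To see that scanf-style behavior concretely, here is a minimal sketch (the input string and comma delimiter are invented for illustration, not part of the original answer):

```java
import java.util.Scanner;

public class ScanDemo
{
  public static void main(String[] args)
  {
    // Scanner skips ahead to the next delimiter and hands back whatever
    // it passed over as the token: scanf-style, not lexer-style.
    Scanner sc = new Scanner("alpha,beta,42").useDelimiter(",");
    System.out.println(sc.next());     // alpha
    System.out.println(sc.next());     // beta
    System.out.println(sc.nextInt());  // 42
  }
}
```

Nothing here classifies characters; Scanner only divides the input at delimiters it was told to expect.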


A lexical analyzer, on the other hand, has to examine and classify every character, even if it's only to decide whether it can safely ignore them. That means, after each match, it may apply several patterns until it finds one that matches starting at that point. Otherwise, it may find the sequence "//" and think it's found the beginning of a comment, when it's really inside a string literal and it just failed to notice the opening quotation mark.
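That pitfall is easy to reproduce with a pattern-first approach. This sketch (the input line is invented for illustration) cuts at the first "//" even though it sits inside a string literal:

```java
public class NaiveStrip
{
  public static void main(String[] args)
  {
    // Naive comment stripping: cut at the first "//" without tracking
    // whether we are inside a string literal.
    String line = "String url = \"http://example.com\"; // real comment";
    String stripped = line.substring(0, line.indexOf("//"));
    // The "//" inside the URL is found first, mangling the statement.
    System.out.println(stripped);  // prints: String url = "http:
  }
}
```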


It's actually much more complicated than that, of course, but I'm just illustrating why the built-in tools like StringTokenizer, split() and Scanner aren't suitable for this kind of task. It is, however, possible to use Java's regex classes for a limited form of lexical analysis. In fact, the addition of the Scanner class made it much easier, because of the new Matcher API that was added to support it, i.e., regions and the usePattern() method. Here's an example of a rudimentary scanner built on top of Java's regex classes.

import java.util.*;
import java.util.regex.*;

public class RETokenizer
{
  static List<Token> tokenize(String source, List<Rule> rules)
  {
    List<Token> tokens = new ArrayList<Token>();
    int pos = 0;
    final int end = source.length();
    // Start with a throwaway pattern; usePattern() swaps in each rule's
    // pattern as we go. Transparent, non-anchoring bounds keep ^ and $
    // from matching at the region's edges.
    Matcher m = Pattern.compile("dummy").matcher(source);
    m.useTransparentBounds(true).useAnchoringBounds(false);
    while (pos < end)
    {
      m.region(pos, end);
      boolean matched = false;
      for (Rule r : rules)
      {
        // lookingAt() anchors the match at the start of the region
        if (m.usePattern(r.pattern).lookingAt())
        {
          tokens.add(new Token(r.name, m.start(), m.end()));
          pos = m.end();
          matched = true;
          break;
        }
      }
      if (!matched)
      {
        pos++;  // bump along one character when no rule matched
      }
    }
    return tokens;
  }

  static class Rule
  {
    final String name;
    final Pattern pattern;

    Rule(String name, String regex)
    {
      this.name = name;
      pattern = Pattern.compile(regex);
    }
  }

  static class Token
  {
    final String name;
    final int startPos;
    final int endPos;

    Token(String name, int startPos, int endPos)
    {
      this.name = name;
      this.startPos = startPos;
      this.endPos = endPos;
    }

    @Override
    public String toString()
    {
      return String.format("Token [%2d, %2d, %s]", startPos, endPos, name);
    }
  }

  public static void main(String[] args) throws Exception
  {
    List<Rule> rules = new ArrayList<Rule>();
    rules.add(new Rule("WORD", "[A-Za-z]+"));
    rules.add(new Rule("QUOTED", "\"[^\"]*+\""));
    rules.add(new Rule("COMMENT", "//.*"));
    rules.add(new Rule("WHITESPACE", "\\s+"));

    String str = "foo //in \"comment\"\nbar \"no //comment\" end";
    List<Token> result = RETokenizer.tokenize(str, rules);
    for (Token t : result)
    {
      System.out.println(t);
    }
  }
}


This, by the way, is the only good use I've ever found for the lookingAt() method. :D

