Parse arbitrary delimiter character using Antlr4

Published: 2021/11/11 4:06:23 Tags: regex, antlr4

Question

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?

To be clear: my problem is not the regular expression itself (which I do not handle in Antlr but in the visitor), but the delimiter characters. I can easily define the following lexer rules:

REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;

This uses the forward slash as the delimiter (as is common in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).

If I had to solve this with a regular expression itself, I would use a backreference like this:

m(.)(.+?)\1

(that is, an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences do not seem to be available in Antlr4.
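For comparison, here is a minimal sketch of that backreference idea outside Antlr, in plain Java with java.util.regex. The class and method names (BackreferenceSketch, extractBody) are made up for illustration, and the sketch does not handle an escaped delimiter inside the expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackreferenceSketch {
    // "m", one arbitrary delimiter (group 1), a lazy body (group 2),
    // then the same delimiter again via the backreference \1.
    private static final Pattern M_REGEX = Pattern.compile("m(.)(.+?)\\1");

    /** Returns the text between the delimiters, or null if the whole input does not match. */
    static String extractBody(String input) {
        Matcher m = M_REGEX.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractBody("m~a/b~")); // a/b
        System.out.println(extractBody("m/foo/")); // foo
        System.out.println(extractBody("m~foo/")); // null (closing delimiter differs)
    }
}
```

This is exactly the behavior the lexer has to reproduce: the closing character is only known once the opening character has been read.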

It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the variants.
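The enumeration could look roughly like the following lookup, written in plain Java for illustration. The name closingFor is hypothetical, and the [ ] and < > pairs are an assumption based on Perl's bracketing delimiters; the question itself only mentions ( ) and { }.

```java
final class BracketPairSketch {
    /**
     * Returns the closing delimiter for an opening one: the partner bracket
     * for the enumerated pairs, otherwise the same character again.
     */
    static char closingFor(char open) {
        switch (open) {
            case '(': return ')';
            case '{': return '}';
            case '[': return ']'; // assumption: not mentioned in the question
            case '<': return '>'; // assumption: not mentioned in the question
            default:  return open; // m~regexp~ style: closes with itself
        }
    }

    public static void main(String[] args) {
        System.out.println(closingFor('{')); // }
        System.out.println(closingFor('~')); // ~
    }
}
```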

Can this be solved with Antlr4?

Recommended answer

You could do something like this:

lexer grammar TLexer;

REGEX
 : REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
 | '{' REGEX_ATOM+ '}'
 | '(' REGEX_ATOM+ ')'
 ;

ANY
 : .
 ;

fragment REGEX_DELIMITER
 : [/~@#]
 ;

fragment REGEX_ATOM
 : '\\' .
 | ~[\\]
 ;

If you run the following class:

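import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {

  public static void main(String[] args) throws Exception {

    // ANTLRInputStream matches the tested 4.5.3 runtime; ANTLR 4.7+ deprecates it in favor of CharStreams.fromString(...).
    TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));

    // Print each token's symbolic name followed by its text.
    for (Token t : lexer.getAllTokens()) {
      System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
    }
  }
}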

you will see the following output:

REGEX                /foo/
ANY                   
ANY                  /
ANY                  b
ANY                  a
ANY                  r
ANY                  \
ANY                   
REGEX                ~\~~
ANY                   
REGEX                {mu}
ANY                   
ANY                  (
ANY                  b
ANY                  l
ANY                  a
ANY                  (

The {...}? is called a predicate:

The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to keep matching characters as long as the character matched by REGEX_DELIMITER is not next in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter equal to the first character (which is a REGEX_DELIMITER, of course).
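To make the effect of the two predicates easier to follow, here is a hand-written approximation of the first REGEX alternative in plain Java, without the ANTLR runtime. It is only a sketch: the names DelimiterScanSketch and matchRegex are hypothetical, and it imitates the rule's logic rather than ANTLR's actual prediction machinery.

```java
final class DelimiterScanSketch {
    // Same characters as the REGEX_DELIMITER fragment: [/~@#]
    private static final String DELIMITERS = "/~@#";

    /**
     * Tries to match a delimited regex starting at index start.
     * Returns the end index (exclusive) of the token, or -1 if nothing matches.
     */
    static int matchRegex(String input, int start) {
        if (start >= input.length()) return -1;
        char delimiter = input.charAt(start);
        if (DELIMITERS.indexOf(delimiter) < 0) return -1;

        int pos = start + 1;
        int atoms = 0;
        // Mirrors ( {first char != LA(1)}? REGEX_ATOM )+ :
        // consume atoms while the next character is not the delimiter.
        while (pos < input.length() && input.charAt(pos) != delimiter) {
            if (input.charAt(pos) == '\\') {
                if (pos + 1 >= input.length()) return -1; // dangling backslash
                pos += 2; // '\\' . consumes the escaped character too
            } else {
                pos += 1; // ~[\\] consumes one ordinary character
            }
            atoms++;
        }
        // Mirrors {first char == LA(1)}? . :
        // at least one atom, and the closing delimiter must really be there.
        if (atoms == 0 || pos >= input.length()) return -1;
        return pos + 1;
    }

    public static void main(String[] args) {
        String input = "/foo/ /bar\\ ~\\~~ {mu} (bla(";
        System.out.println(input.substring(0, matchRegex(input, 0)));   // /foo/
        System.out.println(matchRegex(input, 6));                       // -1
        System.out.println(input.substring(12, matchRegex(input, 12))); // ~\~~
    }
}
```

The second call fails because no closing / follows /bar\ in the input, which is why the lexer output above falls back to ANY tokens at that point.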

Tested with ANTLR 4.5.3

To also support a delimiter preceded by m plus some optional spaces, you could try something like this (untested!):

lexer grammar TLexer;

  @lexer::members {
    boolean delimiterAhead(String start) {
      return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
    }
  }

  REGEX
   : '/' ( '\\' . | ~[/\\] )+ '/'
   | 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
   | 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
   | 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
   ;

  ANY
   : .
   ;

  fragment REGEX_DELIMITER
   : [~@#]
   ;

  fragment SPACES
   : [ \t]+
   ;
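Since that grammar is untested, it may help to check the string logic of delimiterAhead on its own. The sketch below copies the expression into a plain Java method and passes the lookahead in as a parameter instead of reading _input.LA(1); the class name DelimiterAheadSketch is made up for illustration.

```java
final class DelimiterAheadSketch {
    /**
     * Same string logic as delimiterAhead in the grammar: strip the leading
     * "m" and optional spaces/tabs, then compare the delimiter with the lookahead.
     */
    static boolean delimiterAhead(String matchedSoFar, int lookahead) {
        return matchedSoFar.replaceAll("^m[ \t]*", "").charAt(0) == lookahead;
    }

    public static void main(String[] args) {
        System.out.println(delimiterAhead("m ~ab", '~'));  // true
        System.out.println(delimiterAhead("m ~ab", 'c'));  // false
        System.out.println(delimiterAhead("m\t#x", '#'));  // true
    }
}
```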
