Parse arbitrary delimiter character using Antlr4

Views: 93 Published: 2021/4/7 20:28:23 Tags: regex antlr4

Question

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?

To be clear: my problem is not the regular expression itself (which I do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following lexer rules:

REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;

This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).

If I had to solve this using a regular expression itself, I would use a backreference like this:

m(.)(.+?)\1

(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
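For comparison, the backreference approach can be tried directly with java.util.regex. This is only a sketch of what the pattern above does outside of Antlr; the class name BackrefDemo and the helper body are hypothetical, and the pattern does not handle escaped delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Group 1 captures the delimiter, group 2 the non-empty body;
    // \1 requires the same delimiter character to close the literal.
    private static final Pattern M_REGEX = Pattern.compile("m(.)(.+?)\\1");

    /** Returns the text between the delimiters, or null if the whole input is not m<d>body<d>. */
    public static String body(String input) {
        Matcher matcher = M_REGEX.matcher(input);
        return matcher.matches() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(body("m~a/b~")); // a/b
        System.out.println(body("m/foo/")); // foo
        System.out.println(body("m~foo/")); // null (closing delimiter differs)
    }
}
```

The lexer has to reproduce the effect of \1 by other means, since the closing character depends on what was matched earlier in the same token.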

It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the different variants.
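Enumerating the bracket variants amounts to a small lookup from opening to closing character. A minimal sketch in plain Java, assuming only the two pairs mentioned above plus the square and angle brackets that Perl also accepts; the class and method names are hypothetical:

```java
class ClosingDelimiter {
    /** Returns the character expected to close a literal opened with 'open'. */
    public static char closingFor(char open) {
        switch (open) {
            case '(': return ')';
            case '{': return '}';
            case '[': return ']'; // assumption: not mentioned in the question, but valid in Perl
            case '<': return '>'; // assumption: not mentioned in the question, but valid in Perl
            default:  return open; // any other delimiter closes with itself
        }
    }

    public static void main(String[] args) {
        System.out.println(closingFor('{')); // }
        System.out.println(closingFor('~')); // ~
    }
}
```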

Can this be solved with Antlr4?

Recommended answer

You could do something like this:

lexer grammar TLexer;

REGEX
 : REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
 | '{' REGEX_ATOM+ '}'
 | '(' REGEX_ATOM+ ')'
 ;

ANY
 : .
 ;

fragment REGEX_DELIMITER
 : [/~@#]
 ;

fragment REGEX_ATOM
 : '\\' .
 | ~[\\]
 ;

If you run the following class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {

  public static void main(String[] args) throws Exception {

    TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));

    for (Token t : lexer.getAllTokens()) {
      System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
    }
  }
}

you will see the following output:

REGEX                /foo/
ANY                   
ANY                  /
ANY                  b
ANY                  a
ANY                  r
ANY                  \
ANY                   
REGEX                ~\~~
ANY                   
REGEX                {mu}
ANY                   
ANY                  (
ANY                  b
ANY                  l
ANY                  a
ANY                  (

The { ... }? parts are called predicates:

The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter that matches the first character (which is a REGEX_DELIMITER, of course).
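The effect of the two predicates can be imitated by a hand-written scan in plain Java. This is a rough sketch of the logic only, not the code Antlr generates; the class DelimiterScan and its scan method are hypothetical names, and the delimiter set is copied from REGEX_DELIMITER above.

```java
class DelimiterScan {
    /**
     * Returns the index just past the closing delimiter of a literal that
     * starts at 'start', or -1 if no complete literal begins there.
     */
    public static int scan(String input, int start) {
        if (start >= input.length()) return -1;
        char delimiter = input.charAt(start);          // role of getText().charAt(0)
        if ("/~@#".indexOf(delimiter) < 0) return -1;  // REGEX_DELIMITER
        int i = start + 1;
        int atoms = 0;
        // First predicate: keep consuming atoms while the delimiter is not ahead.
        while (i < input.length() && input.charAt(i) != delimiter) {
            if (input.charAt(i) == '\\') {
                if (i + 1 >= input.length()) return -1; // dangling backslash
                i += 2;                                 // REGEX_ATOM: '\\' .
            } else {
                i += 1;                                 // REGEX_ATOM: ~[\\]
            }
            atoms++;
        }
        // Second predicate: the next character must be the opening delimiter again.
        boolean closed = i < input.length() && input.charAt(i) == delimiter;
        return (atoms > 0 && closed) ? i + 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(scan("/foo/ rest", 0)); // 5
        System.out.println(scan("~\\~~", 0));      // 4 (the escaped ~ does not close the literal)
        System.out.println(scan("/bar", 0));       // -1 (no closing delimiter)
    }
}
```

As in the grammar, an escaped delimiter is consumed as a single atom, so it never satisfies the closing check.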

Tested with ANTLR 4.5.3

And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):

lexer grammar TLexer;

@lexer::members {
  boolean delimiterAhead(String start) {
    return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
  }
}

REGEX
 : '/' ( '\\' . | ~[/\\] )+ '/'
 | 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
 | 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
 | 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
 ;

ANY
 : .
 ;

fragment REGEX_DELIMITER
 : [~@#]
 ;

fragment SPACES
 : [ \t]+
 ;
