ANTLR does not match the patterns properly when there are similar patterns


Problem description

I am using ANTLR to parse some queries.

Here is my ANTLR grammar (g4):

propTest
  : objectPath NOT? (EQ|NEQ) primitiveLiteral    # propTestEqual
  | objectPath NOT? (EQ|NEQ) 'wwww'              # propTestThlEqual
  ;

primitiveLiteral
  : orderableLiteral
  | BoolLiteral
  ;

primitiveLiteral
  : orderableLiteral
  ;

orderableLiteral
  : StringLiteral
  ;

StringLiteral
  : QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
  ;

The problem appears when I feed it:

[network-traffic:src_port = '123]

I expect the match to happen on

: objectPath NOT? (EQ|NEQ) primitiveLiteral       # propTestEqual

but it does not match anything. As soon as I remove

| objectPath NOT? (EQ|NEQ) 'wwww'   # propTestThlEqual

the match happens on

: objectPath NOT? (EQ|NEQ) primitiveLiteral       # propTestEqual

Any idea what is going on?

Update (the full grammar):

grammar STIXPattern;

pattern
  : observationExpressions EOF
  ;

observationExpressions
  : <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
  | observationExpressionOr                                               #observationExpressionOr_
  ;

observationExpressionOr
  : <assoc=left> observationExpressionOr OR observationExpressionOr     #observationExpressionOred
  | observationExpressionAnd                                            #observationExpressionAnd_
  ;

observationExpressionAnd
  : <assoc=left> observationExpressionAnd AND observationExpressionAnd  #observationExpressionAnded
  | observationExpression                                               #observationExpression_
  ;

observationExpression
  : LBRACK comparisonExpression RBRACK        # observationExpressionSimple
  | LPAREN observationExpressions RPAREN      # observationExpressionCompound
  | observationExpression startStopQualifier  # observationExpressionStartStop
  | observationExpression withinQualifier     # observationExpressionWithin
  | observationExpression repeatedQualifier   # observationExpressionRepeated
  ;

comparisonExpression
  : <assoc=left> comparisonExpression OR comparisonExpression         #comparisonExpressionOred
  | comparisonExpressionAnd                                           #comparisonExpressionAnd_
  ;

comparisonExpressionAnd
  : <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd  #comparisonExpressionAnded
  | propTest                                                          #comparisonExpressionAndpropTest
  ;

propTest
  : objectPath NOT? (EQ|NEQ) primitiveLiteral       # propTestEqual
  | objectPath NOT? (EQ|NEQ) objectPathThl    # propTestThlEqual

  ;

startStopQualifier
  : START TimestampLiteral STOP TimestampLiteral
  ;

withinQualifier
  : WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
  ;

repeatedQualifier
  : REPEATS IntPosLiteral TIMES
  ;

objectPath
  : objectType COLON firstPathComponent objectPathComponent?
  ;

objectPathThl
  : varThlType DOT firstPathComponent objectPathComponent?
  ;

objectType
  : IdentifierWithoutHyphen
  | IdentifierWithHyphen
  ;

varThlType
  : IdentifierWithoutHyphen
  | IdentifierWithHyphen
  ;

firstPathComponent
  : IdentifierWithoutHyphen
  | StringLiteral
  ;

objectPathComponent
  : <assoc=left> objectPathComponent objectPathComponent  # pathStep
  | '.' (IdentifierWithoutHyphen | StringLiteral)         # keyPathStep
  | LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK  # indexPathStep
  ;

setLiteral
  : LPAREN RPAREN
  | LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
  ;

primitiveLiteral
  : orderableLiteral
  | BoolLiteral
  ;

orderableLiteral
  : IntPosLiteral
  | IntNegLiteral
  | FloatPosLiteral
  | FloatNegLiteral
  | StringLiteral
  | BinaryLiteral
  | HexLiteral
  | TimestampLiteral
  ;

IntNegLiteral :
  '-' ('0' | [1-9] [0-9]*)
  ;

IntPosLiteral :
  '+'? ('0' | [1-9] [0-9]*)
  ;

FloatNegLiteral :
  '-' [0-9]* '.' [0-9]+
  ;

FloatPosLiteral :
  '+'? [0-9]* '.' [0-9]+
  ;

HexLiteral :
  'h' QUOTE TwoHexDigits* QUOTE
  ;

BinaryLiteral :
  'b' QUOTE
  ( Base64Char Base64Char Base64Char Base64Char )*
  ( (Base64Char Base64Char Base64Char Base64Char )
  | (Base64Char Base64Char Base64Char ) '='
  | (Base64Char Base64Char ) '=='
  )
  QUOTE
  ;

StringLiteral :
  QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
  ;


BoolLiteral :
  TRUE | FALSE
  ;

TimestampLiteral :
  't' QUOTE
  [0-9] [0-9] [0-9] [0-9] HYPHEN
  ( ('0' [1-9]) | ('1' [012]) ) HYPHEN
  ( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
  'T'
  ( ([01] [0-9]) | ('2' [0-3]) ) COLON
  [0-5] [0-9] COLON
  ([0-5] [0-9] | '60')
  (DOT [0-9]+)?
  'Z'
  QUOTE
  ;

//////////////////////////////////////////////
// Keywords

AND:  'AND' ;
OR:  'OR' ;
NOT:  'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE:  'LIKE' ;
MATCHES:  'MATCHES' ;
ISSUPERSET:  'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST:  'LAST' ;
IN:  'IN' ;
START:  'START' ;
STOP:  'STOP' ;
SECONDS:  'SECONDS' ;
TRUE:  'true' ;
FALSE:  'false' ;
WITHIN:  'WITHIN' ;
REPEATS:  'REPEATS' ;
TIMES:  'TIMES' ;

// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
  [a-zA-Z_] [a-zA-Z0-9_]*
  ;

IdentifierWithHyphen :
  [a-zA-Z_] [a-zA-Z0-9_-]*
  ;

EQ        :   '=' | '==';
NEQ       :   '!=' | '<>';
LT        :   '<';
LE        :   '<=';
GT        :   '>';
GE        :   '>=';

QUOTE     : '\'';
COLON     : ':' ;
DOT       : '.' ;
COMMA     : ',' ;
RPAREN    : ')' ;
LPAREN    : '(' ;
RBRACK    : ']' ;
LBRACK    : '[' ;
PLUS      : '+' ;
HYPHEN    : MINUS ;
MINUS     : '-' ;
POWER_OP  : '^' ;
DIVIDE    : '/' ;
ASTERISK  : '*';

fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];

// Whitespace and comments
//
WS  :  [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
    ;

COMMENT
    :   '/*' .*? '*/' -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;

// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
    : .
    ;

Recommended answer

You're not matching because you don't have the closing ' on the '123.

Here's the token stream for your example (I also included the error message):

[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:28=''',<'''>,1:28]
[@6,29:31='123',<IntPosLiteral>,1:29]
[@7,32:32=']',<']'>,1:32]
[@8,33:32='<EOF>',<EOF>,1:33]
line 1:28 no viable alternative at input 'network-traffic:src_port=''

It does match with the input [network-traffic:src_port = '123']

(I added your | objectPath NOT? (EQ | NEQ) 'wwww' # propTestThlEqual1 alternative to propTest, and it matches the string above.)

Here is the token stream with the closing ' added:

[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:32=''123'',<StringLiteral>,1:28]
[@6,33:33=']',<']'>,1:33]
[@7,34:33='<EOF>',<EOF>,1:34]

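Token rules will choose the longest match. With the closing quote present, StringLiteral matches all of '123' and wins over QUOTE, which matches only the first character. Without the closing quote, StringLiteral cannot match at all, so the lexer falls back to QUOTE followed by IntPosLiteral.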
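The longest-match behaviour can be reproduced outside ANTLR. The following is a minimal Python sketch, not ANTLR itself: the regular expressions are hand-written stand-ins for a handful of the lexer rules above, and the names RULES and tokenize are made up for this illustration. It assumes the rule ANTLR lexers follow: the rule matching the most characters wins, and on a tie the rule listed first wins.

```python
import re

# Hand-written regex stand-ins for a few lexer rules, kept in grammar order.
# This is an illustration only; the real grammar has many more rules.
RULES = [
    ("IntPosLiteral", r"\+?(?:0|[1-9][0-9]*)"),
    ("StringLiteral", r"'(?:[^'\\]|\\'|\\\\)*'"),
    ("IdentifierWithoutHyphen", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("IdentifierWithHyphen", r"[a-zA-Z_][a-zA-Z0-9_-]*"),
    ("EQ", r"==|="),
    ("QUOTE", r"'"),
    ("COLON", r":"),
    ("RBRACK", r"\]"),
    ("LBRACK", r"\["),
]
WHITESPACE = re.compile(r"\s+")  # skipped, like the WS rule


def tokenize(text):
    """Return (rule name, matched text) pairs using longest-match lexing."""
    tokens = []
    pos = 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        best_name, best_text = "InvalidCharacter", ""
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer wins, so on a tie the earlier rule is kept.
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if not best_text:
            # Nothing matched: behave like the catch-all InvalidCharacter rule.
            best_text = text[pos]
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


if __name__ == "__main__":
    for query in ("[network-traffic:src_port = '123]",
                  "[network-traffic:src_port = '123']"):
        print(query)
        for token in tokenize(query):
            print("   ", token)
```

For the first input the sketch produces a QUOTE token followed by an IntPosLiteral, and for the second a single StringLiteral, mirroring the two token streams shown in this answer.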

A comment on your grammar...

You probably want to make QUOTE a fragment, so that it can't be recognized as a token on its own, only within the lexer rules that reference it. (Any rule beginning with a capital letter is a lexer rule; it's customary to make lexer rule names all caps, but it's the first letter that "matters".)

If I change the QUOTE rule to fragment QUOTE: '\'';

then the token stream is (again including the error message):

[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:28=''',<InvalidCharacter>,1:28]
[@6,29:31='123',<IntPosLiteral>,1:29]
[@7,32:32=']',<']'>,1:32]
[@8,33:32='<EOF>',<EOF>,1:33]
line 1:28 no viable alternative at input 'network-traffic:src_port=''

You get the same "no viable alternative" error, but the stray quote is now reported as an InvalidCharacter token (from the catch-all InvalidCharacter : . ; rule), which helps hint at the problem.

As to the question of why you get different results when there is a single alternative on the propTest rule... That's rather interesting. When I have the single alternative, I get an extraneous input ''' expecting { warning on your example, and a mismatched input ']' expecting { warning on the second example in your comments.

Both of these are a result of ANTLR's attempts at better error recovery. (See the sections "Recovering from Errors in SubRules" and "A Parade of Errors" in "The Definitive ANTLR 4 Reference" from Pragmatic Programmers, pretty much a "must have" book if you are going to do much with ANTLR.) It seems pretty obvious now that when ANTLR has multiple rule alternatives, it can't really engage in these recovery attempts. (I did look at the ATN graphs, but they don't really cover these error recovery paths, so the differences were "uninteresting".)

Since you'd only see those warnings with the single-alternative version of your propTest parser rule, dealing with them may actually be "beside the point". Just go with the "no viable alternative" error you'll get for the erroneous input and move on.

FYI... if you want to pursue an option that does make use of these error recovery strategies, but still be made aware of these warnings, you can implement your own ErrorListener class.

I've pretty much always done this just so I was in more control of capturing all errors and warnings and deciding how to manage them in the UI. The default error listener (ConsoleErrorListener) pretty much just spits messages out to the console.
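As a rough illustration of such a listener, here is a sketch that assumes the Python target of ANTLR 4 (the answer does not say which target is in use). The class name CollectingErrorListener and its messages helper are hypothetical. The method names and parameters mirror the ErrorListener base class of the antlr4 Python runtime, which you would normally subclass; the sketch leaves the import out only so that it runs without the runtime installed.

```python
class CollectingErrorListener:
    """Collects syntax errors in a list instead of printing them.

    Hypothetical helper: in a real project, subclass the runtime's
    ErrorListener class rather than duck-typing it as done here.
    """

    def __init__(self):
        self.errors = []  # (line, column, message) tuples

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # Called by the lexer or parser for every reported syntax error.
        self.errors.append((line, column, msg))

    # The remaining callbacks are diagnostics; this sketch ignores them.
    def reportAmbiguity(self, recognizer, dfa, startIndex, stopIndex,
                        exact, ambigAlts, configs):
        pass

    def reportAttemptingFullContext(self, recognizer, dfa, startIndex,
                                    stopIndex, conflictingAlts, configs):
        pass

    def reportContextSensitivity(self, recognizer, dfa, startIndex,
                                 stopIndex, prediction, configs):
        pass

    def messages(self):
        """Format the collected errors the way the console output looks."""
        return [f"line {line}:{column} {msg}"
                for line, column, msg in self.errors]


# Assumed wiring against a parser generated from the STIXPattern grammar:
#   parser.removeErrorListeners()      # drop the default console listener
#   parser.addErrorListener(listener)  # route errors to our collector
#   parser.pattern()                   # parse from the start rule

if __name__ == "__main__":
    listener = CollectingErrorListener()
    # Simulate the callback the runtime would make for the failing input.
    listener.syntaxError(
        None, None, 1, 28,
        "no viable alternative at input 'network-traffic:src_port=''", None)
    print(listener.messages())
```

Once the messages are in a list, the application can decide whether to show them in the UI, log them, or abort the parse.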
