ANTLR does not match the patterns properly when there are similar patterns


Problem description


I am using ANTLR to parse some queries.

Here is my ANTLR g4:

propTest
  : objectPath NOT? (EQ|NEQ) primitiveLiteral    # propTestEqual
  | objectPath NOT? (EQ|NEQ) 'wwww'              # propTestThlEqual
  ;

primitiveLiteral
  : orderableLiteral
  | BoolLiteral
  ;

primitiveLiteral
  : orderableLiteral
  ;

orderableLiteral
  : StringLiteral
  ;

StringLiteral
  : QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
  ;

The issue appears when I feed it:

[network-traffic:src_port = '123]

I expect the match to happen on

: objectPath NOT? (EQ|NEQ) primitiveLiteral       # propTestEqual

but it does not match anything. However, as soon as I remove

| objectPath NOT? (EQ|NEQ) 'wwww'   # propTestThlEqual

then the match happens on

: objectPath NOT? (EQ|NEQ) primitiveLiteral       # propTestEqual

Any idea what is going on?

** Update: full grammar

grammar STIXPattern;

pattern
  : observationExpressions EOF
  ;

observationExpressions
  : <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
  | observationExpressionOr                                               #observationExpressionOr_
  ;

observationExpressionOr
  : <assoc=left> observationExpressionOr OR observationExpressionOr     #observationExpressionOred
  | observationExpressionAnd                                            #observationExpressionAnd_
  ;

observationExpressionAnd
  : <assoc=left> observationExpressionAnd AND observationExpressionAnd  #observationExpressionAnded
  | observationExpression                                               #observationExpression_
  ;

observationExpression
  : LBRACK comparisonExpression RBRACK        # observationExpressionSimple
  | LPAREN observationExpressions RPAREN      # observationExpressionCompound
  | observationExpression startStopQualifier  # observationExpressionStartStop
  | observationExpression withinQualifier     # observationExpressionWithin
  | observationExpression repeatedQualifier   # observationExpressionRepeated
  ;

comparisonExpression
  : <assoc=left> comparisonExpression OR comparisonExpression         #comparisonExpressionOred
  | comparisonExpressionAnd                                           #comparisonExpressionAnd_
  ;

comparisonExpressionAnd
  : <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd  #comparisonExpressionAnded
  | propTest                                                          #comparisonExpressionAndpropTest
  ;

propTest
  : objectPath NOT? (EQ|NEQ) primitiveLiteral       # propTestEqual
  | objectPath NOT? (EQ|NEQ) objectPathThl    # propTestThlEqual

  ;

startStopQualifier
  : START TimestampLiteral STOP TimestampLiteral
  ;

withinQualifier
  : WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
  ;

repeatedQualifier
  : REPEATS IntPosLiteral TIMES
  ;

objectPath
  : objectType COLON firstPathComponent objectPathComponent?
  ;

objectPathThl
  : varThlType DOT firstPathComponent objectPathComponent?
  ;

objectType
  : IdentifierWithoutHyphen
  | IdentifierWithHyphen
  ;

varThlType
  : IdentifierWithoutHyphen
  | IdentifierWithHyphen
  ;

firstPathComponent
  : IdentifierWithoutHyphen
  | StringLiteral
  ;

objectPathComponent
  : <assoc=left> objectPathComponent objectPathComponent  # pathStep
  | '.' (IdentifierWithoutHyphen | StringLiteral)         # keyPathStep
  | LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK  # indexPathStep
  ;

setLiteral
  : LPAREN RPAREN
  | LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
  ;

primitiveLiteral
  : orderableLiteral
  | BoolLiteral
  ;

orderableLiteral
  : IntPosLiteral
  | IntNegLiteral
  | FloatPosLiteral
  | FloatNegLiteral
  | StringLiteral
  | BinaryLiteral
  | HexLiteral
  | TimestampLiteral
  ;

IntNegLiteral :
  '-' ('0' | [1-9] [0-9]*)
  ;

IntPosLiteral :
  '+'? ('0' | [1-9] [0-9]*)
  ;

FloatNegLiteral :
  '-' [0-9]* '.' [0-9]+
  ;

FloatPosLiteral :
  '+'? [0-9]* '.' [0-9]+
  ;

HexLiteral :
  'h' QUOTE TwoHexDigits* QUOTE
  ;

BinaryLiteral :
  'b' QUOTE
  ( Base64Char Base64Char Base64Char Base64Char )*
  ( (Base64Char Base64Char Base64Char Base64Char )
  | (Base64Char Base64Char Base64Char ) '='
  | (Base64Char Base64Char ) '=='
  )
  QUOTE
  ;

StringLiteral :
  QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
  ;


BoolLiteral :
  TRUE | FALSE
  ;

TimestampLiteral :
  't' QUOTE
  [0-9] [0-9] [0-9] [0-9] HYPHEN
  ( ('0' [1-9]) | ('1' [012]) ) HYPHEN
  ( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
  'T'
  ( ([01] [0-9]) | ('2' [0-3]) ) COLON
  [0-5] [0-9] COLON
  ([0-5] [0-9] | '60')
  (DOT [0-9]+)?
  'Z'
  QUOTE
  ;

//////////////////////////////////////////////
// Keywords

AND:  'AND' ;
OR:  'OR' ;
NOT:  'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE:  'LIKE' ;
MATCHES:  'MATCHES' ;
ISSUPERSET:  'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST:  'LAST' ;
IN:  'IN' ;
START:  'START' ;
STOP:  'STOP' ;
SECONDS:  'SECONDS' ;
TRUE:  'true' ;
FALSE:  'false' ;
WITHIN:  'WITHIN' ;
REPEATS:  'REPEATS' ;
TIMES:  'TIMES' ;

// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
  [a-zA-Z_] [a-zA-Z0-9_]*
  ;

IdentifierWithHyphen :
  [a-zA-Z_] [a-zA-Z0-9_-]*
  ;

EQ        :   '=' | '==';
NEQ       :   '!=' | '<>';
LT        :   '<';
LE        :   '<=';
GT        :   '>';
GE        :   '>=';

QUOTE     : '\'';
COLON     : ':' ;
DOT       : '.' ;
COMMA     : ',' ;
RPAREN    : ')' ;
LPAREN    : '(' ;
RBRACK    : ']' ;
LBRACK    : '[' ;
PLUS      : '+' ;
HYPHEN    : MINUS ;
MINUS     : '-' ;
POWER_OP  : '^' ;
DIVIDE    : '/' ;
ASTERISK  : '*';

fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];

// Whitespace and comments
//
WS  :  [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
    ;

COMMENT
    :   '/*' .*? '*/' -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;

// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
    : .
    ;

Solution

You're not matching because the '123 in your input has no closing '.

Here's the token stream for your example (I also included the error message):

[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:28=''',<'''>,1:28]
[@6,29:31='123',<IntPosLiteral>,1:29]
[@7,32:32=']',<']'>,1:32]
[@8,33:32='<EOF>',<EOF>,1:33]
line 1:28 no viable alternative at input 'network-traffic:src_port=''

It matches fine with the input [network-traffic:src_port = '123']

(I added your | objectPath NOT? (EQ | NEQ) 'wwww' # propTestThlEqual1 alternative to propTest, and it matches the string above.)

This is the token stream with the closing ' added:

[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:32=''123'',<StringLiteral>,1:28]
[@6,33:33=']',<']'>,1:33]
[@7,34:33='<EOF>',<EOF>,1:34]

Token rules will choose the longest match.
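The longest-match behaviour can be illustrated outside ANTLR with a rough simulation. The sketch below is not ANTLR itself: it assumes Python regular expressions can stand in for a small subset of the lexer rules above, and the names RULES and tokenize are hypothetical. At each position it tries every rule, keeps the longest match, and breaks ties in favour of the rule defined first, which is how the ANTLR lexer decides.

```python
import re

# Subset of the lexer rules from the grammar, in definition order.
# Assumption: these regexes approximate the corresponding ANTLR rules.
RULES = [
    ("IntPosLiteral", r"\+?(?:0|[1-9][0-9]*)"),
    ("StringLiteral", r"'(?:[^'\\]|\\'|\\\\)*'"),
    ("IdentifierWithoutHyphen", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("IdentifierWithHyphen", r"[a-zA-Z_][a-zA-Z0-9_-]*"),
    ("EQ", r"==|="),
    ("QUOTE", r"'"),
    ("COLON", r":"),
    ("RBRACK", r"\]"),
    ("LBRACK", r"\["),
    ("WS", r"\s+"),
    ("InvalidCharacter", r"."),  # catch-all, defined last
]


def tokenize(text, rules=RULES):
    """Return (rule_name, lexeme) pairs using longest match, first rule on ties."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earlier rule wins when two rules match the same length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


unterminated = tokenize("[network-traffic:src_port = '123]")
terminated = tokenize("[network-traffic:src_port = '123']")
# Simulates turning QUOTE into a fragment: it is no longer a token rule.
without_quote = tokenize("'123]", [r for r in RULES if r[0] != "QUOTE"])

print(unterminated)
print(terminated)
print(without_quote)
```

With the unterminated input, StringLiteral cannot match at the quote, so the one-character QUOTE rule wins and 123 becomes a separate IntPosLiteral. With the closing quote present, StringLiteral matches five characters and beats QUOTE. Once QUOTE is removed from the token rules, the stray quote falls through to InvalidCharacter.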

A comment on your grammar...

You probably want to make QUOTE a fragment, so that it can't be recognized as a token on its own (only within the lexer rules that reference it). Any rule whose name begins with a capital letter is a lexer rule; it's customary to write lexer rule names in all caps, but it's the first letter that matters.

If I change the QUOTE rule to fragment QUOTE: '\'';

then the token stream is (again including the error message):

[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:28=''',<InvalidCharacter>,1:28]
[@6,29:31='123',<IntPosLiteral>,1:29]
[@7,32:32=']',<']'>,1:32]
[@8,33:32='<EOF>',<EOF>,1:33]
line 1:28 no viable alternative at input 'network-traffic:src_port=''

You get the same "no viable alternative" error, but you also get an InvalidCharacter token (from the InvalidCharacter: .; rule) that helps hint at the problem.


As to the question of why you get different results when there is a single alternative on the propTest rule... That's rather interesting. With the single alternative, I get an extraneous input ''' expecting { warning on your example, and a mismatched input ']' expecting { warning on the second example in your comments.

Both of these are a result of ANTLR's attempts at better error recovery. (See the sections "Recovering from Errors in SubRules" and "A Parade of Errors" in "The Definitive ANTLR 4 Reference" from Pragmatic Programmers, pretty much a "must have" book if you are going to do much with ANTLR.) It now seems fairly clear that when ANTLR has multiple rule alternatives to choose from, it can't really engage in these recovery attempts. (I did look at the ATN graphs, but they don't really cover these error recovery paths, so the differences were "uninteresting".)

Since you'd only see those warnings with the single alternative version of your propTest parser rule, dealing with them may actually be "beside the point". Just go with the no viable alternative error you'll get for the erroneous input and move on.

FYI... if you want to pursue an option that does use these error recovery strategies, but still be made aware of these warnings, you can implement your own ErrorListener class.

I've pretty much always done this just so I was in more control of capturing all errors and warnings and deciding how to manage them in the UI. The default error listener (ConsoleErrorListener) pretty much just prints messages to the console.
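As a minimal sketch of such a listener, assuming the Python target of the ANTLR 4 runtime: the class name CollectingErrorListener is hypothetical, and the fallback stub exists only so the snippet runs when the runtime is not installed. It overrides syntaxError to store each message instead of printing it.

```python
try:
    # Base class from the ANTLR 4 Python runtime (antlr4-python3-runtime).
    from antlr4.error.ErrorListener import ErrorListener
except ImportError:
    # Assumption: stand-in stub so this sketch runs without the runtime.
    class ErrorListener:
        def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
            pass


class CollectingErrorListener(ErrorListener):
    """Hypothetical listener that collects syntax errors for later handling."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # Same "line L:C message" shape as the console output shown earlier.
        self.errors.append(f"line {line}:{column} {msg}")


# With a generated parser, the listener would be attached roughly like this
# (parser is assumed to be an instance of the generated STIXPatternParser):
#   parser.removeErrorListeners()
#   parser.addErrorListener(listener)
listener = CollectingErrorListener()

# Direct call to show what gets recorded; the parser normally makes this call.
listener.syntaxError(None, None, 1, 28,
                     "no viable alternative at input 'network-traffic:src_port=''",
                     None)
print(listener.errors)
```

The same listener can be attached to the lexer as well, so that lexer and parser messages end up in one place and the application decides how to present them.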
