Parsing strings with ANTLR

Question

I define strings as a parser rule rather than a lexer rule because strings may contain escapes with expressions in them, such as "The variable is \(variable)".

string
 : '"' character* '"'
 ;

character
 : escapeSequence
 | .
 ;

escapeSequence
 : '\(' expression ')'
 ;

IDENTIFIER
 : [a-zA-Z][a-zA-Z0-9]*
 ;

WHITESPACE
 : [ \r\t,] -> skip
 ;

This doesn't work because . matches any token rather than any character, so many identifiers will be matched and whitespace will be completely ignored.
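
To make the effect concrete, here is a minimal Python sketch (not ANTLR itself) that mimics token-level matching for the rules above. The regex table and the token names QUOTE, ESC_LPAREN and RPAREN are assumptions standing in for the grammar's implicit literal tokens.

```python
import re

# Hypothetical stand-in for the question's lexer: literal tokens,
# IDENTIFIER, and a WHITESPACE rule whose matches are skipped.
TOKEN_RE = re.compile(r'''
    (?P<QUOTE>")
  | (?P<ESC_LPAREN>\\\()
  | (?P<RPAREN>\))
  | (?P<IDENTIFIER>[a-zA-Z][a-zA-Z0-9]*)
  | (?P<WHITESPACE>[ \r\t,])
''', re.VERBOSE)


def tokenize(text):
    """Split text into (type, text) pairs, dropping WHITESPACE like '-> skip'."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"no token at position {pos}")
        if match.lastgroup != "WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize(r'"The variable is \(variable)"')
# The parser only ever sees these tokens, so the spaces are already lost.
rebuilt = "".join(text for _, text in tokens)
print(rebuilt)  # "Thevariableis\(variable)"
```

Joining the surviving tokens shows that the string's spaces are gone before any parser rule runs, which is why a parser-level string rule cannot recover the original text.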

How can I parse strings that can have expressions inside of them?

Looking into the parsers for Swift and JavaScript, both languages that support things like this, I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.

Recommended answer

This problem can be approached using lexical modes: one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside switches to the inside mode, and seeing a \( or " on the inside switches back outside. The only complicated part is seeing a ) on the outside: sometimes it should switch back to the inside (because it corresponds to a \() and sometimes it shouldn't (when it corresponds to a plain ().

The most basic way to achieve this would be like this:

Lexer:

lexer grammar StringLexer;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;

mode IN_STRING;

TEXT: ~[\\"]+ ;

BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);

ESCAPE_SEQUENCE: '\\' . ;

DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;

Parser:

parser grammar StringParser;

options {
    tokenVocab = 'StringLexer';
}

start: exp EOF ;

exp : '(' exp ')'
    | IDENTIFIER
    | DQUOTE stringContents* DQUOTE
    ;

stringContents : TEXT
               | ESCAPE_SEQUENCE
               | '\\(' exp ')'
               ;

Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed ( left since the last \(.
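
The push/pop behaviour can be traced with a small hand-written simulation. This is a sketch in Python, not ANTLR's implementation: it assumes a current mode plus a stack of saved modes, and it tries rules in order instead of using longest-match (the two coincide for these particular rules). The `RULES` table and `tokenize` function are illustrative names.

```python
import re

# (token type, regex, action) per mode. The action is None, "pop",
# or the name of the mode to push. Token names mirror the lexer grammar.
RULES = {
    "DEFAULT_MODE": [
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
        ("DQUOTE", r'"', "IN_STRING"),
        ("LPAR", r"\(", "DEFAULT_MODE"),
        ("RPAR", r"\)", "pop"),
    ],
    "IN_STRING": [
        ("TEXT", r'[^\\"]+', None),
        ("BACKSLASH_PAREN", r"\\\(", "DEFAULT_MODE"),
        ("ESCAPE_SEQUENCE", r"\\.", None),
        ("DQUOTE", r'"', "pop"),
    ],
}


def tokenize(text):
    """Return (type, text) pairs, switching modes as each token is emitted."""
    mode, stack, pos, tokens = "DEFAULT_MODE", [], 0, []
    while pos < len(text):
        for name, pattern, action in RULES[mode]:
            match = re.compile(pattern, re.DOTALL).match(text, pos)
            if match:
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in {mode}")
        tokens.append((name, match.group()))
        pos = match.end()
        if action == "pop":
            mode = stack.pop()      # restore the mode saved by the last push
        elif action is not None:
            stack.append(mode)      # save the current mode, then switch
            mode = action
    return tokens


for token in tokenize(r'"a \((x)) b"'):
    print(token)
```

In this input the first ) pops back to the default mode (it closes the plain parenthesis), and only the second ) pops back into the string, so ` b` is lexed as TEXT rather than as code.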

This approach works, but has the downside that an unmatched ) will cause an empty stack exception rather than a normal syntax error because we're calling popMode on an empty stack.

To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):

@members {
    int nesting = 0;
}

LPAR: '(' {
    nesting++;
    pushMode(DEFAULT_MODE);
};
RPAR: ')' {
    if (nesting > 0) {
        nesting--;
        popMode();
    }
};

mode IN_STRING;

BACKSLASH_PAREN: '\\(' {
    nesting++;
    pushMode(DEFAULT_MODE);
};

(The parts I left out are the same as in the previous version).
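
As a rough illustration of what the counter changes, the following Python sketch simulates the lexer with the `nesting` member. It is an assumption-laden model rather than generated ANTLR code: the class `CountingLexer` and its method names are hypothetical, and rules are tried in order instead of by longest match.

```python
import re


class CountingLexer:
    """Hand-written model of the mode stack plus a nesting counter."""

    # (token type, regex, name of the method to run afterwards or None)
    RULES = {
        "DEFAULT_MODE": [
            ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
            ("DQUOTE", r'"', "enter_string"),
            ("LPAR", r"\(", "open_paren"),
            ("RPAR", r"\)", "close_paren"),
        ],
        "IN_STRING": [
            ("TEXT", r'[^\\"]+', None),
            ("BACKSLASH_PAREN", r"\\\(", "open_paren"),
            ("ESCAPE_SEQUENCE", r"\\.", None),
            ("DQUOTE", r'"', "leave_string"),
        ],
    }

    def __init__(self):
        self.mode, self.stack, self.nesting = "DEFAULT_MODE", [], 0

    def push_mode(self, mode):
        self.stack.append(self.mode)
        self.mode = mode

    def pop_mode(self):
        self.mode = self.stack.pop()

    def enter_string(self):
        self.push_mode("IN_STRING")

    def leave_string(self):
        self.pop_mode()

    def open_paren(self):           # shared by LPAR and BACKSLASH_PAREN
        self.nesting += 1
        self.push_mode("DEFAULT_MODE")

    def close_paren(self):          # RPAR: pop only if something is open
        if self.nesting > 0:
            self.nesting -= 1
            self.pop_mode()

    def tokenize(self, text):
        pos, tokens = 0, []
        while pos < len(text):
            for name, pattern, action in self.RULES[self.mode]:
                match = re.compile(pattern, re.DOTALL).match(text, pos)
                if match:
                    break
            else:
                raise ValueError(f"no rule matches at position {pos}")
            tokens.append((name, match.group()))
            pos = match.end()
            if action is not None:
                getattr(self, action)()
        return tokens


# The extra ')' is still emitted as RPAR, so the parser can report it.
print(CountingLexer().tokenize("(a))"))
```

Because the unmatched ) is now just an ordinary RPAR token, rejecting it becomes the parser's job, which is what turns the exception into a normal syntax error.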

This works and produces normal syntax errors for unmatched )s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).

If you want to avoid actions, the last alternative is to have three modes: one for code that's outside of any strings, one for the inside of the string and one for the inside of \(). The third one is almost identical to the outer one, except that it pushes and pops the mode when seeing parentheses, whereas the outer one does not. To make both modes produce the same types of tokens, the rules in the third mode all call type(). It looks like this:

lexer grammar StringLexer;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';

mode IN_STRING;

TEXT: ~[\\"]+ ;

BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);

ESCAPE_SEQUENCE: '\\' . ;

DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;

mode EMBEDDED;

E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;

Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR instead of '(' in the parser and so on (we already had to do this for DQUOTE for the same reason).
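
To check how the three modes interact, here is a Python sketch that models the action-free lexer above. It is a simplified simulation under the same assumptions as before (a current mode plus a stack of saved modes, rules tried in order); the `RULES` layout and `tokenize` are illustrative, and the E_ rules are listed directly under the token types they are retagged to.

```python
import re

IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

# (token type, regex, mode to push or None, pop afterwards?)
RULES = {
    "DEFAULT_MODE": [
        ("IDENTIFIER", IDENT, None, False),
        ("DQUOTE", r'"', "IN_STRING", False),
        ("LPAR", r"\(", None, False),        # no mode change outside strings
        ("RPAR", r"\)", None, False),
    ],
    "IN_STRING": [
        ("TEXT", r'[^\\"]+', None, False),
        ("BACKSLASH_PAREN", r"\\\(", "EMBEDDED", False),
        ("ESCAPE_SEQUENCE", r"\\.", None, False),
        ("DQUOTE", r'"', None, True),
    ],
    "EMBEDDED": [
        ("IDENTIFIER", IDENT, None, False),   # E_IDENTIFIER -> type(IDENTIFIER)
        ("DQUOTE", r'"', "IN_STRING", False),  # E_DQUOTE
        ("LPAR", r"\(", "EMBEDDED", False),   # E_LPAR
        ("RPAR", r"\)", None, True),          # E_RPAR
    ],
}


def tokenize(text):
    """Return (type, text, mode after the token) triples."""
    mode, stack, pos, trace = "DEFAULT_MODE", [], 0, []
    while pos < len(text):
        for name, pattern, push, pop in RULES[mode]:
            match = re.compile(pattern, re.DOTALL).match(text, pos)
            if match:
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in {mode}")
        pos = match.end()
        if push is not None:
            stack.append(mode)
            mode = push
        elif pop:
            mode = stack.pop()
        trace.append((name, match.group(), mode))
    return trace


for entry in tokenize(r'"\((x))"'):
    print(entry)
```

In the outer mode a ) never touches the stack, so an unmatched one cannot cause a pop on an empty stack; only parentheses seen inside \( ... ) push and pop.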

Since this version involves a lot of duplication (especially as the number of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.

The full code for all three alternatives can also be found on GitHub.
