Parse string antlr


Question

I define strings as a parser rule rather than a lexer rule because strings may contain escapes with expressions in them, such as "The variable is \(variable)".

string
 : '"' character* '"'
 ;

character
 : escapeSequence
 | .
 ;

escapeSequence
 : '\(' expression ')'
 ;

IDENTIFIER
 : [a-zA-Z][a-zA-Z0-9]*
 ;

WHITESPACE
 : [ \r\t,] -> skip
 ;

This doesn't work because . matches any token rather than any character, so runs of letters are matched as whole IDENTIFIER tokens and whitespace is dropped entirely.
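To make the problem concrete, here is a small Python sketch of what the parser gets to see. It is not ANTLR output: the regular expression is a hand-written approximation of the IDENTIFIER and WHITESPACE rules above plus the implicit tokens for the literals '"', '\(' and ')', and the name `tokenize` is made up for this illustration.

```python
import re

# Approximation of the question's token rules (an assumption, not ANTLR itself).
TOKEN = re.compile(
    r'(?P<WHITESPACE>[ \r\t,])'               # skipped, like "-> skip"
    r'|(?P<IDENTIFIER>[a-zA-Z][a-zA-Z0-9]*)'
    r'|(?P<LITERAL>\\\(|[")])'                # implicit tokens from parser literals
)

def tokenize(text):
    """Return the token texts the parser would receive for `text`."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WHITESPACE":
            tokens.append(m.group())
        pos = m.end()
    return tokens

print(tokenize(r'"The variable is \(variable)"'))
```

The words The, variable and is arrive as three separate tokens, and the spaces between them are gone before the parser rule `character` ever runs, so the original string contents cannot be reconstructed.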

How can I parse strings that can have expressions inside of them?

I looked into the parsers for Swift and JavaScript, both languages that support features like this, but I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.

Recommended answer

This problem can be approached using lexical modes: one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside switches to the inside mode, and seeing a \( or " on the inside switches back outside. The only complicated part is seeing a ) on the outside: sometimes it should switch back to the inside (because it corresponds to a \() and sometimes it shouldn't (when it corresponds to a plain ().

The most basic way to achieve this looks like this:

Lexer:

lexer grammar StringLexer;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;

mode IN_STRING;

TEXT: ~[\\"]+ ;

BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);

ESCAPE_SEQUENCE: '\\' . ;

DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;

Parser:

parser grammar StringParser;

options {
    tokenVocab = 'StringLexer';
}

start: exp EOF ;

exp : '(' exp ')'
    | IDENTIFIER
    | DQUOTE stringContents* DQUOTE
    ;

stringContents : TEXT
               | ESCAPE_SEQUENCE
               | '\\(' exp ')'
               ;

Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way the lexer goes back inside the string only if the mode on top of the stack is the string mode, which is only the case if there are no unclosed ( left since the last \(.
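The push/pop behaviour can be traced with a small Python simulation. This is only a sketch that mimics the mode stack by hand, not ANTLR's generated lexer: the `RULES` table and the `tokenize` function are names invented for this illustration, and Python's `IndexError` stands in for the empty stack exception.

```python
import re

# Hand-written mirror of the lexer rules: (token type, regex, action).
# An action is ("push", mode), ("pop",) or None.
RULES = {
    "DEFAULT": [
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
        ("DQUOTE", r'"', ("push", "IN_STRING")),
        ("LPAR", r"\(", ("push", "DEFAULT")),
        ("RPAR", r"\)", ("pop",)),
    ],
    "IN_STRING": [
        ("TEXT", r'[^\\"]+', None),
        ("BACKSLASH_PAREN", r"\\\(", ("push", "DEFAULT")),
        ("ESCAPE_SEQUENCE", r"\\.", None),
        ("DQUOTE", r'"', ("pop",)),   # DQUOTE_IN_STRING retyped to DQUOTE
    ],
}

def tokenize(text):
    mode, stack, pos, tokens = "DEFAULT", [], 0, []
    while pos < len(text):
        best = None
        # Longest match wins; on a tie the earlier rule wins.
        for name, pattern, action in RULES[mode]:
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        name, m, action = best
        tokens.append((name, m.group()))
        pos = m.end()
        if action and action[0] == "push":
            stack.append(mode)        # remember where to return to
            mode = action[1]
        elif action:                  # pop
            if not stack:
                raise IndexError("popMode on an empty mode stack")
            mode = stack.pop()
    return tokens

for token in tokenize('"x is \\((x))!"'):
    print(token)
```

In the sample input, the first ) pops back to the default mode (it closes the plain parenthesis) and the second one pops back into the string, so the trailing ! is lexed as TEXT. Calling `tokenize("x)")` raises in this sketch, because the ) tries to pop an empty stack.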

This approach works, but it has the downside that an unmatched ) causes an empty stack exception rather than a normal syntax error, because we call popMode on an empty stack.

To avoid this, we can add a member that tracks how deeply nested we are inside parentheses, and skip the pop when the nesting level is 0 (i.e. when the stack is empty):

@members {
    int nesting = 0;
}

LPAR: '(' {
    nesting++;
    pushMode(DEFAULT_MODE);
};
RPAR: ')' {
    if (nesting > 0) {
        nesting--;
        popMode();
    }
};

mode IN_STRING;

BACKSLASH_PAREN: '\\(' {
    nesting++;
    pushMode(DEFAULT_MODE);
};

(The parts I left out are the same as in the previous version.)
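The effect of the counter can be checked in isolation with a short Python sketch. The `ModeStack` class and its method names are hypothetical and exist only for this illustration; each method corresponds to one of the lexer rules named in its comment.

```python
class ModeStack:
    """Mimics the lexer's mode stack together with the `nesting` member."""

    def __init__(self):
        self.mode = "DEFAULT"
        self.stack = []
        self.nesting = 0

    def open_string(self):    # DQUOTE seen in the default mode
        self.stack.append(self.mode)
        self.mode = "IN_STRING"

    def close_string(self):   # DQUOTE_IN_STRING
        self.mode = self.stack.pop()

    def open_paren(self):     # LPAR or BACKSLASH_PAREN
        self.nesting += 1
        self.stack.append(self.mode)
        self.mode = "DEFAULT"

    def close_paren(self):    # RPAR
        if self.nesting > 0:
            self.nesting -= 1
            self.mode = self.stack.pop()
        # Otherwise the stack is left alone; the RPAR token is still
        # emitted and the parser reports the syntax error.

s = ModeStack()
s.close_paren()               # unmatched ")" at top level: no exception
print(s.mode, s.stack)

s.open_string()               # "
s.open_paren()                # \(
s.close_paren()               # ) closes the \( and returns into the string
print(s.mode)
```

The guard is safe because a ) is only ever lexed in the default mode, and in that mode every string still open on the stack has an unclosed \( above it, so a non-empty stack implies nesting > 0.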

This works and produces normal syntax errors for unmatched )s. However, it contains actions and is therefore no longer language-agnostic. That is only a problem if you plan to use the grammar from multiple target languages (and, depending on the languages, you might get lucky and the action code might be valid in all of them).

If you want to avoid actions, the last alternative is to use three modes: one for code outside of any string, one for the inside of a string, and one for the inside of \(). The third is almost identical to the outer one, except that it pushes and pops the mode when it sees parentheses, whereas the outer one does not. To make both modes produce the same token types, every rule in the third mode calls type(). It looks like this:

lexer grammar StringLexer;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';

mode IN_STRING;

TEXT: ~[\\"]+ ;

BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);

ESCAPE_SEQUENCE: '\\' . ;

DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;

mode EMBEDDED;

E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;
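As a rough check of the three-mode design, the same kind of hand-written Python simulation can be used. Again this is a sketch under assumptions rather than ANTLR itself: `RULES` and `tokenize` are illustrative names, and the `type(...)` commands are modelled by simply giving the EMBEDDED rules the token type they are retyped to.

```python
import re

# (token type after type(...), regex, action); action is ("push", mode),
# ("pop",) or None. Top-level parentheses carry no action at all.
RULES = {
    "DEFAULT": [
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
        ("DQUOTE", r'"', ("push", "IN_STRING")),
        ("LPAR", r"\(", None),
        ("RPAR", r"\)", None),
    ],
    "IN_STRING": [
        ("TEXT", r'[^\\"]+', None),
        ("BACKSLASH_PAREN", r"\\\(", ("push", "EMBEDDED")),
        ("ESCAPE_SEQUENCE", r"\\.", None),
        ("DQUOTE", r'"', ("pop",)),
    ],
    "EMBEDDED": [
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
        ("DQUOTE", r'"', ("push", "IN_STRING")),
        ("LPAR", r"\(", ("push", "EMBEDDED")),
        ("RPAR", r"\)", ("pop",)),
    ],
}

def tokenize(text):
    mode, stack, pos, tokens = "DEFAULT", [], 0, []
    while pos < len(text):
        best = None
        # Longest match wins; on a tie the earlier rule wins.
        for name, pattern, action in RULES[mode]:
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        name, m, action = best
        tokens.append((name, m.group()))
        pos = m.end()
        if action and action[0] == "push":
            stack.append(mode)
            mode = action[1]
        elif action:
            mode = stack.pop()   # only reachable with a non-empty stack
    return tokens

print(tokenize("f)"))                 # unmatched ")" is just an RPAR token
print(tokenize('"a\\((b))c"'))
```

Because ) only pops inside EMBEDDED, and EMBEDDED can only be entered by a push, an unmatched ) at top level never touches the stack and is left for the parser to reject.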

Note that we can no longer use string literals in the parser grammar, because a string literal cannot be used when multiple lexer rules are defined with that same literal. So we now have to write LPAR instead of '(' in the parser, and so on (we already had to do this for DQUOTE for the same reason).

Since this version involves a lot of duplication (especially as the number of tokens grows) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.

The full code for all three alternatives can also be found on GitHub.
