How can I parse a special character differently in two terminal rules using ANTLR?
Problem description
I have a grammar that uses the $ character at the start of many terminal rules, such as $video{, $audio{, $image{, $link{ and others like them.
However, I'd also like to match all the $, { and } characters that don't match these rules. For example, my grammar does not properly match $100 in the CHUNK rule, but adding $ to the long list of acceptable characters in CHUNK causes the other production rules to break.
How can I change my grammar so that it's smart enough to distinguish normal $, { and } characters from my special production rules?
Basically, what I'd like to be able to do is say, "if the $ character doesn't have {, video, image, audio, link, etc. after it, then it should go to CHUNK".
grammar Text;
@header {
}
@lexer::members {
private boolean readLabel = false;
private boolean readUrl = false;
}
@members {
private int numberOfVideos = 0;
private int numberOfAudios = 0;
private StringBuilder builder = new StringBuilder();
public String getResult() {
return builder.toString();
}
}
text
: expression*
;
expression
: fillInTheBlank
{
builder.append($fillInTheBlank.value);
}
| image
{
builder.append($image.value);
}
| video
{
builder.append($video.value);
}
| audio
{
builder.append($audio.value);
}
| link
{
builder.append($link.value);
}
| everythingElse
{
builder.append($everythingElse.value);
}
;
fillInTheBlank returns [String value]
: BEGIN_INPUT LABEL END_COMMAND
{
$value = "<input type=\"text\" id=\"" +
$LABEL.text +
"\" name=\"" +
$LABEL.text +
"\" class=\"FillInTheBlankAnswer\" />";
}
;
image returns [String value]
: BEGIN_IMAGE URL END_COMMAND
{
$value = "<img src=\"" + $URL.text + "\" />";
}
;
video returns [String value]
: BEGIN_VIDEO URL END_COMMAND
{
numberOfVideos++;
StringBuilder b = new StringBuilder();
b.append("<div id=\"video" + numberOfVideos + "\">Loading the player ...</div>\r\n");
b.append("<script type=\"text/javascript\">\r\n");
b.append("\tjwplayer(\"video" + numberOfVideos + "\").setup({\r\n");
b.append("\t\tflashplayer: \"/trainingdividend/js/jwplayer/player.swf\", file: \"");
b.append($URL.text);
b.append("\"\r\n\t});\r\n");
b.append("</script>\r\n");
$value = b.toString();
}
;
audio returns [String value]
: BEGIN_AUDIO URL END_COMMAND
{
numberOfAudios++;
StringBuilder b = new StringBuilder();
b.append("<p id=\"audioplayer_");
b.append(numberOfAudios);
b.append("\">Alternative content</p>\r\n");
b.append("<script type=\"text/javascript\">\r\n");
b.append("\tAudioPlayer.embed(\"audioplayer_");
b.append(numberOfAudios);
b.append("\", {soundFile: \"");
b.append($URL.text);
b.append("\"});\r\n");
b.append("</script>\r\n");
$value = b.toString();
}
;
link returns [String value]
: BEGIN_LINK URL END_COMMAND
{
$value = "<a href=\"" + $URL.text + "\">" + $URL.text + "</a>";
}
;
everythingElse returns [String value]
: CHUNK
{
$value = $CHUNK.text;
}
;
BEGIN_INPUT
: '${'
{
readLabel = true;
}
;
BEGIN_IMAGE
: '$image{'
{
readUrl = true;
}
;
BEGIN_VIDEO
: '$video{'
{
readUrl = true;
}
;
BEGIN_AUDIO
: '$audio{'
{
readUrl = true;
}
;
BEGIN_LINK
: '$link{'
{
readUrl = true;
}
;
END_COMMAND
: { readLabel || readUrl }?=> '}'
{
readLabel = false;
readUrl = false;
}
;
URL
: { readUrl }?=> 'http://' ('a'..'z'|'A'..'Z'|'0'..'9'|'.'|'/'|'-'|'_'|'%'|'&'|'?'|':')+
;
LABEL
: { readLabel }?=> ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
;
CHUNK
//: (~('${'|'$video{'|'$image{'|'$audio{'))+
: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|'\t'|'\n'|'\r'|'-'|','|'.'|'?'|'\''|':'|'\"'|'>'|'<'|'/'|'_'|'='|';'|'('|')'|'&'|'!'|'#'|'%'|'*')+
;
Recommended answer
You can't negate more than a single character: ~ applies to single characters (or sets of them), not to multi-character literals. So, the following is invalid:
~('${')
But why not simply add '$', '{' and '}' to your CHUNK rule and remove the + at the end of the CHUNK rule (otherwise it would gobble up too much, possibly a '$video{' further in the source, as you have noticed yourself already)?
Now a CHUNK token will always consist of a single character, but you could create a production rule to fix this:
chunk
: CHUNK+
;
and use chunk in your production rules instead of CHUNK (or use CHUNK+, of course).
Input like "{ } $foo $video{" would be tokenized as follows:
CHUNK {
CHUNK
CHUNK }
CHUNK
CHUNK $
CHUNK f
CHUNK o
CHUNK o
CHUNK
BEGIN_VIDEO $video{
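To make that token listing easier to follow, here is a minimal plain-Java sketch of the idea: try the multi-character command prefixes first, and otherwise emit exactly one character as CHUNK. It is a hand-written simulation, not code generated by ANTLR. The class name ChunkLexerSketch and the tokenize method are hypothetical, the readLabel/readUrl predicates and the URL, LABEL and END_COMMAND rules are left out, and every character is accepted as a CHUNK (the real rule only accepts its listed characters).

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written simulation of the lexing idea; this is not ANTLR-generated code.
final class ChunkLexerSketch {

    // Command prefixes copied from the grammar's BEGIN_* rules.
    private static final String[][] COMMANDS = {
        {"BEGIN_IMAGE", "$image{"},
        {"BEGIN_VIDEO", "$video{"},
        {"BEGIN_AUDIO", "$audio{"},
        {"BEGIN_LINK",  "$link{"},
        {"BEGIN_INPUT", "${"},
    };

    // Returns the tokens formatted as "TYPE text".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String matched = null;
            for (String[] command : COMMANDS) {
                if (input.startsWith(command[1], pos)) {
                    tokens.add(command[0] + " " + command[1]);
                    matched = command[1];
                    break;
                }
            }
            if (matched != null) {
                pos += matched.length();
            } else {
                // No command starts here: emit exactly one character as CHUNK.
                tokens.add("CHUNK " + input.charAt(pos));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("{ } $foo $video{")) {
            System.out.println(token);
        }
    }
}
```

Because a CHUNK never spans more than one character, a '$' that does not start a command cannot swallow a command that begins later in the input.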
EDIT
And if you let your parser output an AST, you can easily merge all the text that one or more CHUNK tokens match into a single AST node whose inner token is of type CHUNK, like this:
grammar Text;
options {
output=AST;
}
...
chunk
: CHUNK+ -> {new CommonTree(new CommonToken(CHUNK, $text))}
;
...
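The effect of collapsing a run of single-character CHUNK tokens into one node can also be pictured without the ANTLR runtime. The following is only an illustrative sketch in plain Java: the class ChunkMergeSketch, the mergeChunks method and the {type, text} string pairs are hypothetical stand-ins, not ANTLR's CommonToken or CommonTree.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of what collapsing CHUNK+ achieves; no ANTLR runtime involved.
final class ChunkMergeSketch {

    // Tokens are {type, text} pairs. Adjacent CHUNK tokens are joined into one.
    static List<String[]> mergeChunks(List<String[]> tokens) {
        List<String[]> merged = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String[] token : tokens) {
            if (token[0].equals("CHUNK")) {
                run.append(token[1]);
            } else {
                flush(run, merged);
                merged.add(token);
            }
        }
        flush(run, merged);
        return merged;
    }

    // Emits the pending run of CHUNK text, if any, as a single CHUNK token.
    private static void flush(StringBuilder run, List<String[]> out) {
        if (run.length() > 0) {
            out.add(new String[] {"CHUNK", run.toString()});
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String[]> tokens = new ArrayList<>();
        for (char c : "{ } $foo ".toCharArray()) {
            tokens.add(new String[] {"CHUNK", String.valueOf(c)});
        }
        tokens.add(new String[] {"BEGIN_VIDEO", "$video{"});
        for (String[] token : mergeChunks(tokens)) {
            System.out.println(token[0] + " [" + token[1] + "]");
        }
    }
}
```

In the grammar itself the rewrite rule does this work; the sketch only shows why downstream code then sees one CHUNK per run of ordinary text rather than one per character.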