How do I parse PDF strings with nested string delimiters in ANTLR?

Question

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:

A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.

Example 1:

The following are valid literal strings: 
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)

It seems like pushing lexer modes onto a stack would be the way to handle this. Here's a stripped-down version of my lexer and parser.

lexer grammar PdfStringLexer;

Tj: 'Tj' ;
TJ: 'TJ' ;

NULL: 'null' ;

BOOLEAN: ('true'|'false') ;

LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;

NUMBER: ('+' | '-')? (INT | FLOAT) ;

NAME: '/' ID ;

// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ; 

// Hexadecimal data enclosed in angle brackets (hex digits only)
HEX_STRING: '<' [0-9A-Fa-f]+ '>' ;

fragment INT: DIGIT+ ; // match 1 or more digits

fragment FLOAT:  DIGIT+ '.' DIGIT*  // match 1. 39. 3.14159 etc...
     |         '.' DIGIT+  // match .1 .14159
     ;

fragment DIGIT:   [0-9] ;        // match single digit

// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

mode STR;

LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ; 
TEXT : . -> more ;


parser grammar PdfStringParser;

options { tokenVocab=PdfStringLexer; } 

array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
    : NULL
    | array
    | dictionary
    | BOOLEAN
    | NUMBER
    | string
    | NAME
    ;

content : stat* ;

stat
    : tj
    ;

tj: ((string Tj) | (array TJ)) ; // Show text

When I process this file:

(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj

I get this error:

line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'

So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?

I left out the instructions regarding escape sequences within the string:

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.

Table 3 lists \n (line feed), \r (carriage return), \t (horizontal tab), \b (backspace, 08h), \f (form feed, FF), \(, \), \\, and \ddd (character code ddd, octal).

An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.

Example 2:

(These \
two strings \
are the same.)
(These two strings are the same.)

Example 3:

(This string has an end-of-line at the end of it. 
)
(So does this one.\n)

Should I use this STRING definition:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;

without modes, and deal with escape sequences in my own code? Or should I create a lexer mode for strings and deal with escape sequences in the grammar?

Answer

You could do this with lexical modes, but in this case they aren't really needed. Lexer rules can be recursive, so you could simply define a lexer rule like this:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;
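To make it easier to see what this recursive rule accepts, here is a rough Python equivalent. It is only a sketch of the same idea, not code generated by ANTLR: the function name `match_literal_string` is made up for illustration, it tracks nesting depth with a counter instead of recursing, and like the rule above it ignores escape sequences.

```python
def match_literal_string(data: str, start: int = 0) -> int:
    """Return the index just past the ')' closing the string opened at `start`.

    Mirrors  STRING : '(' ( ~[()]+ | STRING )* ')'  with a depth counter.
    Escape sequences are deliberately not handled in this sketch.
    """
    if start >= len(data) or data[start] != "(":
        raise ValueError("expected '(' at start")
    depth = 0
    for i in range(start, len(data)):
        ch = data[i]
        if ch == "(":
            depth += 1          # entering a (possibly nested) string
        elif ch == ")":
            depth -= 1          # leaving one nesting level
            if depth == 0:
                return i + 1    # outermost ')' reached
    raise ValueError("unbalanced parentheses")


line = "((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj"
end = match_literal_string(line)
print(line[:end])  # the complete literal string, nested parentheses included
print(line[end:])  # whatever follows the string
```

The point is that the whole parenthesized text ends up in a single token, so the parser sees one string followed by Tj.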

And with escape sequences, you could try:

STRING
 : '(' ( ~[()\\]+ |  ESCAPE_SEQUENCE | STRING )* ')'
 ;

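fragment ESCAPE_SEQUENCE
 : '\\' ( [nrtbf()\\]          // escapes listed in Table 3
        | [0-7] [0-7]? [0-7]?  // \ddd: one to three octal digits
        | .                    // anything else, e.g. a line break (line continuation)
        )
 ;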
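Either way, the lexer only recognizes the token; turning its text into the actual string value still happens in your own code. Below is a minimal Python sketch of that step, based on the rules quoted in the question. The function name `decode_literal_string` is hypothetical, it works on `str` rather than raw bytes for simplicity, and it assumes (per the PDF specification) that `\ddd` may have one to three octal digits with high-order overflow ignored.

```python
def decode_literal_string(token: str) -> str:
    """Decode the text of a STRING token, outer parentheses included."""
    simple = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
              "(": "(", ")": ")", "\\": "\\"}
    body = token[1:-1]              # strip the enclosing '(' and ')'
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            i += 1
            if i >= len(body):
                break               # stray trailing backslash: ignore it
            nxt = body[i]
            if nxt in simple:
                out.append(simple[nxt])
                i += 1
            elif nxt in "01234567":
                j = i               # collect up to three octal digits
                while j < len(body) and j - i < 3 and body[j] in "01234567":
                    j += 1
                out.append(chr(int(body[i:j], 8) & 0xFF))
                i = j
            elif nxt in "\r\n":
                # backslash + end-of-line: line continuation, emit nothing
                i += 2 if body[i:i + 2] == "\r\n" else 1
            else:
                out.append(nxt)     # unknown escape: drop only the backslash
                i += 1
        elif ch == "\r":
            out.append("\n")        # bare CR or CRLF is read as a single LF
            i += 2 if body[i:i + 2] == "\r\n" else 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(decode_literal_string("(These \\\ntwo strings \\\nare the same.)"))
```

Balanced unescaped parentheses inside the string need no special handling here; they are copied through like any other character.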
