How do I parse PDF strings with nested string delimiters in ANTLR?

Question

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:

A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.

Example 1:

The following are valid literal strings: 
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)

It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.

lexer grammar PdfStringLexer;

Tj: 'Tj' ;
TJ: 'TJ' ;

NULL: 'null' ;

BOOLEAN: ('true'|'false') ;

LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;

NUMBER: ('+' | '-')? (INT | FLOAT) ;

NAME: '/' ID ;

// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ; 

// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Fa-f]+ '>' ;

fragment INT: DIGIT+ ; // match 1 or more digits

fragment FLOAT:  DIGIT+ '.' DIGIT*  // match 1. 39. 3.14159 etc...
     |         '.' DIGIT+  // match .1 .14159
     ;

fragment DIGIT:   [0-9] ;        // match single digit

// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

mode STR;

LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ; 
TEXT : . -> more ;

parser grammar PdfStringParser;

options { tokenVocab=PdfStringLexer; } 

array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
    : NULL
    | array
    | dictionary
    | BOOLEAN
    | NUMBER
    | string
    | NAME
    ;

content : stat* ;

stat
    : tj
    ;

tj: ((string Tj) | (array TJ)) ; // Show text

When I process this file:

(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj

I get this error and parse tree:

line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'

So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
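
The error message is consistent with pushMode stacking the same mode more than once. A likelier culprit is that `LITERAL_STRING : ')' -> popMode` emits a token at every closing parenthesis, nested or not. The Python sketch below is only a hand simulation of the three STR-mode rules, not ANTLR itself; `simulate_str_mode` is a hypothetical helper, and straight apostrophes are used in the sample input for simplicity.

```python
def simulate_str_mode(text):
    """Hand-simulate the '(' / ')' / TEXT rules from the lexer grammar above.

    This is not ANTLR: it models only the mode stack and the `more` command,
    and it ignores every other default-mode rule (Tj, NUMBER, WS, ...).
    """
    tokens = []
    mode_stack = []   # one entry per pushMode(STR)
    pending = ""      # text accumulated so far by `more`
    for ch in text:
        if not mode_stack:
            if ch == "(":                 # OPEN_PAREN: '(' -> more, pushMode(STR)
                pending += ch
                mode_stack.append("STR")
            continue                      # other default-mode input is skipped here
        pending += ch
        if ch == "(":                     # STRING_OPEN_PAREN: more, pushMode(STR)
            mode_stack.append("STR")
        elif ch == ")":                   # LITERAL_STRING: popMode AND emit a token
            mode_stack.pop()
            tokens.append(pending)
            pending = ""
        # any other character: TEXT -> more (keep accumulating)
    return tokens


tokens = simulate_str_mode("((What's So Funny 'Bout) Peace, Love, and Understanding) Tj")
for token in tokens:
    print(repr(token))
```

In this simulation the nested string comes out as two tokens, and the second one is the ` Peace, Love, and Understanding)` that the parser reports as extraneous. If that reading is right, a mode-based fix would need nested closing parentheses to use `more` as well, so that only the outermost `)` emits a token.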

I left out the instructions regarding escape sequences within the string:

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.

Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal).

An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.

Example 2:

(These \
two strings \
are the same.)
(These two strings are the same.)

Example 3:

(This string has an end-of-line at the end of it. 
)
(So does this one.\n)

Should I use this STRING definition:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;

without modes, and deal with escape sequences in my code? Or should I create a lexer mode for strings and deal with escape sequences in the grammar?

Answer

You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;
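
To make the recursion in this rule concrete, here is a minimal Python sketch of the same matching logic outside ANTLR. `match_string` is a hypothetical helper name, and like the rule it mirrors, it does no escape handling.

```python
def match_string(data, start=0):
    """Return the index just past the balanced literal string at `start`.

    Mirrors  '(' ( ~[()]+ | STRING )* ')'  with no escape handling.
    Raises ValueError if data[start] is not '(' or the string never closes.
    """
    if start >= len(data) or data[start] != "(":
        raise ValueError("expected '(' at index %d" % start)
    pos = start + 1
    while pos < len(data):
        ch = data[pos]
        if ch == ")":            # closes the current level
            return pos + 1
        if ch == "(":            # nested STRING: recurse, resume after it
            pos = match_string(data, pos)
        else:                    # ~[()] : any other character
            pos += 1
    raise ValueError("unterminated literal string")


line = "((What's So Funny 'Bout) Peace, Love, and Understanding) Tj"
end = match_string(line)
print(line[:end])
```

The whole nested string is consumed as one unit, which is what the single STRING token gives the parser.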

And with escape sequences, you could try:

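STRING
 : '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
 ;

fragment ESCAPE_SEQUENCE
 : '\\' ( [nrtbf()\\]            // named escapes from Table 3
        | [0-7] [0-7]? [0-7]?   // octal character code, one to three digits
        | .                     // line continuation (Example 2) or an unknown escape
        )
 ;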
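
A lexer rule like this only recognizes the string; the escape sequences still have to be decoded in code afterwards. The following Python sketch shows one way that could look, following the rules quoted in the question. `decode_literal_string` is a hypothetical helper that assumes the token text is available as raw bytes with the outer parentheses included. It also assumes an octal escape may have one to three digits and that high-order overflow is dropped, which is my reading of the PDF specification rather than something stated above.

```python
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}
OCTAL_DIGITS = b"01234567"
BACKSLASH, CR, LF = 0x5C, 0x0D, 0x0A


def decode_literal_string(raw):
    """Decode a PDF literal string token (bytes, outer parens included)."""
    body = raw[1:-1]                      # drop the outer '(' and ')'
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte == BACKSLASH:
            i += 1
            if i >= len(body):
                break                     # lone trailing backslash: ignore it
            nxt = body[i]
            if nxt in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[nxt])
                i += 1
            elif nxt in OCTAL_DIGITS:
                j = i
                while j < len(body) and j - i < 3 and body[j] in OCTAL_DIGITS:
                    j += 1
                out.append(int(body[i:j].decode("ascii"), 8) & 0xFF)
                i = j
            elif nxt in (CR, LF):         # backslash + end-of-line: line continuation
                i += 2 if body[i:i + 2] == b"\r\n" else 1
            # otherwise: unknown escape, the backslash is ignored and the
            # following byte is handled as ordinary data on the next pass
        elif byte in (CR, LF):            # bare end-of-line becomes a single 0Ah
            out.append(LF)
            i += 2 if body[i:i + 2] == b"\r\n" else 1
        else:
            out.append(byte)
            i += 1
    return bytes(out)


print(decode_literal_string(b"(These \\\ntwo strings \\\nare the same.)"))
```

Balanced inner parentheses need no work here because they are ordinary data once the token boundaries are known.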
