Antlr and PL/I grammar

Question

Right now we would like to have grammars for PL/I and COBOL based on Antlr4. Does anyone provide these grammars? If not, can you please share your thoughts/experience on developing these grammars from scratch? Thanks.

Recommended answer

I assume you mean IBM PL/I and COBOL. (There are not many other PL/Is around, but I don't think that really changes the answer much.)

The obvious place to look for mature ANTLR grammars is the ANTLR3 grammar library; there are no PL/I or COBOL grammars there. The ANTLR4 main page (ANTLR4 being a very new, radical, backwards-incompatible reengineering of ANTLR3) talks about Java and C#; there is no hint of PL/I or COBOL there, which is no surprise given its newness. If you are really lucky, somebody may have one they are willing to give you and will speak up.

Developing such grammars is difficult for several reasons (based on personal experience building production-quality parsers for these two specific languages, using a very strong parser system different from ANTLR [see my bio for more details]):

  • The character set and column layout rules (in fixed-format COBOL, columns 1-6, 7 and 73-80 are special) may be an issue: the languages you describe have historically been written in EBCDIC, in 80-column punch-card format, without line break characters between lines. Translation to ASCII sometimes produces nasty glitches; the ASCII end-of-line character is occasionally found in the middle of COBOL literal strings as a binary value, but because it has the same exact code in EBCDIC and ASCII, after translation it will be (and appear to be) an ASCII newline character. Character strings can also be long and split across multiple lines, yet columns 73-80 by definition have to be ignored. Column 7 may contain a "D" character, which affects whether the source line is treated as a "debug" line or not. This means you need to get 80-column processing right. I don't know what ANTLR offers to support processing characters in column areas. You'll also need to worry about DBCS encoding of string literals, and variations of that, if the source code is used in non-English-speaking countries such as Japan.
  • These languages are large and complex; IBM has had 40 years to decorate them with cruft. The IBM COBOL manual is some 600 pages ... then you discover that COBOL also includes a Report Writer, which is another 600-page document. Capturing all the nuances of the lexical tokens and the grammar rules will take effort, and you have to do that from the IBM manuals, which don't contain nice BNF-style descriptions; that means guessing from the textual description and some examples. For COBOL, expect several thousand grammar rules; PL/I is less complicated in the abstract. Expect a certain amount of "lies": we've encountered a number of places where the reference documentation clearly says certain things are not legal, and yet the IBM compilers (judging by real, running source code) accept them, and vice versa. The only way to find these is by empirical experiment.
  • Both languages have constructs that are difficult to parse, e.g., ones requiring arbitrary lookahead and/or producing local ambiguity. From my understanding, ANTLR4 is much better than ANTLR3 at these, but that doesn't mean these aspects will be easy. PL/I is particularly nasty in this regard: it has no reserved keywords, but hundreds of keywords-in-context. To resolve these, one has to get the lexer and the parser to cooperate, and even then there may be many locally ambiguous parses. ANTLR3 doesn't do these well; ANTLR4 is supposed to be better, but I don't know how it handles this, if it does at all.
  • To verify that these parsers are right, you will need to run them on millions of lines of code (which means you have to have access to such code samples), and correct any errors you find. This takes a long time (in our case, several years of more or less continuous work/improvement to get production-quality grammars that work on large code bases). You might be miraculously faster than this; good luck.
  • You need to build a preprocessor for COBOL (COPY ... REPLACING), whose details are poorly documented, and eventually another one for PL/I (whose preprocessor I understand to be fully Turing-capable).
  • After you build a parser, you need to capture a syntax tree; here ANTLR4 is supposed to be pretty good, in that it will capture one for the grammar you give it. That may or may not be the AST you want; with several thousand grammar rules, I'd expect not. ANTLR3 requires you to add, manually, indications of where and how to form the AST.
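
To make the column-handling point concrete, here is a minimal Python sketch of splitting one fixed-format COBOL card image into its areas before any lexing happens. It is an illustration, not part of any ANTLR API: the names `CardLine`, `split_card` and `classify` are hypothetical, and it assumes the standard fixed reference format (sequence area in columns 1-6, indicator in column 7, program text in columns 8-72, identification area in columns 73-80). It ignores EBCDIC, DBCS and literal continuation entirely.

```python
from dataclasses import dataclass


@dataclass
class CardLine:
    sequence: str   # columns 1-6: sequence number area
    indicator: str  # column 7: '*', '/', '-', 'D' or blank
    text: str       # columns 8-72: the part a lexer should see
    ident: str      # columns 73-80: identification area, ignored by the compiler


def split_card(line: str) -> CardLine:
    """Split one source line into fixed-format areas (assumed layout, see above)."""
    # Pad short lines to 80 columns and truncate longer ones.
    padded = line.rstrip("\r\n").ljust(80)[:80]
    return CardLine(padded[0:6], padded[6], padded[7:72], padded[72:80])


def classify(card: CardLine) -> str:
    """Decide how the line should be treated, based only on the indicator column."""
    ind = card.indicator
    if ind in "*/":
        return "comment"
    if ind in "Dd":
        return "debug"
    if ind == "-":
        return "continuation"
    return "code"
```

A front end built this way would feed only `text` to the lexer for "code" lines, decide separately whether "debug" lines are active, and splice "continuation" lines onto the previous line before tokenizing.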

After you get the AST, you'll want to do something with it. This means you will need to build at least symbol tables (mappings from identifier instances to their declarations and any related type information). AFAIK, ANTLR provides nothing special to support this except for support in walking the ASTs. This, too, is hard to get right: COBOL has crazy rules about how an unqualified identifier reference can be interpreted as referring to a specific data field if there are no other conflicting interpretations. (There's lots more to Life After Parsing if you want to have good semantic information about the program; see my bio for more details. For each of these semantic aspects you have to develop them and then, for validation, go back and run them on large code bases again.)
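
As a rough illustration of the lookup problem, the following Python sketch resolves a COBOL-style reference such as `NAME OF VENDOR` against a table of declared data items. It is a deliberate simplification under stated assumptions: each data item is represented as its path of group names from the outermost group down to the item itself, a reference is accepted only when exactly one declared item matches, and the helper names `build_index` and `resolve` are hypothetical. The real COBOL rules have more cases than this.

```python
from typing import Dict, List, Sequence, Tuple

# A data item is identified by its path from the outermost group to itself,
# e.g. ("CUSTOMER", "NAME") for NAME declared inside group CUSTOMER.
Path = Tuple[str, ...]


def build_index(paths: List[Path]) -> Dict[str, List[Path]]:
    """Index declared items by their own (innermost) name."""
    index: Dict[str, List[Path]] = {}
    for path in paths:
        index.setdefault(path[-1], []).append(path)
    return index


def _matches(path: Path, qualifiers: Sequence[str]) -> bool:
    """Check that the qualifiers (innermost first) appear, in order, among the ancestors."""
    ancestors = path[:-1]
    pos = len(ancestors)
    for qualifier in qualifiers:
        # Search outward from the previously matched ancestor.
        for i in range(pos - 1, -1, -1):
            if ancestors[i] == qualifier:
                pos = i
                break
        else:
            return False
    return True


def resolve(index: Dict[str, List[Path]], name: str,
            qualifiers: Sequence[str] = ()) -> Path:
    """Return the unique declaration for `name OF q1 OF q2 ...`, or raise LookupError."""
    candidates = [p for p in index.get(name, []) if _matches(p, qualifiers)]
    if not candidates:
        raise LookupError(f"undefined reference: {name}")
    if len(candidates) > 1:
        raise LookupError(f"ambiguous reference: {name}")
    return candidates[0]
```

With `CUSTOMER` and `VENDOR` groups that each contain a `NAME` field, an unqualified `NAME` would be rejected as ambiguous, while `NAME OF VENDOR` or an unqualified field that exists only once would resolve.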

TL;DR

Building parsers (well, "front ends") for these languages is a lot of work no matter what parsing engine you choose. That likely explains why they aren't already in ANTLR's grammar zoo.
