Antlr and PL/I grammar


Question

Right now we would like to have grammars for PL/I and COBOL based on ANTLR4. Can anyone provide these grammars? If not, can you please share your thoughts or experience on developing them from scratch? Thanks.

Recommended answer

I assume you mean IBM PL/I and COBOL. (There are not many other PL/Is around, but I don't think that really changes the answer much.)

The obvious place to look for mature ANTLR grammars is the ANTLR3 grammar library; there are no PL/I or COBOL grammars there. The ANTLR V4 main page (V4 being a very new, radical, backwards-incompatible reengineering of ANTLR3) talks about Java and C#, with no hint of PL/I or COBOL; given its newness, that is no surprise. If you are really lucky, somebody may have a grammar they are willing to give you and will speak up.

Developing such grammars is difficult for several reasons. (This is based on personal experience building production-quality parsers for these two specific languages, using a very strong parser system different from ANTLR; see my bio for more details.)

  • The character set and column layout rules may be an issue (in COBOL's reference format, columns 1-6, 7 and 73-80 are special). The languages you describe were historically written in EBCDIC in 80-column punch-card format, with no line-break characters between lines. Translation to ASCII sometimes produces nasty glitches: the ASCII end-of-line character is occasionally found in the middle of COBOL literal strings as a binary value, but because it has exactly the same code in EBCDIC and ASCII, after translation it will be (and appear to be) an ASCII newline character. Character strings can also be long and split across multiple lines, yet columns 73-80 by definition have to be ignored. Column 7 may contain a "D" character, which affects whether source lines are interpreted as debugging lines or not. This means you need to get 80-column processing right. I don't know what ANTLR offers to support processing characters by column area. You'll also need to worry about DBCS encoding of string literals, and variations of that if the source code is used in non-English-speaking countries such as Japan.
  • These languages are large and complex; IBM has had 40 years to decorate them with cruft. The IBM COBOL manual is some 600 pages, and then you discover that COBOL also includes a Report Writer, which is another 600-page document. Capturing all the nuances of the lexical tokens and the grammar rules will take effort, and you have to do it from the IBM manuals, which don't contain nice BNF-style descriptions; that means guessing from the textual description and a few examples. For COBOL, expect several thousand grammar rules; PL/I is less complicated in the abstract. Expect a certain amount of "lies": we've encountered a number of places where the reference documentation clearly says certain things are not legal, and yet the IBM compilers accept them (judging by real, running source code), and vice versa. The only way to find these is by empirical experiment.
  • Both languages have constructs that are difficult to parse, e.g., ones requiring arbitrary lookahead and/or involving local ambiguity. From my understanding, ANTLR4 is much better than ANTLR3 on these, but that doesn't mean these aspects will be easy. PL/I is particularly nasty in this regard: it has no reserved keywords, but hundreds of keywords-in-context. To resolve these, one has to get the lexer and the parser to cooperate, and even then there may be many locally ambiguous parses. ANTLR3 doesn't handle these well; ANTLR4 is supposed to be better, but I don't know how it handles this, if it does at all.
  • To verify that these parsers are right, you will need to run them on millions of lines of code (which means you need access to such code samples) and correct any errors you find. This takes a long time: in our case, several years of more or less continuous work and improvement to get production-quality grammars that work on large code bases. You might be miraculously faster than this; good luck.
  • You need to build a preprocessor for COBOL (COPY ... REPLACING), whose details are poorly documented, and eventually another one for PL/I (whose preprocessor I understand to be fully Turing-capable).
  • After you build a parser, you need to capture a syntax tree. Here ANTLR4 is supposed to be pretty good, in that it will capture one for the grammar you give it. That may or may not be the AST you want; with several thousand grammar rules, I'd expect not. ANTLR3 requires you to add, manually, indications of where and how to form the AST.
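As a rough illustration of the 80-column issue in the first bullet, here is a minimal Python sketch of the kind of pre-pass one might run before handing text to any parser generator. It is not part of the original answer and not an ANTLR feature. It assumes the standard 1-based COBOL reference format (columns 1-6 sequence area, column 7 indicator, columns 8-72 program text, columns 73-80 identification area) and the `cp037` EBCDIC code page; the names `CardLine`, `records_from_ebcdic`, `split_card` and `program_text` are hypothetical. Continuation lines (indicator `-`) and DBCS literals are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class CardLine:
    sequence: str        # columns 1-6
    indicator: str       # column 7
    text: str            # columns 8-72
    identification: str  # columns 73-80


def records_from_ebcdic(data: bytes, width: int = 80, codec: str = "cp037"):
    """Cut a byte stream with no line terminators into fixed-width records.

    The code page is an assumption; real sources may use a different one.
    """
    for start in range(0, len(data), width):
        yield data[start:start + width].decode(codec)


def split_card(line: str) -> CardLine:
    """Split one source line into its column areas, padding short lines."""
    padded = line.rstrip("\r\n").ljust(80)[:80]
    return CardLine(padded[0:6], padded[6], padded[7:72], padded[72:80])


def program_text(lines, debugging: bool = False):
    """Yield only the program text area of each line.

    Comment lines ('*' or '/') are dropped, and 'D' lines are kept only
    when debugging is enabled. Continuation lines are not handled here.
    """
    for raw in lines:
        card = split_card(raw)
        if card.indicator in ("*", "/"):
            continue
        if card.indicator in ("D", "d") and not debugging:
            continue
        yield card.text.rstrip()
```

Cutting records by width before decoding avoids depending on line terminators at all, which sidesteps the stray-newline glitch for sources that are still in their original fixed-record form.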

After you get the AST, you'll want to do something with it. This means you will need to build at least symbol tables (mappings from identifier instances to their declarations and any related type information). AFAIK, ANTLR provides nothing special to support this except for help in walking the ASTs. This, too, is hard to get right: COBOL has crazy rules about how an unqualified identifier reference can be interpreted as referring to a specific data field if there are no other conflicting interpretations. (There's lots more to Life After Parsing if you want good semantic information about the program; see my bio for more details. For each of these semantic aspects you have to develop it, and then, for validation, go back and run it on large code bases again.)
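To make the unqualified-reference rule concrete, here is a small Python sketch of a symbol table that resolves a name only when exactly one declaration matches. It is a deliberately simplified model, not the full COBOL rule set and not anything ANTLR provides: data items are represented as tuples of names from the outermost group down to the item, and `SymbolTable`, `declare` and `resolve` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def _qualifies(ancestors, qualifiers):
    """True if the qualifiers (innermost first) appear, in order, among the
    item's enclosing groups when walking outward from the item."""
    remaining = iter(reversed(ancestors))
    # 'q in remaining' consumes the iterator up to the match, so the order
    # of the qualifiers is respected.
    return all(q in remaining for q in qualifiers)


class SymbolTable:
    def __init__(self):
        self._by_name = defaultdict(list)  # item name -> list of full paths

    def declare(self, path):
        """Record a data item, e.g. ("CUSTOMER", "ADDRESS", "CITY")."""
        path = tuple(part.upper() for part in path)
        self._by_name[path[-1]].append(path)

    def resolve(self, name, qualifiers=()):
        """Resolve NAME [OF q1 [OF q2 ...]] to exactly one declaration."""
        wanted = [q.upper() for q in qualifiers]
        candidates = [
            path for path in self._by_name.get(name.upper(), [])
            if _qualifies(path[:-1], wanted)
        ]
        if len(candidates) == 1:
            return candidates[0]
        if not candidates:
            raise LookupError(f"no declaration matches {name}")
        raise LookupError(f"ambiguous reference to {name}: {candidates}")
```

With `("CUSTOMER", "NAME")` and `("SUPPLIER", "NAME")` both declared, `resolve("NAME")` raises an ambiguity error while `resolve("NAME", ["SUPPLIER"])` succeeds; a bare `resolve("CITY")` succeeds as long as only one `CITY` exists anywhere.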

TL;DR

Building parsers (well, "front ends") for these languages is a lot of work no matter what parsing engine you choose. That likely explains why they aren't already in ANTLR's grammar zoo.
