Source of parsers for programming languages?

Question

I'm dusting off an old project of mine that calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are, using a very crude algorithm: traverse the file, maintaining a "current depth" and adjusting it whenever an unquoted bracket is encountered; when the depth returns to the level at which a class or method began, consider it exited. However, this procedure has many problems, and a "simple" way of detecting when the depth has changed is not always effective.
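
A minimal sketch of the kind of crude scanner described above, assuming a brace-delimited language. The function name `naive_block_spans` is hypothetical and this is not the project's actual code; it only illustrates why the approach is fragile.

```python
def naive_block_spans(source: str) -> list[tuple[int, int]]:
    """Return (start_line, end_line) for each top-level {...} block.

    Crude on purpose: it understands quotes and backslash escapes,
    but knows nothing about comments, macros, or other lexical rules.
    """
    spans = []
    depth = 0
    line = 1
    start = None
    quote = None      # the quote character we are inside, if any
    escaped = False   # True when the previous character was a backslash
    for ch in source:
        if ch == "\n":
            line += 1
        if quote:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == "{":
            if depth == 0:
                start = line
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append((start, line))
    return spans


good = 'void f() {\n  if (x) {\n    g("}");\n  }\n}\n'
bad = 'void f() {\n  // unmatched { in a comment\n}\nvoid g() {\n}\n'
print(naive_block_spans(good))
print(naive_block_spans(bad))
```

The second input shows the problem: a single brace inside a comment throws the depth off, so the scanner never sees the depth return to zero and reports no blocks at all.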

To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
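
For one language, Python itself, the standard library already exposes such a parse tree. The sketch below uses the `ast` module to measure class and function lengths; the function name `definition_lengths` is hypothetical, and it assumes Python 3.8 or later, where `ast.parse` fills in `end_lineno`. Other languages would each need their own parser.

```python
import ast


def definition_lengths(source: str) -> list[tuple[str, str, int]]:
    """Return (kind, name, line_count) for every class and function."""
    tree = ast.parse(source)
    results = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            continue
        # lineno is the line of the `class`/`def` keyword (decorators excluded);
        # end_lineno is the last line of the body.
        results.append((kind, node.name, node.end_lineno - node.lineno + 1))
    return results


sample = "class A:\n    def f(self):\n        return '}'\n\ndef g():\n    pass\n"
for entry in sorted(definition_lengths(sample)):
    print(entry)
```

Because the boundaries come from the parser, the brace inside the string literal has no effect on the result. `ast.walk` does not guarantee an order, which is why the usage example sorts the output.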

Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)

Recommended answer

If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other languages; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.

Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not quite production-capable. There's always Bison, which has been around long enough that you'd expect a library of language definitions to be collected somewhere, but I've never seen one.

I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production-quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine, but these are used daily with DMS to carry out mass change tasks on large bodies of code.

I don't know of any others where the set of language definitions is mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer the machinery or the language definitions.

If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to make it work right in most cases (check out Python's, Perl's and PHP's crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
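
The "lexer plus nest-counting" idea can be sketched as follows. This is a deliberately tiny token pattern for a C-like subset, nowhere near an accurate C lexer (no preprocessor, raw strings, trigraphs, or vendor extensions), and the names `TOKEN_RE` and `block_spans` are hypothetical. The point is only that braces are counted after comments and literals have been recognized as whole tokens.

```python
import re

# Order matters: comments and literals are tried before bare braces,
# so a brace inside them is consumed as part of the larger token.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<char>'(?:\\.|[^'\\\n])*')
  | (?P<open>\{)
  | (?P<close>\})
""", re.VERBOSE | re.DOTALL)


def block_spans(source: str) -> list[tuple[int, int, int]]:
    """Return (depth, start_line, end_line) for every {...} block."""
    spans = []
    open_lines = []      # start line of each currently open block
    line, pos = 1, 0
    for match in TOKEN_RE.finditer(source):
        # Advance the line counter to the start of this token.
        line += source.count("\n", pos, match.start())
        pos = match.start()
        if match.lastgroup == "open":
            open_lines.append(line)
        elif match.lastgroup == "close" and open_lines:
            depth = len(open_lines)
            spans.append((depth, open_lines.pop(), line))
    return spans


src = 'void f() {\n  // unmatched { in a comment\n  g("}");\n}\nvoid g() { /* } */ }\n'
print(block_spans(src))
```

Blocks are reported in the order they close, so nested blocks appear before the block that contains them.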

Because DMS has consistently defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same languages. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of code in multiple languages; it works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these language-accurate lexers to use.
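
The general idea of indexing lexemes for lookup can be illustrated with a toy inverted index. This is not how the SCSE is implemented (its internals are not described here); `LEXEME_RE` and `build_index` are hypothetical names, and the single generic pattern stands in for the per-language lexers a real tool would need.

```python
import re
from collections import defaultdict

# Stand-in "lexer": identifiers, integer literals, or any single
# non-space character. A real tool would use a lexer per language.
LEXEME_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def build_index(files: dict[str, str]) -> dict[str, set[tuple[str, int]]]:
    """Map each lexeme to the set of (path, line_number) where it occurs."""
    index = defaultdict(set)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for lexeme in LEXEME_RE.findall(line):
                index[lexeme].add((path, lineno))
    return index


files = {"a.c": "int x = 1;\nx = x + 2;\n", "b.py": "x = 3\n"}
index = build_index(files)
print(sorted(index["x"]))
```

Once the index is built, looking up a lexeme is a dictionary access rather than a scan of every file.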
