LR(1)语法的状态,符号和规则的数量的合理上限是什么? [英] What's are reasonable upper bounds for the number of states, symbols, and rules of LR(1) grammars?

查看:242
本文介绍了LR(1)语法的状态,符号和规则的数量的合理上限是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在制作一个LR(1)解析器,我在各种地方遇到了性能瓶颈。

I'm making an LR(1) parser, and I've run across performance bottlenecks in various places.

我想尝试优化数据结构,但为了这样做,我需要一个粗略的想法,有多少状态,规则和终端符号是合理的(可能是复杂的)计算机语言,如C ++。

I'd like to try optimizing the the data structures for the parser, but in order to do so, I need a rough idea of how many states, rules, and terminal symbols are reasonable for (possibly complicated) computer languages, like C++.

我的猜测是,复杂语言的典型语法将具有:

My guesses are that a typical grammar for a complicated language would have:


  • 100个终端符号

  • ≤每个作品50个符号

  • ≤ 2,000条规则

  • ≤ 10,000个州

  • ≤ 100 terminal symbols
  • ≤ 50 symbols per production
  • ≤ 2,000 rules
  • ≤ 10,000 states

但我真的不知道他们是多么正确。

but I really don't know how correct they are.

请注意,我假设每个规则的形式是 nonterminal 符号 ...,因此单个化合物规则看起来像 foo: (bar | baz)+ 可能实际上包含5个规则,而不只是1个规则。

Note that I assume each rule is of the form nonterminalsymbol symbol symbol..., so a single compound "rule" that looks like foo: (bar | baz)+ might actually consist of, say, 5 rules, instead of just 1 rule.

推荐答案

我开发的DMS系统每天处理一个生产IBM Enterprise COBOL前端

The DMS system I developed daily processes a production IBM Enterprise COBOL front end grammar in about 7 seconds on a crummy laptop (measured just now on that laptop).

语法约有500个终端和2500个产品,平均约2.5个令牌
每次生产。我们的产品正如你所描述的(没有EBNF,只是不买足够的东西,是的,我们是大的DSL粉丝。有时,geegaws人放在DSL不值得)。解析器生成器产生3800个状态。 (刚刚测量的这些值)。

The grammar has about 500 terminals and 2500 productions, averaging about 2.5 tokens per production. Our productions are exactly as you describe them (no EBNF, just doesn't buy enough to matter, and yes, we're big DSL fans. Sometimes the geegaws people put in a DSL aren't worth it). The parser generator produces 3800 states. (These values measured just now, too).

DMS有一个完整的C ++ 11语法,有许多额外的东西来处理GCC和MS方言,以及OpenMP 。语法有457个终端,约3000个产品,每生产平均约2.3个令牌。解析器生成器产生5800个状态。需要稍长时间生成:11秒,在i7。你可能会发现令人惊讶的是,它需要
几十秒来生成词法分析器(真正多个词法分析器);在C ++ 11中有比你想象的更多的奇怪。

DMS has a full C++11 grammar with lots of extra stuff to handle GCC and MS dialects, as well as OpenMP. The grammar has 457 terminals, some 3000 productions, some 2.3 tokens per production average. The parser generator produces 5800 states. Takes somewhat longer to generate: 11 seconds, on an i7. What you might find surprising is that it takes some tens of seconds to generate the lexer (really multiple lexers); there's a lot more lexing weirdness in C++11 than you'd expect.

生成器是我们自己实现的GLR生成器。

The generator is a GLR generator of our own implementation.

我们没有做很多事情来优化生成时间。它可能加快10倍或更多;我们不进行复杂的循环检测优化,如在大多数论文中建议LR解析器生成。结果是生成表需要更长的时间,但在功能上没有任何损失。我们从来没有足够的动力来做这个优化,因为语言前端还有很多其他的事情需要关心解析器表生成时间。

We didn't do a lot to optimize the generation time. It likely can be sped up by a factor of 10 or more; we don't do sophisticated cycle detection optimization as suggested in most papers on LR parser generation. The consequence is that it takes longer to generate the tables but nothing is lost in functionality. We've never had enough motivation to do that optimization, because there are so many other things to do with a language front end than worry about the parser table generation time.

我怀疑数据结构很重要,如果设计合理。我们不用担心规则,项目集或状态的大小;我们只是使用动态数组,他们照顾自己。

I doubt the data structures matter a lot, if designed reasonably. We don't worry much about sizes of rules, item sets or states; we just use dynamic arrays and they take care of themselves. We do pack lookaheads into dense bit vectors.

作为额外的背景数据,你可能会发现这篇文章很有用: Tiago Alves和Joost Visser,SDF Grammars的制作。技术报告,DI-Research.PURe-05.05.01,Departamento deInformática,Universidade do Minho,May 2005.

As additional background data, you'll probably find this paper useful: Tiago Alves and Joost Visser, Metrication of SDF Grammars. Technical Report, DI-Research.PURe-05.05.01, Departamento de Informática, Universidade do Minho, May 2005.

解析器生成器不是你有一个困难的时间与语法。对于特定的实现,它正在获得语法规则

The parser generator isn't where you have a difficult time with grammars. It is getting the grammar rules right for the specific implementations.

这篇关于LR(1)语法的状态,符号和规则的数量的合理上限是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆