What is an LR(2) parser? How does it differ from an LR(1) parser?


Question


I'm familiar with LR(1) parsers, which are usually taught in traditional compilers courses. I know that LR(2) parsers exist, but I've never seen one constructed before.

How is an LR(2) parser constructed? How does it differ from an LR(1) parser?

Solution

In many ways, an LR(2) parser works similarly to an LR(1) parser. It traces out a rightmost derivation in reverse, maintains a stack, executes shift and reduce actions on a stack, has states that consist of sets of LR items, etc. There are, however, a few major differences:

  • An LR(2) parser maintains two tokens of lookahead for each LR item, rather than just a single token of lookahead as in LR(1).
  • The rules for how a shift works differ from the standard rules of an LR(1) parser and require an additional notion of lookahead, called dot lookahead, that is not present in LR(1) parsers.
  • The width of the action table for an LR(2) parser is much larger than that of an LR(1) parser, though, counterintuitively, the goto table is the same width.

To motivate how this works, let's take an example of an LR(2) grammar. (This grammar is derived from one mentioned in @rici's excellent answer to this earlier question).

S → RS | R

R → abT

T → aT | c | ε

To build our LR(2) parser for this grammar, we'll begin, as usual, by augmenting the grammar with a production of the form S' → S:

S' → S

S → RS | R

R → abT

T → aT | c | ε
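
If it helps to see this step as data, here is one way you might encode the grammar in Python and add the new start symbol programmatically. The representation is chosen purely for illustration; nothing about the construction mandates it:

    # Grammar encoded as nonterminal -> list of right-hand sides; [] is the ε-production.
    grammar = {
        "S": [["R", "S"], ["R"]],
        "R": [["a", "b", "T"]],
        "T": [["a", "T"], ["c"], []],
    }

    def augment(grammar, start):
        """Return the augmented grammar with a fresh start symbol start' -> start."""
        new_start = start + "'"
        return {new_start: [[start]], **grammar}, new_start

    augmented, start = augment(grammar, "S")   # adds S' -> S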

Now, we start generating configurating sets. We begin, as with an LR(1) parser, with the production S' → S. This is shown here:

(1)
    S' -> .S  [$$]

Notice here that the lookahead is $$, indicating two copies of the "end-of-stream" marker. In a traditional LR(1) (or SLR(1), or LALR(1)) parser, we'd have a lookahead here of $, just one copy of the end-of-stream marker.
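
As a tiny illustration of what "two tokens of lookahead" means operationally: the input is padded with two end-of-stream markers, and every shift/reduce decision keys on a two-token window of the remaining input. This helper is just a sketch for illustration:

    def lookahead_windows(tokens, k=2, end="$"):
        """All k-token lookahead windows of a token stream padded with k end markers."""
        padded = list(tokens) + [end] * k
        return [tuple(padded[i:i + k]) for i in range(len(tokens) + 1)]

    print(lookahead_windows(["a", "b", "c"]))
    # [('a', 'b'), ('b', 'c'), ('c', '$'), ('$', '$')]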

We now start expanding out the other items in this configurating set. Since we have productions S → RS and S → R, we add these items:

(1)
    S' -> .S  [$$]
    S  -> .R  [$$]  // New
    S  -> .RS [$$]  // New

Now, let's start chasing down what happens next. As in an LR(1) parser, since there are dots before the nonterminal R's here, we need to expand them out. As in LR(1) parsing, as we do so, we'll need to determine what lookaheads to use. We'll begin by expanding out the S -> .R [$$] item, like this:

(1)
    S' -> .S   [$$]
    S  -> .R   [$$]
    S  -> .RS  [$$]
    R  -> .abT [$$]  // New

Next, let's expand out the S -> .RS [$$] option. This is an interesting case. We need to determine what the lookahead will be for the R productions discovered here. In an LR(1) parser, this is found by looking at the FIRST set of the remainder of the production. In an LR(2) parser, because we have two tokens of lookahead, we have to look at the FIRST2 set, which is the generalization of FIRST sets that lists the strings of length two that can appear at the front of the production, rather than the strings of length one that can appear there. In our case, FIRST2(S) = {ab} (do you see why?), so we have the following:

(1)
    S' -> .S   [$$]
    S  -> .R   [$$]
    S  -> .RS  [$$]
    R  -> .abT [$$]
    R  -> .abT [ab]  // New
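
If you're wondering where FIRST2(S) = {ab} comes from, here is a small Python sketch of the usual fixed-point computation of FIRST_k sets, assuming the standard textbook definition with k-truncated concatenation. The grammar encoding is just one convenient choice:

    GRAMMAR = {                         # the augmented grammar from above
        "S'": [["S"]],
        "S":  [["R", "S"], ["R"]],
        "R":  [["a", "b", "T"]],
        "T":  [["a", "T"], ["c"], []],  # [] is the ε-production
    }
    TERMINALS = {"a", "b", "c", "$"}
    K = 2

    def concat_k(xs, ys, k=K):
        """k-truncated concatenation of two sets of terminal tuples."""
        return {(x + y)[:k] for x in xs for y in ys}

    def first_of_sequence(symbols, first):
        """FIRST_K of a sequence of grammar symbols, given FIRST_K of each nonterminal."""
        result = {()}
        for sym in symbols:
            result = concat_k(result, {(sym,)} if sym in TERMINALS else first[sym])
        return result

    # Standard fixed-point iteration: start from empty sets and grow until stable.
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for rhs in prods:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True

    print(first["S"])                                  # {('a', 'b')} -- FIRST2(S)
    print(first_of_sequence(["S", "$", "$"], first))   # {('a', 'b')} -- the [ab] lookahead added above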

At this point, we've finished expanding out the first configurating set. It's now time to think about what we would do if we saw different characters next. Fortunately, in this case that's fairly easy, since the first character of any string produced by this grammar has to be an a. So let's see what happens if we encounter an a:

(2)
    R  -> a.bT [$$]
    R  -> a.bT [ab]

Seems fine so far. Now what happens if we see a b here? That would take us to this spot:

(3)
    R  -> ab.T [$$]
    R  -> ab.T [ab]

There are two LR(2) items here that have dots before a nonterminal, so we need to expand these out. Let's begin by expanding R -> ab.T [$$], giving the following:

(3)
    R  -> ab.T [$$]
    R  -> ab.T [ab]
    T  -> .aT  [$$]  // New
    T  -> .c   [$$]  // New
    T  -> .    [$$]  // New

Next, we'll expand out the production for R -> ab.T [ab]:

(3)
    R  -> ab.T [$$]
    R  -> ab.T [ab]
    T  -> .aT  [$$]
    T  -> .c   [$$]
    T  -> .    [$$]
    T  -> .aT  [ab] // New
    T  -> .c   [ab] // New
    T  -> .    [ab] // New

That fills out this configurating set. This is the first time we've found some reduce items that are complete (here, T -> . with the two different lookaheads). We also have some shift items here. So we have to ask - do we have a shift/reduce conflict or a reduce/reduce conflict here?

Let's begin with reduce/reduce conflicts. As is the case in LR(1) parsing, we have a reduce/reduce conflict when there are two different reduce items (items with the dot at the end) with the same lookaheads. Here, we have two different reduce items, but they have different lookaheads. That means we're fine on the reduce/reduce front.

Now, the interesting case. Do we have any shift/reduce conflicts here? This is where things change a bit from LR(1) parsing. As is the case in LR(1) parsing, we look at all the shift items in our set (items with a dot before a terminal) and all the reduce items in our set (items with the dot at the end). We're looking to see if there are any conflicts:

    T  -> .aT  [$$] // Shift
    T  -> .c   [$$] // Shift
    T  -> .    [$$] // Reduce
    T  -> .aT  [ab] // Shift
    T  -> .c   [ab] // Shift
    T  -> .    [ab] // Reduce

The question, though, is what a shift/reduce conflict looks like here. In an LR(2) parser, we have two tokens of lookahead on which we base our decision of whether to shift or reduce. In the case of the reduce items, it's easy to see what two tokens of lookahead will lead us to reduce - it's the two-character lookahead in the brackets. On the other hand, consider the shift item T -> .c [ab]. What is the two-character lookahead on which we'd shift here? In the case of an LR(1) parser, we'd just say "oh, the dot's before the c, so we shift on c," but that's not sufficient here. Instead, we'd say that the lookahead associated with this shift item is ca, with the c coming from the production itself and the a coming from the first character of the item's lookahead.

Similarly, consider the shift item T -> .aT [$$]. We need two characters of lookahead, and we can see one of them pretty easily (the a after the dot). To get the second, we have to look at what T is capable of producing. There are three productions for T: one that replaces T with ε, one that replaces T with aT, and one that replaces T with c. This means that any string that can be derived from T starts either with a or with c, or is the empty string. As a result, the item T -> .aT [$$] tells us to shift when seeing either ac or aa (what we'd get from the a and the c), or on a$ (what we'd get if we used the production T → ε, then picked up one of the $ characters from the normal lookahead).

More generally, following this same general procedure - using the terminal(s) after the dot, the terminals in the lookahead set for the item, and the characters that can appear at the front of any strings derivable from future nonterminals - we can compute the two-character lookaheads that we use to determine when to shift. In particular, we're left with this:

(3)
    R  -> ab.T [$$]
    R  -> ab.T [ab]
    T  -> .aT  [$$] // Shift  on aa, ac, a$
    T  -> .c   [$$] // Shift  on c$
    T  -> .    [$$] // Reduce on $$
    T  -> .aT  [ab] // Shift  on aa, ac
    T  -> .c   [ab] // Shift  on ca
    T  -> .    [ab] // Reduce on ab

Fortunately, there are no shift/reduce conflicts here, because each two-character lookahead tells us to do something different.
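
If you want to check this mechanically, the rule comes down to two set tests: a reduce/reduce conflict means two reduce items share a lookahead pair, and a shift/reduce conflict means some reduce item's lookahead pair also appears among a shift item's two-character lookaheads. Here is a tiny sketch, with the lookaheads of configurating set (3) transcribed by hand from the listing above:

    shift_lookaheads = {
        "T -> .aT [$$]": {("a", "a"), ("a", "c"), ("a", "$")},
        "T -> .c  [$$]": {("c", "$")},
        "T -> .aT [ab]": {("a", "a"), ("a", "c")},
        "T -> .c  [ab]": {("c", "a")},
    }
    reduce_lookaheads = {
        "T -> .   [$$]": {("$", "$")},
        "T -> .   [ab]": {("a", "b")},
    }

    all_shift = set().union(*shift_lookaheads.values())
    all_reduce = [la for las in reduce_lookaheads.values() for la in las]

    print("shift/reduce:", all_shift & set(all_reduce) or "none")      # none
    print("reduce/reduce:", len(all_reduce) != len(set(all_reduce)))   # False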

Looking past the dot to determine when to shift is something new that appears in LR(k) parsing for k > 1 but not in LR(1) parsing. In LR(1) parsing, we just need to look at the terminal past the dot. In LR(2) parsing, since we need more characters to determine what to do, we have to compute a secondary dot lookahead for each shift item. Specifically, the dot lookahead is determined as follows:

In a production A → α.tω [γ], where t is a terminal, the dot lookahead is the set of all strings of length 2 that can appear at the start of a string derivable from tωγ. Stated differently, a production A → α.tω [γ] has dot lookahead equal to FIRST2(tωγ).
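
As a concrete illustration of that definition, here is a sketch that computes FIRST2(tωγ) for a couple of the shift items from configurating set (3). The FIRST2 values for single symbols are written out by hand for this grammar (they match what the fixed-point sketch earlier produces); the empty tuple () stands for "can derive ε":

    FIRST2 = {
        "a": {("a",)}, "b": {("b",)}, "c": {("c",)}, "$": {("$",)},
        "T": {(), ("c",), ("a",), ("a", "c"), ("a", "a")},
        "R": {("a", "b")},
        "S": {("a", "b")},
    }

    def concat_2(xs, ys, k=2):
        return {(x + y)[:k] for x in xs for y in ys}

    def dot_lookahead(after_dot, item_lookahead):
        """FIRST2 of the symbols after the dot followed by the item's lookahead."""
        result = {()}
        for sym in list(after_dot) + list(item_lookahead):
            result = concat_2(result, FIRST2[sym])
        return result

    print(dot_lookahead("aT", "$$"))   # {('a','a'), ('a','c'), ('a','$')} -- T -> .aT [$$]
    print(dot_lookahead("c", "ab"))    # {('c','a')}                       -- T -> .c  [ab]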

With all of this in mind, we can build the full LR(2) parser and describe the actions associated with each state. The overall LR(2) parser looks like this:

(1)
    S' -> .S   [$$]  // Go to 10
    S  -> .R   [$$]  // Go to 8
    S  -> .RS  [$$]  // Go to 8
    R  -> .abT [$$]  // Shift  on ab, go to (2)
    R  -> .abT [ab]  // Shift  on ab, go to (2)

(2)
    R  -> a.bT [$$]  // Shift  on ba, bc, b$, go to (3)
    R  -> a.bT [ab]  // Shift  on ba, bc,     go to (3)

(3)
    R  -> ab.T [$$] // Go to 7
    R  -> ab.T [ab] // Go to 7
    T  -> .aT  [$$] // Shift  on aa, ac, a$, go to (4)
    T  -> .c   [$$] // Shift  on c$,         go to (5)
    T  -> .    [$$] // Reduce on $$
    T  -> .aT  [ab] // Shift  on aa, ac,     go to (4)
    T  -> .c   [ab] // Shift  on ca,         go to (5)
    T  -> .    [ab] // Reduce on ab

(4)
    T  -> a.T  [$$] // Go to 6
    T  -> a.T  [ab] // Go to 6
    T  -> .    [$$] // Reduce on $$
    T  -> .aT  [$$] // Shift  on aa, ac, a$, go to (4)
    T  -> .c   [$$] // Shift  on c$,         go to (5)
    T  -> .    [ab] // Reduce on ab
    T  -> .aT  [ab] // Shift  on aa, ac,     go to (4)
    T  -> .c   [ab] // Shift  on ca,         go to (5)

(5)
    T  -> c.   [$$] // Reduce on $$
    T  -> c.   [ab] // Reduce on ab

(6)
    T  -> aT.  [$$] // Reduce on $$ 
    T  -> aT.  [ab] // Reduce on ab

(7)
    R  -> abT. [$$] // Reduce on $$
    R  -> abT. [ab] // Reduce on ab

(8)
    S  -> R.   [$$] // Reduce on $$
    S  -> R.S  [$$] // Go to 9
    S  -> .RS  [$$] // Go to 8
    S  -> .R   [$$] // Go to 8
    R  -> .abT [$$] // Shift  on ab, go to (2)
    R  -> .abT [ab] // Shift  on ab, go to (2)

(9)
    S  -> RS.  [$$] // Reduce on $$

(10)
    S' -> S.   [$$] // Accept on $$

So now we have our LR(2) parser for this grammar. All that's left to do now is to encode this as an action and goto table, along the lines of what we do for an LR(1) parser.

As you might expect, the action table for an LR(2) parser differs from that of the action table for an LR(1) parser in that it's keyed by lookaheads given by two characters rather than just one character. This means that the action table will be much larger for an LR(2) parser than an LR(1) parser. Here's what that looks like here:

 state | aa | ab | ac | a$ | ba | bb | bc | b$ | ca | cb | cc | c$ | $$ 
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   1   |    | S  |    |    |    |    |    |    |    |    |    |    |
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   2   |    |    |    |    | S  |    | S  | S  |    |    |    |    |
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   3   | S  | R  | S  | S  |    |    |    |    | S  |    |    | S  | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   4   | S  | R  | S  | S  |    |    |    |    | S  |    |    | S  | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   5   |    | R  |    |    |    |    |    |    |    |    |    |    | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   6   |    | R  |    |    |    |    |    |    |    |    |    |    | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   7   |    | R  |    |    |    |    |    |    |    |    |    |    | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   8   |    | S  |    |    |    |    |    |    |    |    |    |    | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   9   |    |    |    |    |    |    |    |    |    |    |    |    | R
 ------+----+----+----+----+----+----+----+----+----+----+----+----+----
   10  |    |    |    |    |    |    |    |    |    |    |    |    | A

As you can see, each entry here just says whether to shift or reduce. In practice, each reduce item would be tagged with which production you'd actually do the reduction for, but, um, I was too lazy to type that out.

For an LR(1) parser, you'd typically combine this table with the "goto" table, which says where to go after seeing each symbol. That's due to a fortuitous coincidence. In an LR(1) parser, the size of the lookahead is 1, which happens to align with the fact that the goto table says where you're supposed to transition to after seeing the next character. In an LR(2) parser, the decision about whether to shift or reduce depends on two characters of lookahead, but the choice of where to go next after reading one more character of the input only depends on a single character. That is, even though you have two tokens of lookahead to decide what to do, you only shift one character over at a time. This means that the goto table for an LR(2) parser looks a lot like the goto table for an LR(0) or LR(1) parser, and is shown here:

 state |  a  |  b  |  c  |  $  |  S  |  R  |  T
-------+-----+-----+-----+-----+-----+-----+-----
   1   |  2  |     |     |     |  10 |  8  |
-------+-----+-----+-----+-----+-----+-----+-----
   2   |     |  3  |     |     |     |     |
-------+-----+-----+-----+-----+-----+-----+-----
   3   |  4  |     |  5  |     |     |     |  7
-------+-----+-----+-----+-----+-----+-----+-----
   4   |  4  |     |  5  |     |     |     |  6
-------+-----+-----+-----+-----+-----+-----+-----
   5   |     |     |     |     |     |     |
-------+-----+-----+-----+-----+-----+-----+-----
   6   |     |     |     |     |     |     |
-------+-----+-----+-----+-----+-----+-----+-----
   7   |     |     |     |     |     |     |
-------+-----+-----+-----+-----+-----+-----+-----
   8   |  2  |     |     |     |  9  |  8  |
-------+-----+-----+-----+-----+-----+-----+-----
   9   |     |     |     |     |     |     |
-------+-----+-----+-----+-----+-----+-----+-----
   10  |     |     |     |     |     |     |
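
To see how the two tables cooperate, here is one way to transcribe the state listing above into a table-driven LR(2) parse loop in Python. This is only a sketch: the entries are transcribed by hand from the states above, and each reduce entry is tagged with the production it reduces by (which the action table above leaves out):

    PRODUCTIONS = {                      # production -> (left-hand side, length of rhs)
        "S -> RS": ("S", 2), "S -> R": ("S", 1), "R -> abT": ("R", 3),
        "T -> aT": ("T", 2), "T -> c": ("T", 1), "T -> ε":  ("T", 0),
    }
    ACTION = {   # ACTION[state][two-token lookahead] = "shift", "accept", or a production
        1:  {("a", "b"): "shift"},
        2:  {("b", "a"): "shift", ("b", "c"): "shift", ("b", "$"): "shift"},
        3:  {("a", "a"): "shift", ("a", "c"): "shift", ("a", "$"): "shift",
             ("c", "$"): "shift", ("c", "a"): "shift",
             ("$", "$"): "T -> ε", ("a", "b"): "T -> ε"},
        4:  {("a", "a"): "shift", ("a", "c"): "shift", ("a", "$"): "shift",
             ("c", "$"): "shift", ("c", "a"): "shift",
             ("$", "$"): "T -> ε", ("a", "b"): "T -> ε"},
        5:  {("$", "$"): "T -> c",   ("a", "b"): "T -> c"},
        6:  {("$", "$"): "T -> aT",  ("a", "b"): "T -> aT"},
        7:  {("$", "$"): "R -> abT", ("a", "b"): "R -> abT"},
        8:  {("a", "b"): "shift",    ("$", "$"): "S -> R"},
        9:  {("$", "$"): "S -> RS"},
        10: {("$", "$"): "accept"},
    }
    GOTO = {     # GOTO[state][single symbol], for terminals and nonterminals alike
        1: {"a": 2, "S": 10, "R": 8},
        2: {"b": 3},
        3: {"a": 4, "c": 5, "T": 7},
        4: {"a": 4, "c": 5, "T": 6},
        8: {"a": 2, "S": 9, "R": 8},
    }

    def parse(tokens, k=2):
        tokens = list(tokens) + ["$"] * k            # pad with two end markers
        stack, pos = [1], 0                          # stack of states
        while True:
            state, lookahead = stack[-1], tuple(tokens[pos:pos + k])
            act = ACTION[state].get(lookahead)
            if act is None:
                return False                         # no entry: parse error
            if act == "accept":
                return True
            if act == "shift":                       # decide on two tokens...
                stack.append(GOTO[state][tokens[pos]])
                pos += 1                             # ...but consume only one
            else:                                    # reduce by the named production
                lhs, rhs_len = PRODUCTIONS[act]
                del stack[len(stack) - rhs_len:]
                stack.append(GOTO[stack[-1]][lhs])

    print(parse("abc"))    # True
    print(parse("ababc"))  # True  (two R's: "ab" and then "abc")
    print(parse("abca"))   # False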

So, to summarize:

  • The LR(2) parser uses two tokens of lookahead for each LR item. This means that we need to work with FIRST2 sets rather than FIRST sets, and introduces some new complexity when determining whether to shift or to reduce.
  • LR(2) parses have dot lookahead. For each shift item, we use the FIRST2 set to determine what characters can legally follow the dot from where we are, and shift on any of them.
  • The LR(2) action table is keyed by pairs of characters rather than single characters, but the goto table still is keyed by character.

Interestingly, once you know how to build an LR(2) parser, you can generalize the idea to build an LR(k) parser for any k ≥ 2. In particular, all the "new surprises" that arose here are the same ones that you'll see for larger and larger values of k.

In practice, LR(2) parsers are rarely used due to the size of their action tables and the fact that they generally have way more states than an LR(1) parser because of the increased lookaheads. But it's still worthwhile, IMHO, to see how they work. It gives you a sense of how LR parsing works more generally.

Hope this helps!


A huge thanks to @AProgrammer's answer on cs.stackexchange.com about dot lookaheads versus item lookaheads, which helped me get a better sense of how dot lookaheads function and what their purpose is!

If you'd like to read the original paper on LR(k) parsing, which specs out the full rules for LR(k) parsing in full detail, check out "On the Translation of Languages from Left to Right" by Don Knuth.
