解析一个块,其中每行以特定符号开头 [英] Parse a block where each line starts with a specific symbol

查看:109
本文介绍了解析一个块,其中每行以特定符号开头的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要解析一个看起来像这样的代码块:

I need to parse a block of code which looks like this:

* Block
| Line 1
| Line 2
| ...

这很容易做到:

block : head lines;
head  : '*' line;
lines : lines '|' line
      | '|' line
      ;

现在我想知道如何添加嵌套块,例如:

Now I wonder, how can I add nested blocks, e.g.:

* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...

这可以表示为LALR语法吗?

我当然可以解析顶级块,然后再次运行解析器以处理这些顶级块中的每一个.但是,我只是在学习这个主题,因此避免这种方法对我来说很有趣.

I can, of course, parse the top-level blocks and than run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic so it's interesting for me to avoid such approach.

推荐答案

嵌套块语言不是上下文无关的[注2],因此无法使用LALR(k)解析器进行解析.

The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.

但是,嵌套的括号语言当然是上下文无关的,并且通过替换词法扫描器中的初始 | 序列,将输入转换为括号形式相对容易.转换很简单:

However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:

  • | s的初始序列长于前一行时,请插入BEGIN_BLOCK. (初始序列必须确切地长一个 | ;否则可能是语法错误.)

  • when the initial sequence of |s is longer than the previous line, insert an BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)

| s的初始序列比前一行短时,将插入足够的END_BLOCK s以使期望的长度达到正确的值.

when the initial sequence of |s, is shorter then the previous line, enough END_BLOCKs are inserted to bring the expected length to the correct value.

| 本身不会传递到解析器.

The |s themselves are not passed through to the parser.

这与INDENT/DEDENT策略非常相似,该策略用于解析可感知布局的语言,例如Python和Haskell.主要区别在于,这里我们不需要一堆缩进级别.

This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python an Haskell. The main difference is that here we don't need a stack of indent levels.

转换完成后,语法将类似于:

Once that transformation is finished, the grammar will look something like:

content: /* empty */
       | content line
       | content block
block  : head BEGIN_BLOCK content END_BLOCK
       | head
head   : '*' line

flex实现的大致轮廓如下:(请参见下面的注释1).

A rough outline of a flex implementation would be something like this: (see Note 1, below).

%x INDENT CONTENT
%%
  static int depth = 0, new_depth = 0;
  /* Handle pending END_BLOCKs */
  send_end:
    if (new_depth < depth) {
      --depth;
      return END_BLOCK;
  }
^"|"[[:blank:]]*   { new_depth = 1; BEGIN(INDENT); }
^.                 { new_depth = 0; yyless(0); BEGIN(CONTENT);
                     goto send_end; }
^\n                /* Ignore blank lines */
<INDENT>{
  "|"[[:blank:]]*  ++new_depth;
  .                { yyless(0); BEGIN(CONTENT);
                     if (new_depth > depth) {
                       ++depth;
                       if (new_depth > depth) { /* Report syntax error */ }
                       return BEGIN_BLOCK;
                     } else goto send_end;
                   }
  \n               BEGIN(INITIAL); /* Maybe you care about this blank line? */
}
  /* Put whatever you use here to lexically scan the lines */
<CONTENT>{
  \n               BEGIN(INITIAL);
}

注意:

  1. 并非每个人都会对goto感到满意,但是它节省了一些代码重复.状态变量(depthnew_depth)是局部static变量,这一事实使词法分析器不可重入且不可重启(发生错误后).这仅对玩具代码有用.对于任何真实的东西,您都应该重新输入词法扫描器,并将状态变量放入extra数据结构中.

  1. Not everyone will be happy with the goto but it saves some code-duplication. The fact that the state variable (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only useful for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.

术语无上下文"和上下文敏感"是语法的技术描述,因此有点误导.基于单词所指含义的直觉通常是错误的.上下文敏感性的一种非常常见的来源是一种语言,其有效性取决于产生相同标记序列的同一非终端的两个不同派生. (假设非终端可以导出一个以上的令牌序列;否则,可以消除非终端.)

The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)

在普通的编程语言中,有许多这样的上下文敏感示例.通常,语法将允许这些构造,并且稍后将在某些语义分析阶段执行检查.这些要求包括声明一个标识符(IDENTIFIER的两个派生产生相同的字符串)或使用正确数量的参数调用一个函数的要求(在这里,只需要声明一个标识符的长度)即可.非终端匹配,但这足以触发上下文相关性.

There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).

在这种情况下,要求连续行中可能被称为bar-prefix的两个实例产生相同的 | 字符串.在这种情况下,由于效果实际上是语法上的,因此推迟到以后的语义分析将使解析的重点失效.上下文相关性的其他示例是句法"还是语义"是一个辩论,会产生令人惊讶的热量,而不会给讨论带来太多启发.

In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.

这篇关于解析一个块,其中每行以特定符号开头的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆