How would you go about implementing the off-side rule?

Question

I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule.

In short: the off-side rule here means that indentation is recognized as a syntactic element.

Here is the off-side rule in pseudocode, for building tokenizers that capture indentation in a usable form. I don't want to limit answers to any one language:

token NEWLINE
    matches r"\n\ *"
    increase line count
    pick up and store the indentation level
    remember to also record the current level of parenthesis

procedure layout tokens
    level = stack of indentation levels
    push 0 to level
    last_newline = none
    per each token
        if it is NEWLINE put it to last_newline and get next token
        if last_newline contains something
            extract new_level and parenthesis_count from last_newline
            - if newline was inside parentheses, do nothing
            - if new_level > level.top
                push new_level to level
                emit last_newline as INDENT token and clear last_newline
            - if new_level == level.top
                emit last_newline and clear last_newline
            - otherwise
                while new_level < level.top
                    pop from level
                    if new_level > level.top
                        freak out, indentation is broken.
                    emit last_newline as DEDENT token
                clear last_newline
        emit token
    while level.top != 0
        emit a DEDENT token
        pop from level

Comments are discarded before the tokens reach the layouter.
The layouter sits between the lexer and the parser.
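
For anyone who prefers something runnable, the pseudocode above could be sketched in Python roughly as follows. This is only one possible reading of it: the names `Token`, `TOKEN_RE`, `lex` and `layout` are made up for illustration, the toy lexer only knows words, single punctuation characters and spaces, and a NEWLINE at the very end of the input is simply dropped, as in the pseudocode.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    kind: str        # "NAME", "OP", "NEWLINE", "INDENT" or "DEDENT"
    text: str
    indent: int = 0  # NEWLINE only: number of spaces after the "\n"
    depth: int = 0   # NEWLINE only: open brackets at that point


# Hypothetical toy lexer: NEWLINE swallows the following indentation.
TOKEN_RE = re.compile(r"(?P<NEWLINE>\n *)|(?P<NAME>\w+)|(?P<OP>[^\w\s])|(?P<WS> +)")


def lex(source: str) -> Iterator[Token]:
    depth = 0
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "WS":
            continue  # spaces inside a line carry no meaning here
        if kind == "NEWLINE":
            yield Token(kind, text, indent=len(text) - 1, depth=depth)
            continue
        if text in "([{":
            depth += 1
        elif text in ")]}":
            depth = max(depth - 1, 0)
        yield Token(kind, text)


def layout(tokens: Iterable[Token]) -> Iterator[Token]:
    levels = [0]    # stack of indentation levels
    pending = None  # most recent NEWLINE not yet handled
    for tok in tokens:
        if tok.kind == "NEWLINE":
            pending = tok  # a later NEWLINE replaces an earlier one
            continue
        if pending is not None:
            if pending.depth == 0:  # newlines inside brackets are ignored
                if pending.indent > levels[-1]:
                    levels.append(pending.indent)
                    yield Token("INDENT", pending.text)
                elif pending.indent == levels[-1]:
                    yield pending
                else:
                    while pending.indent < levels[-1]:
                        levels.pop()
                        if pending.indent > levels[-1]:
                            raise IndentationError("dedent matches no outer level")
                        yield Token("DEDENT", pending.text)
            pending = None
        yield tok
    while levels[-1] != 0:  # close any blocks still open at end of input
        levels.pop()
        yield Token("DEDENT", "")
```

With this sketch, `[t.kind for t in layout(lex("a\n  b"))]` gives `['NAME', 'INDENT', 'NAME', 'DEDENT']`.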

This layouter never generates more than one NEWLINE at a time, and it doesn't generate a NEWLINE when an INDENT is coming up, so the parsing rules remain quite simple. I think it's pretty good, but let me know if there's a better way of accomplishing it.

After using this for a while, I've noticed that it may be nice to emit a NEWLINE after DEDENTs anyway. That way you can separate expressions with NEWLINE while keeping the INDENT ... DEDENT pair as a trailer for an expression.
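
One way to try that variation without touching the layouter is a small post-pass over its output. The sketch below assumes tokens are plain `(kind, text)` pairs and that no NEWLINE is wanted after DEDENTs that close the input; `newline_after_dedents` is a made-up name.

```python
from typing import Iterable, Iterator, Tuple

Tok = Tuple[str, str]  # (kind, text), an assumed minimal token shape


def newline_after_dedents(tokens: Iterable[Tok]) -> Iterator[Tok]:
    in_dedent_run = False
    for kind, text in tokens:
        # The first token after a run of DEDENTs gets a NEWLINE in front of it.
        if in_dedent_run and kind != "DEDENT":
            yield ("NEWLINE", "\n")
            in_dedent_run = False
        if kind == "DEDENT":
            in_dedent_run = True
        yield (kind, text)
```

A grammar can then treat NEWLINE purely as a separator between expressions and the INDENT ... DEDENT pair as an optional block that trails an expression.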

Recommended answer

I've written tokenizers and parsers for a couple of little indentation-centric domain-specific languages in the past couple of years, and what you have there looks pretty reasonable to me, for whatever that's worth. If I'm not mistaken, your method is quite similar to what Python does, for example, which seems like it ought to carry some weight.

Converting NEWLINE NEWLINE INDENT to just INDENT before it hits the parser definitely seems like the right way to do things -- it's a pain (IME) to always be peeking ahead for that in the parser! I've actually done that step as a separate layer in what ended up being a three step process: the first combined what your lexer and layouter do minus all the NEWLINE lookahead stuff (which made it very simple), the second (also very simple) layer folded consecutive NEWLINEs and converted NEWLINE INDENT to just INDENT (or, actually, COLON NEWLINE INDENT to INDENT, since in this case all indented blocks were always preceded by colons), then the parser was the third stage on top of that. But it also makes a lot of sense to me to do things the way you've described them, especially if you want to separate the lexer from the layouter, which presumably you'd want to do if you were using a code-generation tool to make your lexer, for instance, as is common practice.
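
To make that middle layer concrete, here is a rough Python sketch of the folding pass. It is not the code I actually used: it works on bare token-kind strings, `fold` is a made-up name, and the COLON NEWLINE INDENT variant is left out to keep it short.

```python
from typing import Iterable, Iterator


def fold(kinds: Iterable[str]) -> Iterator[str]:
    pending_newline = False
    for kind in kinds:
        if kind == "NEWLINE":
            pending_newline = True  # consecutive NEWLINEs collapse into one
            continue
        # A NEWLINE directly before an INDENT is swallowed by the INDENT.
        if pending_newline and kind != "INDENT":
            yield "NEWLINE"
        pending_newline = False
        yield kind
    if pending_newline:  # keep a trailing NEWLINE at end of input
        yield "NEWLINE"
```

The pass only ever holds back a single flag, so the first layer needs no lookahead of its own and the parser never sees NEWLINE NEWLINE or NEWLINE INDENT.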

I did have one application that needed to be a bit more flexible about indentation rules, essentially leaving the parser to enforce them when needed -- the following needed to be valid in certain contexts, for instance:

this line introduces an indented block of literal text:
    this line of the block is indented four spaces
  but this line is only indented two spaces

which doesn't work terribly well with INDENT/DEDENT tokens, since you end up needing to generate one INDENT for each column of indentation and an equal number of DEDENTs on the way back, unless you look way ahead to figure out where the indent levels are going to end up being, which it doesn't seem like you'd want a tokenizer to do. In that case I tried a few different things and ended up just storing a counter in each NEWLINE token that gave the change in indentation (positive or negative) for the following logical line. (Each token also stored all trailing whitespace, in case it needed preserving; for NEWLINE, the stored whitespace included the EOL itself, any intervening blank lines, and the indentation on the following logical line.) No separate INDENT or DEDENT tokens at all. Getting the parser to deal with that was a bit more work than just nesting INDENTs and DEDENTs, and might well have been hell with a complicated grammar that needed a fancy parser generator, but it wasn't nearly as bad as I'd feared, either. Again, no need for the parser to look ahead from NEWLINE to see if there's an INDENT coming up in this scheme.
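
As an illustration of the counter idea, and not the original code, a NEWLINE token carrying its whitespace and the indentation change could be produced along these lines. The names `Newline` and `newline_deltas` are invented, and only spaces count as indentation in this sketch.

```python
import re
from typing import List, NamedTuple


class Newline(NamedTuple):
    whitespace: str  # the EOL, any blank lines, and the next line's indentation
    delta: int       # indentation change for the following logical line


def newline_deltas(source: str) -> List[Newline]:
    # Indentation of the first logical line is the starting point.
    prev = len(source) - len(source.lstrip(" "))
    result = []
    # One match covers an EOL plus any blank lines and the indentation after them.
    for m in re.finditer(r"\n[ \n]*", source):
        ws = m.group()
        indent = len(ws) - ws.rfind("\n") - 1  # spaces after the last "\n"
        result.append(Newline(ws, indent - prev))
        prev = indent
    return result
```

For the three-line literal-text example above, the deltas come out as +4 and then -2, and it is up to the parser to decide whether a -2 ends a block or is just part of the literal text.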

Still, I think you'd agree that allowing and preserving all manner of crazy-looking whitespace in the tokenizer/layouter and letting the parser decide what's a literal and what's code is a bit of an unusual requirement! You certainly wouldn't want your parser to be saddled with that indentation counter if you just wanted to be able to parse Python code, for example. The way you're doing things is almost certainly the right approach for your application and many others besides. Though if anyone else has thoughts on how best to do this sort of thing, I'd obviously love to hear them....
