Is Rust's lexical grammar regular, context-free or context-sensitive?


Question


The lexical grammar of most programming languages is fairly non-expressive in order to quickly lex it. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:

let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);

which prints:

Hi lovely "\" and "#", welcome to Rust


As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?

Related: Is Rust's syntactical grammar context-free or context-sensitive?

Answer


The raw string literal syntax is not context-free.


If you think of it as a string surrounded by r#ᵏ"…"#ᵏ (using the superscript k as a count operator), then you might expect it to be context-free:

raw_string_literal
   : 'r' delimited_quoted_string
delimited_quoted_string
   : quoted_string
   | '#' delimited_quoted_string '#'


But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#ᵏ, although it can contain "#ʲ for any j < k.
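
For instance (a hypothetical illustration of the restriction, not code from the question), a raw string delimited by two hashes may contain "# but not "##:

// Accepted: the body contains "# (j = 1), shorter than the closing "## (k = 2).
let ok = r##"this "# is fine"##;
println!("{}", ok);

// Rejected: the body would contain "##, i.e. j = k, so the literal
// terminates early and the rest of the line is a syntax error.
// let bad = r##"this "## is not fine"##;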


Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)


C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
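
To illustrate why the context-sensitivity is not a practical obstacle, here is a minimal hand-rolled sketch (my own illustration, not how rustc's lexer is actually written) that recognizes a raw string literal by counting the opening hashes and then searching for a matching closing sequence:

// Sketch only: returns the body of a leading raw string literal and the
// number of bytes consumed, or None if the input does not start with one.
fn lex_raw_string(input: &str) -> Option<(&str, usize)> {
    let after_r = input.strip_prefix('r')?;
    let hashes = after_r.chars().take_while(|&c| c == '#').count();
    let body = after_r[hashes..].strip_prefix('"')?;
    // The terminator is a quote followed by exactly `hashes` hashes --
    // "count, then later match the same count" is the step a CFG cannot express.
    let terminator: String =
        std::iter::once('"').chain(std::iter::repeat('#').take(hashes)).collect();
    let end = body.find(terminator.as_str())?;
    let consumed = 1 + hashes + 1 + end + terminator.len();
    Some((&body[..end], consumed))
}

fn main() {
    let src = r###"r##"Hi lovely "\" and "#", welcome to Rust"## trailing input"###;
    let (body, len) = lex_raw_string(src).unwrap();
    println!("body = {:?}, consumed {} bytes", body, len);
}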


Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
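For example, a lexer meets all three uses of ' in code like this (an arbitrary snippet of mine, not taken from the answer):

fn main() {
    let c: char = 'x';            // ' delimits a character literal
    let s: &'static str = "hi";   // ' introduces the 'static lifetime
    'outer: for i in 0..3 {       // ' marks a loop label
        if i == 1 {
            break 'outer;
        }
    }
    println!("{} {}", c, s);
}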


Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
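A quick demonstration of that tokenisation decision:

fn main() {
    let ints: Vec<i32> = (2..3).collect(); // lexed as the three tokens 2, .., 3
    println!("{:?}", ints);                // prints [2]
    // Under maximal munch the lexer would read "2." as a float and then
    // choke on ".3"; a range over floats has to be written explicitly:
    let floats = 2.0..3.0;
    println!("{:?}", floats);              // prints 2.0..3.0
}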


On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's "tokens", or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)


Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:

a
a*b


Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite state transducer because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
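
A sketch of such a tokeniser for the toy language above (the function and its behaviour are my own illustration) makes the unbounded buffering explicit:

// Toy tokeniser for the two-token language { "a", "a*b" }.
// It has to scan an arbitrarily long run of 'a's before it can decide
// whether the run belongs to a single a*b token or to separate "a" tokens.
fn tokenise(mut input: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        let run = input.chars().take_while(|&c| c == 'a').count();
        if input[run..].starts_with('b') {
            tokens.push(&input[..run + 1]); // the whole run plus 'b' is one a*b token
            input = &input[run + 1..];
        } else if run > 0 {
            tokens.push(&input[..1]);       // no 'b' follows: fall back to a single "a"
            input = &input[1..];
        } else {
            break;                          // not part of the toy language
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenise("aaab")); // ["aaab"]
    println!("{:?}", tokenise("aaa"));  // ["a", "a", "a"]
}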


Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.

Note 1: Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.

