Is there a regular language to represent regular expressions?


Question

Specifically, I noticed that the language of regular expressions is not itself regular, so I can't use a regular expression to parse a given regular expression. I need a parser, since the language of regular expressions is context-free.

Is there any way to represent regular expressions such that the resulting string can be parsed using a regular expression?

Note: My question isn't about whether there is a regexp that matches the current syntax of regexes, but whether there exists a "representation" for regular expressions as we know them today (maybe not as neat as the one we use today) that can be parsed using regular expressions. Also, could someone please remove the duplicate flag, since this isn't a duplicate? I'm asking something completely different. I already know that the current language of regular expressions isn't regular (that is how I started my original question).

Recommended answer

Depending on what you mean by "represent", the answer is "yes" or "no":

If you want a language that maps 1:1 (homomorphically) to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular.
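The non-regularity claim can be sketched as follows. This is only an outline, assuming the alphabet contains a literal a and that unbalanced parentheses are syntax errors; R is a name introduced here for the set of all valid standard regular expressions.

```latex
\[
R \cap \{\, \texttt{(}^m \, \texttt{a} \, \texttt{)}^n \mid m, n \ge 0 \,\}
  = \{\, \texttt{(}^n \, \texttt{a} \, \texttt{)}^n \mid n \ge 0 \,\}
\]
```

The second set on the left is regular, and regular languages are closed under intersection. The set on the right is not regular by the pumping lemma, so R cannot be regular either.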

If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:


  1. The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can write the natural numbers using the regular expression 0|1[01]*, and the regular language denoted by a (string representing the) natural number n is the regular language denoted by f(n).

Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.
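A minimal sketch of such a mapping, under assumptions that are not part of the answer: the alphabet is restricted to a, b, |, *, ( and ), the grammar is a small hand-written one, and the names `is_valid`, `regexes`, `f` and `decode` are hypothetical. It lists all strings in length-then-lexicographic order, keeps the valid regexes, and lets f(n) be the n-th one, so every valid regex is reached by some n.

```python
import re
from itertools import count, islice, product

ALPHABET = "ab|*()"  # assumed toy alphabet: two literals plus the operators


def is_valid(s: str) -> bool:
    """Recursive-descent check for: regex := branch ('|' branch)*,
    branch := (atom '*'*)+, atom := 'a' | 'b' | '(' regex ')'."""
    pos = 0

    def regex() -> bool:
        nonlocal pos
        if not branch():
            return False
        while pos < len(s) and s[pos] == "|":
            pos += 1
            if not branch():
                return False
        return True

    def branch() -> bool:
        nonlocal pos
        pieces = 0
        while atom():
            while pos < len(s) and s[pos] == "*":
                pos += 1
            pieces += 1
        return pieces > 0  # empty branches are rejected in this sketch

    def atom() -> bool:
        nonlocal pos
        if pos < len(s) and s[pos] in "ab":
            pos += 1
            return True
        if pos < len(s) and s[pos] == "(":
            saved = pos
            pos += 1
            if regex() and pos < len(s) and s[pos] == ")":
                pos += 1
                return True
            pos = saved  # unmatched '(' : leave it unconsumed
        return False

    return regex() and pos == len(s)


def regexes():
    """Yield every valid regex, shortest first."""
    for length in count(1):
        for chars in product(ALPHABET, repeat=length):
            candidate = "".join(chars)
            if is_valid(candidate):
                yield candidate


def f(n: int) -> str:
    """The n-th valid regex (0-based); surjective onto the valid regexes."""
    return next(islice(regexes(), n, None))


def decode(numeral: str) -> str:
    """Map a binary numeral matching 0|1[01]* to the regex it denotes."""
    if not re.fullmatch(r"0|1[01]*", numeral):
        raise ValueError(f"not a natural number: {numeral!r}")
    return f(int(numeral, 2))
```

With this ordering, `decode("0")` gives `a` and `decode("100")` gives `a*`, which illustrates the point above: the numeral says nothing readable about the language it denotes.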

  2. As parentheses are the only non-regular part of simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and to define semantics for them.

The obvious choice would be to ignore non-matching opening parentheses and to interpret non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".

This variant even has the same abstract syntax as standard regular expressions.
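The implicit-insertion reading can be sketched as a small normalization step. This is only an illustration with a hypothetical function name, `balance`; it assumes there are no escaped parentheses or character classes, and it leaves the special case of (* untouched.

```python
def balance(regex: str) -> str:
    """Add the implied parentheses so that the result is balanced.

    Assumption: every '(' and ')' is a grouping parenthesis (no escapes).
    """
    open_unmatched = 0   # '(' seen so far that still lack a ')'
    close_unmatched = 0  # ')' seen with no earlier '(' left to pair with
    for ch in regex:
        if ch == "(":
            open_unmatched += 1
        elif ch == ")":
            if open_unmatched > 0:
                open_unmatched -= 1
            else:
                close_unmatched += 1
    # Dangling ')' pair with the start, dangling '(' are closed at the end.
    return "(" * close_unmatched + regex + ")" * open_unmatched
```

For example, `balance("a)b(c")` returns `(a)b(c)`, and an already balanced input comes back unchanged.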

  3. Another variant would be to specify the NFA that recognizes the language directly using a regular language, e.g.: ([a-z]+,([^,]|\\,|\\\\)+,[a-z]+\$?;)*

The idea is that [a-z]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c, with a $ marking accepting transitions (cf. the note below). In c, backslashes are used to escape commas or backslashes. I assumed the same alphabet as for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denoting characters of any alphabet you wish. The first source state mentioned is the (single) initial state. An empty expression defines the empty language.

Above, I wrote "accepting transition", not "accepting state", because the latter would be a bit hard to represent in a purely regular language. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t. This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non-$ triple.
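A rough sketch of how such a specification could be validated and run. The names `TRIPLE`, `unescape` and `accepts` are hypothetical. Two assumptions differ from the pattern in the answer: a backslash must always be followed by a comma or a backslash (so every label decodes in exactly one way), and a label longer than one character is read as a string to be consumed in one step.

```python
import re

# One triple: source,label,target with an optional "$", closed by ";".
# Assumption: stricter label syntax than in the answer (no bare backslash).
TRIPLE = r"([a-z]+),((?:\\[,\\]|[^,\\])+),([a-z]+)(\$?);"
TRIPLE_RE = re.compile(TRIPLE)
SPEC_RE = re.compile(rf"(?:{TRIPLE})*")


def unescape(label: str) -> str:
    """Turn the escapes for comma and backslash into the plain characters."""
    return re.sub(r"\\([,\\])", r"\1", label)


def accepts(spec: str, text: str) -> bool:
    """Check whether the NFA described by spec accepts text."""
    if not SPEC_RE.fullmatch(spec):
        raise ValueError("not a valid NFA specification")
    triples = [
        (m[1], unescape(m[2]), m[3], m[4] == "$")
        for m in TRIPLE_RE.finditer(spec)
    ]
    if not triples:
        return False  # the empty expression defines the empty language
    start = triples[0][0]  # first source state mentioned is the initial one
    seen = {(start, 0)}
    todo = [(start, 0)]
    while todo:
        state, pos = todo.pop()
        for source, label, target, accepting in triples:
            if source != state or not text.startswith(label, pos):
                continue
            end = pos + len(label)
            # Accept when an accepting transition consumes the last input.
            if accepting and end == len(text):
                return True
            if (target, end) not in seen:
                seen.add((target, end))
                todo.append((target, end))
    return False
```

With `spec = "p,a,q;q,b,p$;"`, the call `accepts(spec, "abab")` is true and `accepts(spec, "aba")` is false. Since every triple consumes at least one character, this sketch never accepts the empty input.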

One note that might make the "yes" part look more intuitive: assembly languages are regular, and those are even Turing-complete, so it would be surprising if it weren't possible to specify "mere" regular languages using a regular language.
