乔姆斯基语言类型 [英] Chomsky Language Types

查看:60
本文介绍了乔姆斯基语言类型的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图理解四种不同的乔姆斯基语言类型,但是我发现的定义对我而言实际上没有任何意义.我知道类型0是自由语法,类型1是上下文相关的,类型2是上下文无关的,而类型3是常规的.所以,有人可以解释一下,并把它放在上下文中,谢谢.

解决方案

一种语言是属于该语言的一组单词.但是,很多时候,无需列出语言中的每个单词,而是只需指定一组规则即可生成该语言的单词(并且仅生成那些单词)即可识别所查询的语言.

注意:可能存在不止一组描述相同语言的规则.

通常,对规则的限制越多,语言的表达性就越差(可以从规则中生成较少的单词),但是如果单词属于规则指定的语言,则更容易识别.由于存在后者,我们希望确定对规则有最大限制的语言,这些语言仍将使我们能够生成相同的语言.

关于规则的几句话:通常,您用四个项目描述正式语言(又称四元组):

  1. 一组非终结符号( N )
  2. 一组终端符号( T )
  3. 一组生产规则( P )
  4. 开始符号(S)

终端符号(AKA字母)是该语言单词所组成的符号,通常是小写英文字母(例如'a','b','c')的子集

非终止符号在单词的生成中标记中间状态,指示可能的变换仍可应用于中间单词".终端符号和非终端符号之间没有重叠(即,两组符号的交集为空).用于非终结符的符号通常是大写英文字母的子集(例如,"A","B","C")

规则表示对一系列终端符号和非终端符号的可能转换.它们的形式为:左侧->右侧,左侧和右侧均由一系列终端符号和非终端符号组成.规则示例:aBc->cBa,这意味着一系列符号"aBc"被使用.(作为中间单词"的一部分)可以被序列"cBa"替换为"cBa".在生成单词的过程中.

开始符号是专用的非结束符号(通常由S表示),表示根".或开始"在单词生成中,即在单词生成系列中应用的第一个规则始终以起始符号为左侧.

当所有非终结词都已被替换为终结词时,单词的生成成功结束(因此,最终的中间词"仅由终结符组成,这表示我们到达了in-language-in-语言问题).

当不是所有非终端都已被终端代替时,单词的生成是不成功的,但是没有可应用于当前中间单词"的生产规则.在这种情况下,必须沿着不同的生产规则应用路径,从起始符号开始重新生成.

示例:

L = { T N P ,S},

其中

T = {a,b,c}

N = {A,B,C,S}

P = {S-> A,S-> AB,S-> BC,A-> a,B-> bb,C-> ccc}

表示三个单词的语言:"a","abb"和"bbccc"

规则的示例应用程序:

S-> AB-> aB-> abb

我们1)从起始符号(S)开始,2)应用第二条规则(用AB替换S),3)应用第四条规则(用a替换A),4)应用第五条规则(替换B与bb).由于结果"abb"中没有非结尾,因此它是该语言的一个词.

在一般性地谈论规则时,希腊符号 alpha beta gamma 等表示(可能为空)一系列终端+非终端符号;希腊字母 epsilon 表示空字符串(即空符号系列).

Chomsky层次结构中的四种不同类型描述了具有不同表达能力(对规则的不同限制)的语法.

  • Type 0 (或 Unrestricted )语法生成的语言最富表现力(较少受限制).递归可枚举语言集包含可以使用图灵机(基本上是计算机)生成的语言.这意味着,如果您使用的语言比该类型的语言更具表达力(例如英语),则无法编写一种算法,该算法可以列出该语言的每个(只有这些)单词.规则有一个限制:规则的左侧不能为空(不能在突然之间"引入任何符号).

  • Type 1 (上下文相关)语法生成的
  • 语言是上下文相关语言.规则具有以下形式的限制: alpha A beta -> alpha gamma beta ,其中 alpha beta可以为空,而 gamma 为非空(例外:S-> epsilon 规则,仅当开始符号S没有出现在任何规则的右侧时才允许使用.这将规则限制为在其左侧至少有一个非终止符,并允许它们具有上下文":在此规则示例中的非终止符A可以替换为 gamma ,仅当它被 alpha beta 包围时(在上下文中").规则的应用会保留上下文(即 alpha beta 不变).

  • Type 2 (无上下文)语法生成的
  • 语言是无上下文语言.规则具有以下形式的限制: gamma .这将规则限制为在它们的左侧恰好具有一个非终结符并且没有上下文".这实质上意味着,如果您在中间单词中看到非终结符,则可以应用左侧带有该非终结符的任何规则,以将其替换为右侧,而与周围环境无关非终结符的符号.大多数编程语言都有上下文无关的生成语法.

  • Type 3 ( Regular )语法生成的
  • 语言是 Regular 语言.规则具有以下形式的限制:A-> a或A-> aB(如果开始符号S未出现在规则上,则允许规则S-> epsilon 任何规则的右侧),这意味着每个非终结符必须恰好产生一个终结符(也可能产生一个非终结符).正则表达式生成/识别这种类型的语言.

其中一些限制可以通过某种方式取消/修改,以使修改后的语法具有相同的表达能力.修改后的规则可以允许其他算法识别一种语言的单词.

请注意(如前所述),一种语言通常可以由多种语法(甚至属于不同类型的语法)生成.语言族的表达能力通常等同于可以生成那些语言的限制性最强的语法类型的表达能力(例如,由常规(类型3)语法生成的语言也可以由类型2语法生成,但是它们的表达能力仍然是3型语法的力量.

示例

常规语法

T = {a,b}

N = {A,B,S}

P = {S-> aA,A-> aA,A-> aB,B-> bB,B-> b}

生成一种语言,该语言包含以非零数字"a"开头,然后是非零数字"b"开头的单词.请注意,不可能描述一种语言,其中每个单词都由多个"a"后跟相等数量的"b"以及常规语法组成.

无上下文语法

T = {a,b}

N = {A,B,S}

P = {S-> ASB,A-> a,B-> b}

生成一种语言,该语言包含以非零个数的"a"开头,后跟相等个数的"b"字开头的单词.请注意,无法描述一种语言,其中每个单词都包含多个"a",然后是相等数量的"b",然后是相等数量的"c",并且具有上下文无关的语法.

上下文敏感语法

T = {a,b,c}

N = {A,B,C,H,S}

P = {S-> aBC,S-> aSBC,CB-> HB,HB-> HC,HC-> BC,aB-> ab,bB-> bb,bC-> bc,cC-> cc}

生成一种语言,该语言包含以非零个数的"a"开头,后跟相等数目的"b",后跟相等数目的"c"开头的单词.H在该语法中的作用是使交换"成为可能.CB组合到BC组合,因此B可以在左侧收集,而C可以​​在右侧收集.请注意,无法描述由单词"a"组成的语言,其中"a"的数量是上下文相关语法的质数,但是可以编写生成该语言的不受限制的语法.

I'm trying to understand the four different Chomsky language types but the definitions that I have found don't really mean anything to me. I know type 0 is free grammar, type 1 is context sensitive, type 2 is context free whilst type 3 is regular. So, could someone please explain this and put it into context, thanks.

解决方案

A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.

Note: there can be more than one set of rules that desrcibe the same language.

In general, the more restrictions placed on the rules, the less expressive the language (less words can be generated from the rules), but easier to recognize if a word belongs to the language the rules specify. Because of the latter, we want to identify languages with the most restrictions on their rules that will still allow us to generate the same language.

A few words about the rules: In general, you describe a formal language with four items (AKA a four-tuple):

  1. The set of non-terminal symbols (N)
  2. The set of terminal symbols (T)
  3. The set of production rules (P)
  4. The start symbol (S)

The terminal symbols (AKA letters) are the symbols that words of the language consist of, ususally a subset of lowercase English letters (e.g. 'a', 'b', 'c')

The non-terminal symbols are marking an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets are empty). The symbols used for non-terminals are usually subsets of uppercase English letters (e.g. 'A', 'B', 'C')

The rules denote possible transformations on a series of terminal and non-terminal symbols. They are in the form of: left-side -> right-side, where both the left-side and the right-side consists of series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that a series of symbols "aBc" (as part of intermediary "words") can be replaced with the series "cBa" during the generation of words.

The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.

The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).

The generation of a word is unsuccessful, when not all non-terminals have been replaced with terminals, but there are no production rules that can be applied on the current intermediary "word". In this case the generation has to strart anew from the starting symbol, following a different path of production rule applications.

Example:

L={T, N, P, S},

where

T={a, b, c}

N={A, B, C, S}

P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}

which denotes the language of three words: "a", "abb" and "bbccc"

An example application of the rules:

S->AB->aB->abb

where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.

When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).

The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).

  • Languages generated by Type 0 (or Unrestricted) grammars are most expressive (less restricted). The set of Recursively Enumerable languages contain the languages that can be generated using a Turing machine (basically a computer). This means that if you have a language that is more expressive than this type (e.g. English), you cannot write an algorithm that can list each an every (and only these) words of the language. The rules have one restriction: the left-side of a rule cannot be empty (no symbols can be introduced "out of the blue").

  • Languages generated by Type 1 (Context-sensitive) grammars are the Context-sensitive languages. The rules have the restriction that they are in the form: alpha A beta -> alpha gamma beta, where alpha and beta can be empty, and gamma is non-empty (exception: the S->epsilon rule, which is only allowed if the start symbol S does not appear on the right-side of any rules). This restricts the rules to have at least one non-terminal on their left-side and allows them to have a "context": the non-terminal A in this rule example can be replaced with gamma, only if it is surrounded by ("is in the context of") alpha and beta. The application of the rule preserves the context (i.e. alpha and beta does not change).

  • Languages generated by Type 2 (Context-free) grammars are the Context-free languages. The rules have the restriction that they are in the form: A -> gamma. This restricts the rules to have exactly one non-terminal on their left-side and have no "context". This essentially means that if you see a non-terminal symbol in an intermediary word, you can apply any one of the rules that have that non-terminal symbol on their left-side to replace it with their right-side, regardless of the surroundings of the non-terminal symbol. Most programming languages have context free generating grammars.

  • Languages generated by Type 3 (Regular) grammars are the Regular languages. The rules have the restriction that they are of the form: A->a or A->aB (the rule S->epsilon is permitted if the starting symbol S does not appear on the right-side of any rules), which means that each non-terminal must produce exactly one terminal symbol (and possibly one non-terminal as well). The regular expressions generate/recognize languages of this type.

Some of these restrictions can be lifted/modified in a way to keep the modified grammar have the same expressive power. The modified rules can allow other algorithms to recognize the words of a language.

Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).

Examples

The regular grammar

T={a, b}

N={A, B, S}

P={S->aA, A->aA, A->aB, B->bB, B->b}

generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that is it not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.

The context-free grammar

T={a, b}

N={A, B, S}

P={S->ASB, A->a, B->b}

generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.

The context-sensitive grammar

T={a, b, c}

N={A, B, C, H, S}

P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}

generates tha language which contains words that start with non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination to a BC combination, so the B's can be gathered on the left, and the C's can be gathered on the right. Note that it is not possible to describe a language where the words consist of a series of 'a's, where the number of 'a's is a prime with context-sensitive grammars, but it is possible to write an unrestricted grammar that generates that language.

这篇关于乔姆斯基语言类型的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆