Formal grammar of XML

Question

I'm trying to build a small parser for XML files in C. I know I could find a finished solution, but I only need something basic for an embedded project. I'm trying to write a grammar that describes XML without attributes, just tags, but it does not seem to work and I have not been able to figure out why.

Here is the grammar:

   XML : FIRST_TAG NIZ
   NIZ : VAL NIZ | eps
   VAL : START VAL END
     | STR
     | eps

Here is part of the C code that implements this grammar:

void check() {
    getSymbol();
    if (sym == FIRST_LINE) {
        niz();
    } else {
        printf("FIRST_LINE EXPECTED");
        exit(1);
    }
}

void niz() {
    getSymbol();
    if (sym == ERROR)
        return;
    if (sym == START) {
        back = 1;
        val();
        niz();
    }
    printf(" EPS OR START EXPECTED\n");
}

void val() {
    getSymbol();
    if (sym == ERROR)
        return;
    if (sym == START) {
        back = 0;

        val();
        getSymbol();
        if (sym != END) {
            printf("END EXPECTED");
            exit(1);
        }
        return;
    }
    if (sym == EMPTY_TAG || sym == STR)
        return;
    printf("START, STR, EMPTY_TAG OR EPS EXPECTED\n");
    exit(1);
}

void getSymbol() {
    int pom;

    if (back == 1) {
        back = 0;
        return;
    }
    sym = getNextToken(cmd + offset, &pom);
    offset += pom + 1;
}

Here is an example of an XML file that does not satisfy this grammar:

<?xml version="1.0"?> 
<VATCHANGES> 
<DATE>15/08/2012</DATE>
<TIME>1452</TIME>
<EFDSERIAL>01KE000001</EFDSERIAL> 
<CHANGENUM>1</CHANGENUM> 
<VATRATE>A</VATRATE> 
<FROMVALUE>16.00</FROMVALUE> 
<TOVALUE>18.00</TOVALUE> 
<VATRATE>B</VATRATE> 
<FROMVALUE>2.00</FROMVALUE> 
<TOVALUE>0.00</TOVALUE> 
<VATRATE>C</VATRATE> 
<FROMVALUE>5.00</FROMVALUE> 
<TOVALUE>0.00</TOVALUE> 
<DATE>25/05/2010</DATE> 
<CHANGENUM>2</CHANGENUM> 
<VATRATE>C</VATRATE> 
<FROMVALUE>0.00</FROMVALUE> 
<TOVALUE>4.00</TOVALUE> 
</VATCHANGES> 

The output is END EXPECTED.

Recommended answer

First, your grammar needs some work. Assuming the preamble is handled correctly, there is a basic error in the definition of NIZ.

NIZ : VAL NIZ | eps
VAL : START VAL END
    | STR
    | eps

So we enter NIZ and look for VAL first. The problem is the eps at the end of both VAL's and NIZ's productions. If VAL produces nothing (i.e. eps), it consumes no tokens in the process, since eps is the production, and NIZ reduces to:

NIZ: eps NIZ | eps

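That is not good: NIZ can then derive itself without consuming any input, so the parser has no way of knowing when to stop.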
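The effect of that reduction can be sketched in C. This is only an illustration under stated assumptions: the token stream, the `Sym` values, `run` and the `CALL_LIMIT` guard are hypothetical names, not the asker's code, and `niz` is written naively so that it takes the first alternative whenever `val` succeeds.

```c
#include <stddef.h>

typedef enum { START, END, STR, EOI } Sym;   /* hypothetical token kinds */

#define CALL_LIMIT 1000    /* guard so the demonstration terminates */

static const Sym *input;   /* hypothetical pre-tokenized input */
static size_t pos;         /* index of the next unread token */
static int calls;          /* how many times niz() has been entered */

/* VAL : START VAL END | STR | eps */
static int val(void)
{
    if (input[pos] == START) {
        pos++;
        if (!val() || input[pos] != END)
            return 0;
        pos++;
        return 1;
    }
    if (input[pos] == STR) {
        pos++;
        return 1;
    }
    return 1;              /* eps: succeeds without consuming a token */
}

/* NIZ : VAL NIZ | eps, always trying the first alternative */
static int niz(void)
{
    if (++calls > CALL_LIMIT)
        return 0;          /* without this guard the recursion never ends */
    if (val())
        return niz();      /* val() matched eps: no progress was made */
    return 1;
}

/* Parses the token stream and reports how many times niz() was entered. */
static int run(const Sym *tokens)
{
    input = tokens;
    pos = 0;
    calls = 0;
    niz();
    return calls;
}
```

Calling `run` on the stream START, STR, END, EOI consumes the three real tokens on the first pass; after that `val` keeps matching eps at EOI and `niz` keeps recursing until the guard trips.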

Consider something more along these lines. I put this together quickly, with no real thought for anything beyond a purely basic construction.

XML:         START_LINE ELEMENT
ELEMENT:     OPENTAG BODY CLOSETAG
OPENTAG:     lt id(n) gt
CLOSETAG:    lt fs id(n) gt
BODY:        ELEMENT | VALUE
VALUE:       str | eps

This is super basic. Terminals include:

lt:    '<'
gt:    '>'
fs:    '/'
str:   any alphanumeric string excluding chars lt or gt.
id(n): any alphanumeric string excluding chars lt, gt, or fs. 
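A lexer for these terminals might look roughly like the sketch below. The names `TokKind`, `Token`, `Lexer` and `next_token` are made up for illustration and are not the asker's `getNextToken`. The sketch assumes the prolog line has already been skipped, and it tracks whether it is inside a tag so that '/' counts as fs only there (text such as 15/08/2012 contains slashes).

```c
#include <stddef.h>
#include <string.h>

typedef enum { T_LT, T_GT, T_FS, T_ID, T_STR, T_EOF } TokKind;

typedef struct {
    TokKind kind;
    const char *start;   /* points into the input; nothing is copied */
    size_t len;
} Token;

typedef struct {
    const char *p;       /* current read position */
    int in_tag;          /* 1 between '<' and '>', 0 otherwise */
} Lexer;

static Token next_token(Lexer *lx)
{
    Token t;

    /* Whitespace between tokens is not significant in this sketch. */
    while (*lx->p == ' ' || *lx->p == '\t' || *lx->p == '\r' || *lx->p == '\n')
        lx->p++;

    t.start = lx->p;
    t.len = 1;

    switch (*lx->p) {
    case '\0':
        t.kind = T_EOF;
        t.len = 0;
        return t;
    case '<':
        t.kind = T_LT;
        lx->in_tag = 1;
        lx->p++;
        return t;
    case '>':
        t.kind = T_GT;
        lx->in_tag = 0;
        lx->p++;
        return t;
    case '/':
        if (lx->in_tag) {            /* fs is only a terminal inside a tag */
            t.kind = T_FS;
            lx->p++;
            return t;
        }
        break;                       /* outside a tag '/' is ordinary text */
    default:
        break;
    }

    /* id stops at lt, gt, fs or whitespace; str stops at lt or gt. */
    t.len = strcspn(lx->p, lx->in_tag ? "<>/ \t\r\n" : "<>");
    t.kind = lx->in_tag ? T_ID : T_STR;
    lx->p += t.len;
    return t;
}
```

Starting from `Lexer lx = { "<A>1/2</A>", 0 };`, repeated calls yield lt, id, gt, str, lt, fs, id, gt and then `T_EOF`. A str token may carry trailing whitespace; trimming it is left to the caller.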

I can almost feel the wrath of the XML purists raining down on me right now, but the point I'm trying to get across is that, when a grammar is well-defined, the recursive-descent parser (RDP) will practically write itself. Obviously the lexer (i.e. the token engine) needs to handle the terminals accordingly. Note: id(n) is an id stack that ensures you properly close the innermost tag; it is an attribute of your parser, tied to how it manages tag ids. It's not traditional, but it makes things MUCH easier.
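As a rough illustration of how directly the grammar maps onto code, here is a minimal recursive-descent sketch in C. It works straight on characters rather than on lexer tokens to stay short. `Parser`, `parse_element`, `parse_body`, `parse_document` and `MAX_DEPTH` are hypothetical names. One deliberate departure, which is my assumption and not part of the grammar as posted: BODY accepts a sequence of child elements, so that files like the one in the question can be parsed. The prolog is assumed to be stripped beforehand.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 16   /* assumed nesting limit for a small embedded target */

typedef struct {
    const char *p;               /* read position in the input */
    const char *id[MAX_DEPTH];   /* id stack: start of each open tag name */
    size_t id_len[MAX_DEPTH];    /* length of each name on the stack */
    int depth;                   /* number of currently open elements */
} Parser;

static void skip_ws(Parser *ps)
{
    while (isspace((unsigned char)*ps->p))
        ps->p++;
}

static int parse_element(Parser *ps);

/* BODY: ELEMENT { ELEMENT } | str | eps   (the repetition is an extension) */
static int parse_body(Parser *ps)
{
    skip_ws(ps);
    if (ps->p[0] == '<' && ps->p[1] != '/') {
        do {
            if (!parse_element(ps))
                return 0;
            skip_ws(ps);
        } while (ps->p[0] == '<' && ps->p[1] != '/');
        return 1;
    }
    ps->p += strcspn(ps->p, "<>");   /* str; a length of zero is eps */
    return 1;
}

/* ELEMENT: OPENTAG BODY CLOSETAG */
static int parse_element(Parser *ps)
{
    size_t n;

    /* OPENTAG: lt id gt, pushing the id onto the stack */
    if (*ps->p != '<' || ps->depth == MAX_DEPTH)
        return 0;
    ps->p++;
    n = strcspn(ps->p, "<>/ \t\r\n");
    if (n == 0 || ps->p[n] != '>')
        return 0;
    ps->id[ps->depth] = ps->p;
    ps->id_len[ps->depth] = n;
    ps->depth++;
    ps->p += n + 1;

    if (!parse_body(ps))
        return 0;

    /* CLOSETAG: lt fs id gt, where id must equal the top of the stack */
    if (ps->p[0] != '<' || ps->p[1] != '/')
        return 0;
    ps->p += 2;
    ps->depth--;
    n = ps->id_len[ps->depth];
    if (strncmp(ps->p, ps->id[ps->depth], n) != 0 || ps->p[n] != '>')
        return 0;
    ps->p += n + 1;
    return 1;
}

/* Returns 1 if the input is exactly one properly nested element, else 0. */
static int parse_document(const char *xml)
{
    Parser ps = { 0 };

    ps.p = xml;
    skip_ws(&ps);
    if (!parse_element(&ps))
        return 0;
    skip_ws(&ps);
    return *ps.p == '\0';
}
```

In a purely recursive version the C call stack could hold the open tag name in a local variable instead; the explicit array is kept here because it mirrors the id(n) idea and bounds the nesting depth, which tends to matter on embedded targets.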

This grammar can and should be expanded to include stand-alone element declarations and short-cut element closure. For example, it allows elements of this form:

<ElementName>...</ElementName>

but not of this form:

<ElementName/>

Nor does it account for short-cut termination such as:

<ElementName>...</>

Accounting for such additions will obviously complicate the grammar considerably, but it will also make the parser substantially more robust. Like I said, the sample above is basic with a capital B. If you're really going to embark on this, these are things you want to consider when designing your grammar, and consequently your RDP.
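One possible way to prepare for those extra forms is to classify each tag before the parser decides what to do with it. The sketch below is only an assumption about how that could be done; `TagKind` and `classify_tag` are hypothetical names, and attributes are still not handled.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    TAG_OPEN,          /* <Name>  */
    TAG_CLOSE,         /* </Name> */
    TAG_EMPTY,         /* <Name/> */
    TAG_SHORT_CLOSE,   /* </>     */
    TAG_INVALID
} TagKind;

/*
 * Classifies the tag starting at s. On success, *name_len receives the
 * length of the tag name (0 for "</>") and *tag_len the length of the
 * whole tag including '<' and '>'.
 */
static TagKind classify_tag(const char *s, size_t *name_len, size_t *tag_len)
{
    const char *p = s;
    int closing = 0;
    size_t n;

    if (*p != '<')
        return TAG_INVALID;
    p++;
    if (*p == '/') {
        closing = 1;
        p++;
    }

    n = strcspn(p, "<>/ \t\r\n");    /* the name ends at lt, gt, fs or space */
    *name_len = n;
    p += n;

    if (closing) {
        if (*p != '>')
            return TAG_INVALID;
        *tag_len = (size_t)(p + 1 - s);
        return n ? TAG_CLOSE : TAG_SHORT_CLOSE;
    }
    if (n == 0)
        return TAG_INVALID;          /* "<>" and "</" without a name */
    if (p[0] == '/' && p[1] == '>') {
        *tag_len = (size_t)(p + 2 - s);
        return TAG_EMPTY;
    }
    if (*p == '>') {
        *tag_len = (size_t)(p + 1 - s);
        return TAG_OPEN;
    }
    return TAG_INVALID;
}
```

With this in place, a parser could treat `TAG_EMPTY` as an element with an empty body, and let `TAG_SHORT_CLOSE` pop the id stack without comparing names.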

Anyway, just consider how a few reworks of your grammar can make this substantially easier on you.
