Parsing XML/XHTML data with Regex


Problem description

I've read the famous post. I've seen the attempts, some with limited success and some outright failures. Oh, the flame wars, both here and elsewhere.

But it can be done.

While I'm aware that the actual argument (read: fact) is that regular expressions are unfit to parse structured data trees, due to their inability to monitor and change state, I feel that some blindly discard the possibility. Application logic is necessary to keep state, but as this working example shows, it can be done.

Relevant snippet follows:

const PARSE_MODE_NEXT = 0;
const PARSE_MODE_ELEMENT = 1;
const PARSE_MODE_ENTITY = 3;
const PARSE_MODE_COMMENT = 4;
const PARSE_MODE_CDATA = 5;
const PARSE_MODE_PROC = 6;

protected $_parseModes = array(
        self::PARSE_MODE_NEXT     => '% < (?: (?: (?<entity>!) (?: (?<comment>--) | (?<cdata>\[CDATA\[) ) ) | (?<proc>\?) )? %six',
        self::PARSE_MODE_ELEMENT  => '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six',
        self::PARSE_MODE_ENTITY   => '% (?<entity> .*? ) > (?<text> [^<]* ) %six',
        self::PARSE_MODE_COMMENT  => '% (?<comment> .*? ) --> (?<text> [^<]* ) %six',
        self::PARSE_MODE_CDATA    => '% (?<cdata> .*? ) \]\]> (?<text> [^<]* ) %six',
        self::PARSE_MODE_PROC     => '% (?<proc> .*? ) \?> (?<text> [^<]* ) %six',
    );

public function load($string){
    $parseMode = self::PARSE_MODE_NEXT;
    $parseOffset = 0;
    $context = $this;
    while(preg_match($this->_parseModes[$parseMode], $string, $match, PREG_OFFSET_CAPTURE, $parseOffset)){
        if($parseMode == self::PARSE_MODE_NEXT){
            switch(true){
                case (!($match['entity'][0] || $match['comment'][0] || $match['cdata'][0] || $match['proc'][0])):
                    $parseMode = self::PARSE_MODE_ELEMENT;
                    break;
                case ($match['proc'][0]):
                    $parseMode = self::PARSE_MODE_PROC;
                    break;
                case ($match['cdata'][0]):
                    $parseMode = self::PARSE_MODE_CDATA;
                    break;
                case ($match['comment'][0]):
                    $parseMode = self::PARSE_MODE_COMMENT;
                    break;
                case ($match['entity'][0]):
                    $parseMode = self::PARSE_MODE_ENTITY;
                    break;
            }
        }else{
            switch($parseMode){
                case (self::PARSE_MODE_ELEMENT):
                    switch(true){
                        case (!($match['close'][0] || $match['empty'][0])):
                            $context = $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case ($match['empty'][0]):
                            $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case ($match['close'][0]):
                            $context = $context->_parent;
                            break;
                    }
                    break;
                case (self::PARSE_MODE_ENTITY):
                    $context->addChild(new ZuqMLEntity($match['entity'][0]));
                    break;
                case (self::PARSE_MODE_COMMENT):
                    $context->addChild(new ZuqMLComment($match['comment'][0]));
                    break;
                case (self::PARSE_MODE_CDATA):
                    $context->addChild(new ZuqMLCharacterData($match['cdata'][0]));
                    break;
                case (self::PARSE_MODE_PROC):
                    $context->addChild(new ZuqMLProcessingInstruction($match['proc'][0]));
                    break;
            }
            $parseMode = self::PARSE_MODE_NEXT;
        }
        if(trim($match['text'][0])){
            $context->addChild(new ZuqMLText($match['text'][0]));
        }
        $parseOffset = $match[0][1] + strlen($match[0][0]);
    }

}

Is it complete? Nope.

Is it unbreakable? Certainly not.

Is it fast? Haven't benchmarked, but I cannot imagine it's as fast as DOM.

Does it support XPath/XQuery? Obviously not.

Does it validate or perform any other auxiliary tasks? Sure doesn't.

Will it supersede DOM? Hell no.

However, will it parse this?

<?xml version="1.0" encoding="utf-8"?>
<!ENTITY name="value">
<root>
    <node>
        <node />
        Foo
        <node name="value">
            <node>Bar</node>
        </node>
        <!-- Comment -->
    </node>
    <node>
        <[CDATA[ Character Data ]]>
    </node>
</root>

Yes. Yes it will.

While I would welcome this thread becoming a Community Wiki given it meets the requirements, I'll turn this statement into a question.

Focusing on the regex, can anyone foresee a situation under which this would fail horribly when used against well-formed markup? I think I've covered all my bases.
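One candidate failure case, offered as a sketch rather than a full audit: XML allows a literal ">" inside an attribute value, but the lazy `.*?` in the ELEMENT-mode pattern stops at the first ">" it sees. The pattern below is copied from the snippet above; the helper name `matchElementAfterOpenBracket` and the sample markup are made up for illustration.

```php
<?php
// ELEMENT-mode pattern copied from the $_parseModes table in the question.
const ELEMENT_PATTERN = '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six';

/**
 * Hypothetical helper: apply the ELEMENT-mode pattern starting at offset 1,
 * i.e. right after the "<" that the NEXT-mode pattern would have consumed.
 */
function matchElementAfterOpenBracket(string $markup): array
{
    preg_match(ELEMENT_PATTERN, $markup, $match, 0, 1);
    return ['element' => $match['element'], 'text' => $match['text']];
}

// Well-formed XML: ">" does not need escaping inside an attribute value.
$markup = '<node title="a > b">text</node>';
$result = matchElementAfterOpenBracket($markup);

var_dump($result['element']); // the tag is cut off inside the attribute value
var_dump($result['text']);    // the remainder of the tag leaks into the text capture
```

The same kind of early stop would presumably affect a "/" or ">" appearing in other quoted contexts, so the pattern would need to treat quoted attribute values as units before it could be trusted on arbitrary well-formed input.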

I have no intention of "stirring the pot"; however, I'd like some insight from both sides of the coin.

Note also that the purpose for having written this was that SimpleXML was too simple, and DOM was too strict for one of my applications.

Solution

"Focusing on the regex, can anyone foresee a situation under which this would fail horribly when used against well-formed markup?"

When run against the XML conformance test suite, how many well-formed XML documents does it reject, and how many ill-formed XML documents does it accept?

Perhaps the biggest objection from those who share the culture of the XML community is that it will not only parse most well-formed XML documents, it will also parse most non-XML documents, in the sense that it doesn't tell you they are ill-formed. Now perhaps you think that doesn't matter too much in your environment - but in the end, if you accept ill-formed documents, then people will start sending you ill-formed documents, and before long you are in the same mess as HTML, where you have to accept any old rubbish for legacy reasons.
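To make the point about ill-formed input concrete, here is a minimal sketch of how a conforming parser reports the problem. It assumes PHP's bundled DOM/libxml extension is available; the helper name `wellFormednessErrors` and the two sample documents are hypothetical. Note that the `load()` loop in the question moves to `$context->_parent` on any closing tag without comparing names, so it would most likely accept the second document silently.

```php
<?php
/**
 * Hypothetical helper: return libxml's error messages for $xml,
 * or an empty array when the document is well-formed.
 */
function wellFormednessErrors(string $xml): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = array_map(
        static fn(LibXMLError $e): string => trim($e->message),
        libxml_get_errors()
    );

    // Restore the caller's error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Well-formed: no errors expected.
print_r(wellFormednessErrors('<root><node>Bar</node></root>'));

// Ill-formed: <node> is closed by </root>, which a conforming parser must reject.
print_r(wellFormednessErrors('<root><node>Bar</root>'));
```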

I don't know enough PHP to judge quickly how well your code will work against well-formed XML. But I question the motivation: why on earth would you want to write a cheap-and-dirty-and-slow XML parser by hand when there are perfectly good-and-correct-and-fast-and-free ones available off the shelf?
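As a rough illustration of the off-the-shelf route, the sketch below walks a document with PHP's bundled DOM extension and reports the same node kinds the question's parser distinguishes (elements, text, comments, CDATA sections, processing instructions). The helper name `describeNodes` and the sample document are assumptions made for this example; the sample is a well-formed variant of the question's test input, not the original.

```php
<?php
/**
 * Hypothetical helper: flatten a DOM subtree into "type:value" strings,
 * skipping whitespace-only text nodes.
 */
function describeNodes(DOMNode $parent): array
{
    $out = [];
    foreach ($parent->childNodes as $child) {
        switch ($child->nodeType) {
            case XML_ELEMENT_NODE:
                $out[] = 'element:' . $child->nodeName;
                // Recurse so nested content is listed in document order.
                $out = array_merge($out, describeNodes($child));
                break;
            case XML_TEXT_NODE:
                if (trim($child->nodeValue) !== '') {
                    $out[] = 'text:' . trim($child->nodeValue);
                }
                break;
            case XML_CDATA_SECTION_NODE:
                $out[] = 'cdata:' . trim($child->nodeValue);
                break;
            case XML_COMMENT_NODE:
                $out[] = 'comment:' . trim($child->nodeValue);
                break;
            case XML_PI_NODE:
                // For a processing instruction, nodeName is its target.
                $out[] = 'pi:' . $child->nodeName;
                break;
        }
    }
    return $out;
}

$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<root>
    <?app hint?>
    <node name="value">Bar</node>
    <!-- Comment -->
    <![CDATA[ Character Data ]]>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
print_r(describeNodes($doc));
```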
