SimpleXML: parsing HTML leaves out nested elements inside an element with a text node

Problem description

I'm trying to parse a specific HTML document, a sort of dictionary, with about 10000 words and descriptions. It went well until I noticed that entries in one particular format don't get parsed correctly.

Here is an example:

    <?php
    $html = '
        <p>
            <b>
                <span>zot; zotz </span>
            </b>
            <span>Nista; nula. Isto
                <b>zilch; zip.</b>
            </span>
        </p>
        ';

    $xml = simplexml_load_string($html);

    var_dump($xml);
    ?>

The result of var_dump() is:

    object(SimpleXMLElement)#1 (2) {
      ["b"]=>
      object(SimpleXMLElement)#2 (1) {
        ["span"]=>
        string(10) "zot; zotz "
      }
      ["span"]=>
      string(39) "Nista; nula. Isto

            "
    }

As you can see, SimpleXML kept the text node inside the second <span> element but left out its child <b> node and the text inside it.

I have also tried:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xml = simplexml_import_dom($doc);

with the same result.

Since this looked to me like a common problem when parsing HTML, I tried searching for it, but the only place that acknowledges the problem is this blog post: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/ and it does not offer a solution.

The posts and answers about parsing HTML on SO are just too general.

Is there a simple way of dealing with this? Or should I change my strategy?

Recommended answer

Your observation is correct: SimpleXML only offers the child element nodes here, not the child text nodes. The solution is to switch to DOMDocument, which can access all nodes there: text nodes as well as element children.

    // first span element
    $span = dom_import_simplexml($xml->span);

    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue);
    }

This example applies dom_import_simplexml to the more specific <span> element node and then traverses the children of the corresponding DOMElement object.

Output:

     - DOMText : Nista; nula. Isto

     - DOMElement : zilch; zip.
     - DOMText : 

The first entry is the first text node within the <span> element. It is followed by the <b> element (which in turn contains some text) and then by another text node that consists of whitespace only.
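
For dictionary-style data the whitespace-only text nodes are usually noise. The following is a minimal sketch, not part of the original answer, that skips them and records whether each remaining piece is plain text or a nested element. The `$parts` array and its `type`/`text` keys are names made up for this illustration.

```php
<?php
$html = <<<HTML
<p>
    <b>
        <span>zot; zotz </span>
    </b>
    <span>Nista; nula. Isto
        <b>zilch; zip.</b>
    </span>
</p>
HTML;

$xml  = simplexml_load_string($html);
$span = dom_import_simplexml($xml->span);

$parts = []; // hypothetical result structure, for illustration only
foreach ($span->childNodes as $child) {
    // Collapse runs of whitespace (including newlines) and trim both ends.
    $text = trim(preg_replace('/\s+/', ' ', $child->nodeValue));
    if ($text === '') {
        continue; // whitespace-only text node, nothing to keep
    }
    $parts[] = [
        // Element nodes report their tag name, text nodes are marked '#text'.
        'type' => $child instanceof DOMElement ? $child->nodeName : '#text',
        'text' => $text,
    ];
}

print_r($parts);
```

Each entry of `$parts` then carries either the element name (`b` here) or `#text`, so the formatting information of the description is not lost.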

The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated access to the data within the XML document, as in the case you face here.
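
For comparison, here is a short sketch of what SimpleXML on its own can still reach in this document. It is an addition to the answer and uses a condensed one-line version of the sample markup; the variable names are arbitrary. The nested <b> element is not lost: it is simply not shown by var_dump() for an element with mixed content, and it remains reachable by name. What SimpleXML does not give you is the order in which text and elements alternate, which is why the DOM traversal is needed.

```php
<?php
$xml = simplexml_load_string(
    '<p><span>Nista; nula. Isto <b>zilch; zip.</b></span></p>'
);

// The nested <b> is reachable by property access even though
// var_dump($xml) does not list it under the <span>.
$nested = (string) $xml->span->b;

// Casting the mixed-content <span> to a string yields only its own
// direct text, without the text of the nested <b>.
$own = (string) $xml->span;

// asXML() serialises the element together with all of its children.
$markup = $xml->span->asXML();

echo $nested, "\n";
echo $own, "\n";
echo $markup, "\n";
```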

Full example:

    <?php
    $html = <<<HTML
    <p>
        <b>
            <span>zot; zotz </span>
        </b>
        <span>Nista; nula. Isto
            <b>zilch; zip.</b>
        </span>
    </p>
    HTML;

    $xml = simplexml_load_string($html);

    // first span element, converted to its DOMElement counterpart
    $span = dom_import_simplexml($xml->span);

    // walk every child node: text nodes and element nodes alike
    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue);
    }
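
If you would rather change strategy and stay in DOM for the whole document, a sketch along the following lines may be a starting point. It is not from the original answer. It assumes that every entry looks like the single sample above (the word in `p/b/span`, the description in `p/span`), which may not hold for the real file, and the `$entries` array with its `word` and `description` keys is invented for this illustration.

```php
<?php
$html = '<p><b><span>zot; zotz </span></b>'
      . '<span>Nista; nula. Isto <b>zilch; zip.</b></span></p>';

$doc = new DOMDocument();
$doc->loadHTML($html); // the fragment gets wrapped in <html><body>

$xpath   = new DOMXPath($doc);
$entries = []; // hypothetical result structure

foreach ($xpath->query('//p') as $p) {
    // Assumed layout: headword in p/b/span, description in p/span.
    $word = $xpath->query('./b/span', $p)->item(0);
    $desc = $xpath->query('./span', $p)->item(0);
    if ($word === null || $desc === null) {
        continue; // entry does not match the assumed layout
    }
    $entries[] = [
        'word'        => trim($word->textContent),
        // textContent includes the text of nested elements such as <b>.
        'description' => trim(preg_replace('/\s+/', ' ', $desc->textContent)),
    ];
}

print_r($entries);
```

Because `textContent` concatenates the text of all descendants, the nested <b> text ends up in the description without any special handling.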
