Simplexml: parsing HTML leaves out nested elements inside an element with a text node

Views: 153 | Posted: 2020/5/25 1:33:43 | Tags: php xml parsing html-parsing simplexml


Problem description

I'm trying to parse a specific HTML document, a sort of dictionary, with about 10,000 words and descriptions. It went well until I noticed that entries in one specific format don't get parsed correctly.

Here is an example:

    <?php
    $html = '
        <p>
            <b>
                <span>zot; zotz </span>
            </b>
            <span>Nista; nula. Isto
                <b>zilch; zip.</b>
            </span>
        </p>
        ';

    $xml = simplexml_load_string($html);

    var_dump($xml);
    ?>

The result of var_dump() is:

    object(SimpleXMLElement)#1 (2) {
      ["b"]=>
      object(SimpleXMLElement)#2 (1) {
        ["span"]=>
        string(10) "zot; zotz "
      }
      ["span"]=>
      string(39) "Nista; nula. Isto

            "
    }

As you can see, SimpleXML kept the text node inside the second <span> element but left out its child element and the text inside that child.

I have also tried:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xml = simplexml_import_dom($doc);

with the same result.

Since this looked to me like a common problem when parsing HTML, I tried googling it, but the only place that acknowledges the problem is this blog post, which does not offer a solution: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/

The posts and answers about parsing HTML on SO are just too general.

Is there a simple way of dealing with this? Or should I change my strategy?

Recommended answer

Your observation is correct: SimpleXML only offers the child element node here, not the child text nodes. The solution is to switch to DOMDocument, since it gives access to all child nodes, both text and element.

    // first span element
    $span = dom_import_simplexml($xml->span);

    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue);
    }

This example shows dom_import_simplexml being applied to the more specific <span> element node; the traversal is then done over the children of the corresponding DOMElement object.

Output:

     - DOMText : Nista; nula. Isto

     - DOMElement : zilch; zip.
     - DOMText :

The first entry is the first text node within the <span> element. It is followed by the <b> element (which in turn contains some text) and then by another text node that consists of whitespace only.
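If the whitespace-only text nodes are not of interest, they can be skipped during the traversal. The following is a minimal sketch, assuming the same markup as in the question. The helper name describeChildren is made up for illustration and is not part of SimpleXML or DOM.

```php
<?php
// Hypothetical helper (not a library function): lists the direct children of
// an element in document order, skipping whitespace-only text nodes.
function describeChildren(DOMElement $element): array
{
    $result = [];
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = trim($child->nodeValue);
            if ($text === '') {
                continue; // only indentation or line breaks between tags
            }
            $result[] = ['text', $text];
        } elseif ($child instanceof DOMElement) {
            // For element children, record the tag name and the contained text.
            $result[] = [$child->tagName, trim($child->textContent)];
        }
    }
    return $result;
}

$xml = simplexml_load_string(
    '<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto
        <b>zilch; zip.</b>
    </span></p>'
);

// $xml->span is the <span> that is a direct child of <p>.
$span = dom_import_simplexml($xml->span);

foreach (describeChildren($span) as [$kind, $text]) {
    printf(" - %s : %s\n", $kind, $text);
}
```

With the sample markup this should list one text entry followed by one b entry; the trailing whitespace-only node is dropped.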

The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document, as in the case you face here.
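When only the complete text of an element is needed, the imported DOMElement also exposes it through its textContent property, which concatenates the text of all descendant nodes in document order. A short sketch, assuming the markup from the question; the whitespace normalization step is an addition for illustration and not part of the original answer.

```php
<?php
$xml = simplexml_load_string(
    "<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto\n    <b>zilch; zip.</b>\n</span></p>"
);

// Import the SimpleXML element into DOM to reach text and element children alike.
$span = dom_import_simplexml($xml->span);

// textContent joins the text of the element and all of its descendants.
$fullText = $span->textContent;

// Collapse the runs of whitespace left over from the source indentation.
$normalized = trim(preg_replace('/\s+/', ' ', $fullText));

echo $normalized, "\n";
```

For the sample entry this prints "Nista; nula. Isto zilch; zip.", that is, the description including the text of the nested <b> element that var_dump() on the SimpleXMLElement did not show.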

Full example:

    $html = <<<HTML
    <p>
        <b>
            <span>zot; zotz </span>
        </b>
        <span>Nista; nula. Isto
            <b>zilch; zip.</b>
        </span>
    </p>
    HTML;

    $xml = simplexml_load_string($html);

    // first span element
    $span = dom_import_simplexml($xml->span);

    foreach ($span->childNodes as $child) {
        printf(" - %s : %s\n", get_class($child), $child->nodeValue);
    }
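For the dictionary use case in the question, it may be simpler to stay in DOM for the whole document instead of importing element by element. The sketch below is one possible approach, not the answer author's method. It assumes that every entry is a <p> whose headword sits in b/span and whose description sits in a direct child <span>; those XPath expressions and the $entries array are assumptions derived from the single sample entry.

```php
<?php
$html = '<p><b><span>zot; zotz </span></b>'
      . '<span>Nista; nula. Isto <b>zilch; zip.</b></span></p>';

// Collect parser warnings internally; real-world HTML is rarely clean.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // wraps the fragment in <html><body>
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$entries = [];

foreach ($xpath->query('//p') as $p) {
    // Assumed layout: headword in p/b/span, description in p/span.
    $word = $xpath->query('b/span', $p)->item(0);
    $desc = $xpath->query('span', $p)->item(0);
    if ($word === null || $desc === null) {
        continue; // entry does not match the assumed layout
    }
    $entries[] = [
        'word'        => trim($word->textContent),
        // textContent includes the text of nested elements such as <b>.
        'description' => trim(preg_replace('/\s+/', ' ', $desc->textContent)),
    ];
}

print_r($entries);
```

Entries that do not match the assumed layout are skipped rather than guessed at, so they can be inspected separately.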
