SimpleXML: parsing HTML leaves out nested elements inside an element with a text node
Question
I'm trying to parse a specific HTML document, a sort of dictionary, with about 10,000 words and descriptions. It went well until I noticed that entries in one particular format don't get parsed correctly.
Here is an example:
<?php
$html = '
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
';
$xml = simplexml_load_string($html);
var_dump($xml);
?>
The result of var_dump() is:
object(SimpleXMLElement)#1 (2) {
["b"]=>
object(SimpleXMLElement)#2 (1) {
["span"]=>
string(10) "zot; zotz "
}
["span"]=>
string(39) "Nista; nula. Isto
"
}
As you can see, SimpleXML kept the text node inside the second <span> element but left out its child <b> node and the text inside it.
I also tried:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
This gives the same result.
It looked to me like a common problem when parsing HTML, so I tried googling it, but the only place that acknowledges the problem is this blog post: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/ and it does not offer a solution.
The posts and answers about parsing HTML on SO are too general to help here.
Is there a simple way of dealing with this? Or should I change my strategy?
Recommended answer
Your observation is correct: SimpleXML only offers the child element nodes here, not the child text nodes. The solution is to switch to DOMDocument, since it can access all nodes there, both text and element children.
// first span element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}
This example shows dom_import_simplexml being used on the more specific <span> element node; the traversal is then done over the children of the corresponding DOMElement object.
Output:
- DOMText : Nista; nula. Isto
- DOMElement : zilch; zip.
- DOMText :
The first entry is the first text node within the <span> element. It is followed by the <b> element (which in turn contains some text) and then by another text node that consists of whitespace only.
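If the whitespace-only text node is not wanted, it can be skipped during the traversal. The following is only a minimal sketch, assuming the same fragment as in the question (condensed here onto fewer lines) and a hypothetical $parts array to collect the results; it is not part of the original answer.

```php
<?php
// Sketch: walk the child nodes of the <span> and skip text nodes
// that consist of whitespace only. $parts is an illustrative name.
$xml = simplexml_load_string(
    "<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto\n<b>zilch; zip.</b>\n</span></p>"
);

// $xml->span is the <span> that is a direct child of <p>
$span = dom_import_simplexml($xml->span);

$parts = [];
foreach ($span->childNodes as $child) {
    if ($child instanceof DOMText && trim($child->nodeValue) === '') {
        continue; // whitespace-only text node
    }
    $parts[] = [get_class($child), trim($child->nodeValue)];
}

foreach ($parts as [$class, $value]) {
    printf(" - %s : %s\n", $class, $value);
}
```

Trimming the values is a choice made for readability of the listing; drop the trim() call on nodeValue if the exact whitespace matters for the dictionary entries.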
The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document, as in the case you face here.
Full example:
$html = <<<HTML
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
HTML;
$xml = simplexml_load_string($html);
// first span element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}
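Since the input is HTML rather than guaranteed well-formed XML, one possible variation is to stay in DOM from the start and select the mixed content with DOMXPath. This is a hedged sketch, not the answer's own method: the XPath expression, the condensed fragment, and the $nodes array are assumptions made for illustration.

```php
<?php
// Sketch: parse the fragment as HTML and select the <span>'s child nodes
// (text and elements) with XPath, leaving out whitespace-only text nodes.
$html = "<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto\n<b>zilch; zip.</b>\n</span></p>";

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// node() matches text nodes as well as elements; the predicate keeps
// every non-text node and only those text nodes that are not blank.
$query = '//p/span/node()[not(self::text()) or normalize-space()]';

$nodes = [];
foreach ($xpath->query($query) as $node) {
    $nodes[] = [get_class($node), trim($node->nodeValue)];
}

foreach ($nodes as [$class, $value]) {
    printf(" - %s : %s\n", $class, $value);
}
```

Here //p/span matches only the <span> that is a direct child of <p>, so the <span> nested inside the first <b> is not visited.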