PHP DOMDocument / XPath: Get HTML text and surrounding tags


Question

I am looking for the following functionality. Given this HTML page:

<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>

I want to get an array that contains only the DISTINCT text elements (no duplicates), together with an array of the tags that surround each text element.

For the HTML above, the result would be an array that looks like this:

array => 
 "Hello," surrounded by => "h1" and "body"
 "world!" surrounded by => "b", "h1" and "body"

I usually do this:

$res=$xpath->query("//body//*/text()");

This gives me the distinct text contents, but it omits the HTML tags.

When I do this instead:

$res=$xpath->query("//body//*");

I get duplicate texts, one for each tag constellation. For example, "world!" shows up three times (once for "body", once for "h1" and once for "b"), but I can't seem to find out which texts are actually duplicates. Simply checking for duplicate text is not sufficient: a duplicate is sometimes just a substring of an earlier text, and a website can also contain genuinely repeated text, which would then be wrongly discarded.

How could I solve this issue?

Thank you very much!

Thomas

Recommended answer

You can iterate over the parentNodes of the DOMText nodes:

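// $html is expected to hold the markup to parse.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = array();
// Select every text node below <body>, one entry per DOMText.
foreach($xpath->query('/html/body//text()') as $i => $textNode) {
    $textNodes[$i] = array(
        'text' => $textNode->nodeValue,
        'parents' => array()
    );
    // Walk upwards from the text node; the loop ends at the document
    // node, which is the only node in the chain without a parentNode.
    for (
        $currentNode = $textNode->parentNode;
        $currentNode->parentNode;
        $currentNode = $currentNode->parentNode
    ) {
        $textNodes[$i]['parents'][] = $currentNode->nodeName;
    }
}
print_r($textNodes);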
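The loop above climbs all the way to the root, so each 'parents' list also ends with "html". As a minimal sketch of a variant that stops at body, matching the shape asked for in the question, the following uses a helper named surroundingTags() and a $stopAt parameter; both are hypothetical names introduced here for illustration, not part of the DOM extension. Whitespace-only text nodes are skipped in PHP with trim().

```php
<?php
// Sample markup taken from the question.
$html = <<<HTML
<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>
HTML;

// Hypothetical helper: collect element names from the text node's
// parent upwards, innermost first, stopping once $stopAt is reached.
function surroundingTags(DOMNode $textNode, string $stopAt = 'body'): array
{
    $tags = array();
    for (
        $node = $textNode->parentNode;
        $node instanceof DOMElement;
        $node = $node->parentNode
    ) {
        $tags[] = $node->nodeName;
        if ($node->nodeName === $stopAt) {
            break;
        }
    }
    return $tags;
}

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = array();
foreach ($xpath->query('/html/body//text()') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text === '') {
        continue; // skip whitespace that only formats the markup
    }
    $result[] = array(
        'text' => $text,
        'parents' => surroundingTags($textNode),
    );
}

foreach ($result as $entry) {
    printf("\"%s\" surrounded by => %s\n", $entry['text'], implode(', ', $entry['parents']));
}
```

Because every DOMText node produces its own entry, text that genuinely occurs twice on a page is kept twice, and nothing is reported once per enclosing element.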


Note that loadHTML will add implied elements, e.g. it will add html and head elements, which you have to take into account when writing XPath expressions. Also note that any whitespace used for formatting is considered a DOMText, so you will likely get more nodes than you expect. If you only want to query for non-empty DOMText nodes, use:

/html/body//text()[normalize-space(.) != ""]
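As a rough illustration of what the predicate does, the sketch below runs both queries against the markup from the question and compares the node counts. The unfiltered count depends on how the parser treats formatting whitespace, so only the filtered result is relied on here; trim() is applied afterwards because the predicate selects nodes but does not change their content.

```php
<?php
// Sample markup taken from the question, including formatting whitespace.
$html = <<<HTML
<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Every text node below <body>, whitespace-only ones included.
$all = $xpath->query('/html/body//text()');

// Only text nodes that still contain something after whitespace
// normalization.
$nonEmpty = $xpath->query('/html/body//text()[normalize-space(.) != ""]');

printf("all text nodes: %d, non-empty: %d\n", $all->length, $nonEmpty->length);

// The selected nodes keep their leading and trailing whitespace,
// so trim the values before using them.
$texts = array();
foreach ($nonEmpty as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}
print_r($texts);
```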
