PHP DOMDocument: get text between two SETS of tags

Question

Is there a way to use XPath to parse the text between two SETS of tags? For example:

<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
  </p>
</div>

I want to parse this into an array like the one below by getting the text between the sets of SPAN tags:

array[0] = "Blah blah blah blah.";
array[1] = "Yada yada yada yada.";
array[2] = "Foo foo foo foo.";
array[3] = "Hmm hmm hmm hmm.";

Can I do this simply with DOMDocument? If not, what is the best way to achieve it? Please note that there may be other inline tags, such as <a>, in the middle of the sentences. For example:

...<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>...

Recommended answer

Update

It seems you did want a flat list, so I'm adding this specific example to avoid any confusion:

$html = '<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
  </p>
</div>';

// loadHTML() is an instance method, so create the document first.
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// Select THE TEXT NODES of all p elements with the class pp.
// Note that this matches class="pp" exactly, not "pp" appearing anywhere
// in the class list; you may need to change this depending on your markup.
// Post additional questions for specific XPath help.
$found = $finder->query('//p[@class="pp"]/text()');

$nodes = array();
// Simply transform the resulting DOMNodeList into an array
// for easier consumption/manipulation.
foreach ($found as $textNode) {
    $nodes[] = $textNode->nodeValue;
}

print_r($nodes);

This produces:

Array
(
    [0] => 

    [1] => Blah blah blah blah. 
    [2] =>  Yada 
    yada yada yada. 
    [3] => Foo foo foo foo.

    [4] => 

    [5] => Hmm hmm hmm hmm. 

)
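
The question also mentions that a sentence may contain inline tags such as <a>. With text() alone, such a sentence would be split into several entries and the link text would be lost. One possible way around that, shown here only as a sketch, is to walk the children of each p.pp and treat every span.dv as a delimiter. The function name sentencesBetweenSpans is hypothetical and not part of the original answer, and the whitespace cleanup at the end is my own assumption about the desired output.

```php
<?php
// Hypothetical helper (not from the original answer): split each p.pp into
// sentences, treating every <span class="dv"> as a delimiter.
function sentencesBetweenSpans(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sentences = array();

    foreach ($xpath->query('//p[@class="pp"]') as $p) {
        $current = null; // stays null until the first marker span is seen
        foreach ($p->childNodes as $child) {
            $isMarker = $child instanceof DOMElement
                && $child->tagName === 'span'
                && $child->getAttribute('class') === 'dv';
            if ($isMarker) {
                // A marker closes the sentence collected so far and starts a new one.
                if ($current !== null) {
                    $sentences[] = $current;
                }
                $current = '';
            } elseif ($current !== null) {
                // textContent includes text nested inside inline tags such as <a>.
                $current .= $child->textContent;
            }
        }
        if ($current !== null) {
            $sentences[] = $current;
        }
    }

    // Collapse runs of whitespace (including line breaks) and trim each entry.
    return array_map(function ($s) {
        return trim(preg_replace('/\s+/', ' ', $s));
    }, $sentences);
}

$html = '<p class="pp"><span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>Foo foo.</p>';
print_r(sentencesBetweenSpans($html));
```

Anything that appears before the first span.dv in a paragraph is ignored by this sketch, which may or may not be what you want for your markup.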


If the case is always this simple, I think you can just use XPath to get the content of the child DOMText nodes within each p.pp.

$html = '<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
  </p>
</div>';

// loadHTML() is an instance method, so create the document first.
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// Select all p elements with the class pp. Note that this matches class="pp" exactly,
// not "pp" appearing anywhere in the class list; you may need to change this depending
// on your markup. Post additional questions for specific XPath help.
$found = $finder->query('//p[@class="pp"]');

$nodes = array();

foreach($found as $p) {
    // for each p element, pull its text nodes.
    $textNodes = $finder->query('text()', $p);
    $textStr = '';
    // loop over the textNodes and concat them into a single string
    foreach ($textNodes as $n) {
        $textStr .= $n->nodeValue;
    }
    // push the compiled string onto the array
    $nodes[] = $textStr;
}

print_r($nodes);

This produces a result like:

Array
(
    [0] => 
    Blah blah blah blah.  Yada 
    yada yada yada. Foo foo foo foo.

    [1] => 
    Hmm hmm hmm hmm. 

)
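
The comments in the snippet above point out that @class="pp" only matches when the attribute is exactly "pp". If your elements carry several classes, a common XPath 1.0 idiom is to match "pp" as a whole token in the class list. The sample markup below is made up for illustration and is not from the question.

```php
<?php
// Made-up sample markup: one multi-class match, one exact match, one non-match.
$html = '<p class="pp intro">One</p><p class="pp">Two</p><p class="apps">Three</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Pad the normalized class list with spaces, then look for " pp " as a whole token,
// so "pp intro" matches but "apps" does not.
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " pp ")]';

$matched = array();
foreach ($finder->query($query) as $p) {
    $matched[] = $p->textContent;
}

print_r($matched);
```

The same predicate can be dropped into either query in this answer in place of [@class="pp"].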

If you really do want each text node separately, you just need to change the loop:

foreach ($found as $p) {
    // For each p element, pull its text nodes.
    $textNodes = $finder->query('text()', $p);
    $textArr = array();
    // Loop over the text nodes and collect each one as a separate array entry.
    foreach ($textNodes as $n) {
        $textArr[] = $n->nodeValue;
    }
    // Push the per-paragraph array onto the result array.
    $nodes[] = $textArr;
}

Which gives you:

Array
(
    [0] => Array
        (
            [0] => 

            [1] => Blah blah blah blah. 
            [2] =>  Yada 
    yada yada yada. 
            [3] => Foo foo foo foo.

        )

    [1] => Array
        (
            [0] => 

            [1] => Hmm hmm hmm hmm. 

        )

)

As you can see, it has also grabbed the line breaks and whitespace-only nodes. If those are undesirable, you can easily filter them out with your array filtering method of choice. Alternatively, you can look into the XPath and DOMDocument settings to adjust this. IIRC there are some settings dealing with how whitespace is interpreted (or not) that would probably let you avoid it, but they could have other consequences if you are doing other processing on the same DOMDocument instance.
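
As a rough sketch of the filtering mentioned above, one option is to let XPath skip the whitespace-only text nodes with a normalize-space() predicate and then tidy what remains in PHP. The preg_replace cleanup is my own assumption about the desired output, and the markup is a shortened copy of the question's example.

```php
<?php
$html = '<p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
    yada yada yada.
</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// text()[normalize-space()] keeps only text nodes that contain non-whitespace characters.
$found = $finder->query('//p[@class="pp"]/text()[normalize-space()]');

$nodes = array();
foreach ($found as $textNode) {
    // Collapse internal line breaks and runs of spaces, then trim both ends.
    $nodes[] = trim(preg_replace('/\s+/', ' ', $textNode->nodeValue));
}

print_r($nodes);
```

If you would rather keep the original query unchanged, running the resulting array through array_map('trim', ...) followed by array_filter('strlen') style filtering should give a similar result.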
