PHP's DOMXPath is stripping out my tags inside the matched text


Problem description

I asked this question yesterday ("Parse HTML with PHP's HTML DOMDocument"), and at the time the answer was just what I needed, but while working with some live data I discovered that it wasn't quite doing what I expected.

It gets the data from the HTML page, but it also strips out all the HTML tags inside the captured block of text, which isn't what I want. (I might want to take some of the tags out, but not all, and this can be done later.)

Solution

That's a common problem with DOM: you have to do a bit more work if you want to get the content of a tag together with the content of all its children.

Basically, you have to loop over the child nodes of the one you've matched with your XPath query, to get their contents.

There is a solution proposed in one of the user notes on the manual page of the DOMElement class; the note's URL appears in the code comment below.


To integrate this solution into the code you already have, start with a declaration of the HTML string that includes sub-tags:

$html = <<<HTML
<div class="main">
    <div class="text">
        <p>
            Capture this <strong>text</strong> <em>1</em>
        </p>
        <p>
            And some other <strong>text</strong>
        </p>
    </div>
</div>
HTML;


Then, to extract the data from that HTML string, you can use something like this:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

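    // see http://fr.php.net/manual/en/class.domelement.php#86803
    // Serialize each child node separately so its tags are preserved
    $children = $tag->childNodes;
    foreach ($children as $child) {
        // Import the child (deep copy) into a temporary document, then dump it as HTML
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child, true));
        $innerHTML .= $tmp_doc->saveHTML();
    }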

    var_dump(trim($innerHTML));
}

The only thing that has changed is the content of the foreach loop: instead of just using $tag->nodeValue, which returns only the text, you have to iterate over the child nodes and serialize each one.
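
As a possible alternative to the temporary-document approach, DOMDocument::saveHTML() has accepted a node argument since PHP 5.3.6, so each child can be serialized directly. The sketch below contrasts that with nodeValue. The helper name innerHtml and the shortened sample markup are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): builds the inner HTML
// of a node by serializing each of its children with saveHTML($child).
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // ownerDocument is the DOMDocument this node belongs to
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>Capture this <strong>text</strong> <em>1</em></p></div>');

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="text"]')->item(0);

$plain  = $div->nodeValue;   // text only, all tags dropped
$markup = innerHtml($div);   // markup of the children is kept

var_dump($plain, $markup);
```

This avoids creating one DOMDocument per child, but it relies on the node argument being available, so the temporary-document version remains the safer choice on very old PHP installations.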


This gives me the following output:

string '<p>
            Capture this <strong>text</strong> <em>1</em>
        </p>


<p>
            And some other <strong>text</strong>
        </p>' (length=150)

This is the full content of the <div> tag that was matched and of all its children, including the tags.
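
The question mentions removing some of the tags, but not all, at a later stage. One way this could be done on the extracted string is PHP's built-in strip_tags() with a list of allowed tags. This is a minimal sketch; the choice to keep only <p> and <strong> is an assumption for illustration.

```php
<?php
// Sample inner HTML, shortened from the output above
$innerHTML = '<p>Capture this <strong>text</strong> <em>1</em></p>';

// Keep <p> and <strong>; any other tag (here <em>) is removed,
// while the text inside the removed tag stays in place.
$filtered = strip_tags($innerHTML, '<p><strong>');

var_dump($filtered);
```

Because strip_tags() works on the string rather than the DOM, it can be applied after the loop without changing the extraction code.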


Note: there are often interesting ideas and solutions in the user notes of the manual ;-)
