DOMDocument in PHP


Question

I have just started reading documentation and examples about DOM, in order to crawl and parse the document.

For example, I have the part of a document shown below:

    <div id="showContent">
    <table>
    <tr>
        <td>
         Crap
        </td>
    </tr>
<tr>
          <td width="172" valign="top"><a href="link"><img height="91" border="0" width="172" class="" src="img"></a></td>
          <td width="10">&nbsp;</td>
          <td valign="top"><table cellspacing="0" cellpadding="0" border="0">
              <tbody><tr>
                <td height="30"><a class="px11" href="link">title</a><a><br>
                    <span class="px10"></span>
                </a></td>
              </tr>
              <tr>
                <td><img height="1" width="580" src="crap"></td>
              </tr>
              <tr>
                <td align="right">
                    <a href="link"><img height="16" border="0" width="65" src="/buy"></a>
                </td>
              </tr>
              <tr>
                <td valign="top" class="px10">
                    <p style="width: 500px;">description.</p>
                </td>
              </tr>
          </tbody></table></td>
        </tr>
    <tr>
        <td>
Crap
        </td>
    </tr>
    <tr>
        <td>
         Crap
        </td>
    </tr>
    </table>
    </div>

I'm trying to use the following code to get all the tr tags and analyze whether there is crap or information inside them:

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);


$tags = $xpath->query('.//div[@id="showContent"]');
foreach ($tags as $tag) {
    $string="";
    $string=trim($tag->nodeValue);
    if(strlen($string)>3) {
        echo $string;
        echo '<br>';
    }
}

However, I'm just getting a stripped string without the tags, for example:

Crap

Crap
Title
Description

But I want to get:

<tr>
   <td>Crap</td>
</tr>
<tr>
   <a href="link">title</a>
</tr>

How do I keep the HTML nodes (tags)?

Recommended answer

If you want to work with DOM you have to understand the concept. Everything in a DOM Document, including the DOMDocument, is a Node.

The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. That root node can have child nodes, and all these child nodes can have child nodes of their own. Basically everything in a DOMDocument is a node type of some sort, be it elements, attributes or text content.

          HTML                               Legend: 
         /    \                              UPPERCASE = DOMElement
       HEAD  BODY                            lowercase = DOMAttr
      /          \                           "Quoted"  = DOMText
    TITLE        DIV - class - "header"
     |             \
"The Title"        H1
                    |
           "Welcome to Nodeville"

The diagram above shows a DOMDocument with some nodes. There is a root element (HTML) with two children (HEAD and BODY). The connecting lines are called axes. If you follow down the axis to the TITLE element, you will see that it has one DOMText leaf. This is important because it illustrates an often overlooked thing:

<title>The Title</title>

is not one, but two nodes. A DOMElement with a DOMText child. Likewise, this

<div class="header">

is really three nodes: the DOMElement with a DOMAttr holding a DOMText. Because all these inherit their properties and methods from DOMNode, it is essential to familiarize yourself with the DOMNode class.
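To make the node count concrete, here is a minimal sketch that inspects those nodes. The sample markup simply mirrors the diagram, and `describeNode()` is an illustrative helper, not part of the DOM extension.

```php
<?php
// Hypothetical sample markup mirroring the diagram above.
$html = '<html><head><title>The Title</title></head>'
      . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// <title>The Title</title>: a DOMElement with one DOMText child.
$title     = $dom->getElementsByTagName('title')->item(0);
$titleText = $title->firstChild;

// class="header": a DOMAttr on the DIV, and the attribute itself holds a DOMText.
$div       = $dom->getElementsByTagName('div')->item(0);
$classAttr = $div->getAttributeNode('class');
$attrText  = $classAttr->firstChild;

// Illustrative helper: class name, node name and text value of any node.
function describeNode(DOMNode $node): string
{
    return get_class($node) . ' (' . $node->nodeName . '): ' . trim((string) $node->nodeValue);
}

foreach ([$title, $titleText, $div, $classAttr, $attrText] as $node) {
    echo describeNode($node), PHP_EOL;
}
```

Every object printed here is a DOMNode subclass, so properties such as `nodeName`, `nodeValue`, `parentNode` and `childNodes` are available on all of them.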

In practice, this means the DIV you fetched is linked to all the other nodes in the document. You could go all the way to the root element or down to the leaves at any time. It's all there. You just have to query or traverse the document for the wanted information.
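As a rough sketch of that idea, the snippet below starts at a DIV and walks both up to the document and down to the text leaves. The sample markup and the helper names `ancestorNames()` and `textLeaves()` are assumptions made for illustration.

```php
<?php
// Hypothetical sample markup, same shape as the diagram above.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>The Title</title></head>'
    . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>'
);

$div = $dom->getElementsByTagName('div')->item(0);

// Walk upwards: follow parentNode until the document node itself is passed.
function ancestorNames(DOMNode $node): array
{
    $names = [];
    for ($current = $node; $current !== null; $current = $current->parentNode) {
        $names[] = $current->nodeName;
    }
    return $names;
}

// Walk downwards: recurse through childNodes and collect the text leaves.
function textLeaves(DOMNode $node): array
{
    $texts = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $texts[] = $child->nodeValue;
        } else {
            $texts = array_merge($texts, textLeaves($child));
        }
    }
    return $texts;
}

echo implode(' > ', ancestorNames($div)), PHP_EOL;
echo implode(', ', textLeaves($div)), PHP_EOL;
```

Nothing is copied out of the document here; both helpers only follow the links that each node already has to its parent and children.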

Whether you do that by iterating the childNodes of the DIV or use getElementsByTagName() or XPath is up to you. You just have to understand that you are not working with raw HTML, but with nodes representing that entire HTML document.
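For the rows in the question, one possible approach is to query the TR nodes and serialize each node instead of reading `nodeValue`, which only returns the text content. This is a sketch under assumptions: the markup is a trimmed-down stand-in for the question's table, `$rows` is an illustrative variable, and passing a node to `saveHTML()` requires PHP 5.3.6 or later.

```php
<?php
// Trimmed-down stand-in for the table in the question (assumed sample data).
$html = '<div id="showContent"><table>'
      . '<tr><td>Crap</td></tr>'
      . '<tr><td><a class="px11" href="link">title</a></td></tr>'
      . '</table></div>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Serialize each TR node, so the markup of the row is kept.
$rows = [];
foreach ($xpath->query('//div[@id="showContent"]//tr') as $tr) {
    $rows[] = $dom->saveHTML($tr);
}

foreach ($rows as $row) {
    echo $row, PHP_EOL;
}
```

Note that `//tr` also matches rows of nested tables, so a document like the one in the question may need a narrower expression or an extra check on each row's content.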

If you need help with extracting specific information from the document, you need to clarify what information you want to fetch from it. For instance, you could ask how to fetch all the links from the table and then we could answer something like:

// Fetch the DIV by its id, then print every link inside it with its tags.
$div = $dom->getElementById('showContent');
foreach ($div->getElementsByTagName('a') as $link) {
    echo $dom->saveXML($link);
}

But unless you are more specific, we can only guess which nodes might be relevant.

If you need more examples and code snippets on how to work with DOM, browse through my previous answers to related questions:

By now, there should be a snippet for every basic to medium use case you might have with DOM.
