DOMDocument in PHP

Question

I have just started reading documentation and examples about DOM, in order to crawl and parse the document.

For example, I have the part of a document shown below:

    <div id="showContent">
    <table>
    <tr>
        <td>
         Crap
        </td>
    </tr>
<tr>
          <td width="172" valign="top"><a href="link"><img height="91" border="0" width="172" class="" src="img"></a></td>
          <td width="10">&nbsp;</td>
          <td valign="top"><table cellspacing="0" cellpadding="0" border="0">
              <tbody><tr>
                <td height="30"><a class="px11" href="link">title</a><a><br>
                    <span class="px10"></span>
                </a></td>
              </tr>
              <tr>
                <td><img height="1" width="580" src="crap"></td>
              </tr>
              <tr>
                <td align="right">
                    <a href="link"><img height="16" border="0" width="65" src="/buy"></a>
                </td>
              </tr>
              <tr>
                <td valign="top" class="px10">
                    <p style="width: 500px;">description.</p>
                </td>
              </tr>
          </tbody></table></td>
        </tr>
    <tr>
        <td>
Crap
        </td>
    </tr>
    <tr>
        <td>
         Crap
        </td>
    </tr>
    </table>
    </div>

I'm trying to use the following code to get all the tr tags and analyze whether there is crap or information inside them:

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);


$tags = $xpath->query('.//div[@id="showContent"]');
foreach ($tags as $tag) {
    $string="";
    $string=trim($tag->nodeValue);
    if(strlen($string)>3) {
        echo $string;
        echo '<br>';
    }
}

However, I'm just getting a stripped string without the tags, for example:

Crap

Crap
Title
Description

But I would like to get:

<tr>
   <td>Crap</td>
</tr>
<tr>
   <a href="link">title</a>
</tr>

How can I keep the HTML nodes (tags)?

Recommended answer

If you want to work with DOM, you have to understand the concept. Everything in a DOM Document, including the DOMDocument, is a Node.

The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. That root node can have child nodes and all these child nodes can have child nodes on their own. Basically everything in a DOMDocument is a node type of some sort, be it elements, attributes or text content.

          HTML                               Legend: 
         /    \                              UPPERCASE = DOMElement
       HEAD  BODY                            lowercase = DOMAttr
      /          \                           "Quoted"  = DOMText
    TITLE        DIV - class - "header"
     |             \
"The Title"        H1
                    |
           "Welcome to Nodeville"

The diagram above shows a DOMDocument with some nodes. There is a root element (HTML) with two children (HEAD and BODY). The connecting lines are called axes. If you follow down the axis to the TITLE element, you will see that it has one DOMText leaf. This is important because it illustrates an often overlooked thing:

<title>The Title</title>

is not one, but two nodes. A DOMElement with a DOMText child. Likewise, this

<div class="header">

is really three nodes: the DOMElement with a DOMAttr holding a DOMText. Because all these inherit their properties and methods from DOMNode, it is essential to familiarize yourself with the DOMNode class.
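The node counts above can be checked directly. The following is a minimal sketch, assuming a small made-up document that mirrors the diagram; the variable names are illustrative only.

```php
<?php
// Hypothetical sample markup mirroring the diagram above.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>The Title</title></head>'
    . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>'
);

// <title>The Title</title> is a DOMElement with one DOMText child.
$title = $dom->getElementsByTagName('title')->item(0);
$text  = $title->firstChild;
echo get_class($title), "\n";   // DOMElement
echo get_class($text), "\n";    // DOMText
echo $text->nodeValue, "\n";    // The Title

// The class attribute of the DIV is a node of its own (DOMAttr).
$div  = $dom->getElementsByTagName('div')->item(0);
$attr = $div->getAttributeNode('class');
echo get_class($attr), "\n";    // DOMAttr
echo $attr->value, "\n";        // header

// Check whether the attribute node in turn holds a text node.
var_dump($attr->firstChild instanceof DOMText);
```

All three classes extend DOMNode, which is why properties such as nodeValue, firstChild and parentNode are available on each of them.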

In practice, this means the DIV you fetched is linked to all the other nodes in the document. You could go all the way to the root element or down to the leaves at any time. It's all there. You just have to query or traverse the document for the wanted information.
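As a rough illustration of moving up and down from a fetched node, here is a sketch that assumes a trimmed-down version of the question's markup. The helper textLeaves() is a hypothetical name introduced for this example, not part of the DOM extension.

```php
<?php
// Hypothetical helper: collect the non-empty text leaves below a node.
function textLeaves(DOMNode $node): array
{
    if ($node instanceof DOMText) {
        $text = trim($node->nodeValue);
        return $text === '' ? [] : [$text];
    }
    $leaves = [];
    foreach ($node->childNodes as $child) {
        $leaves = array_merge($leaves, textLeaves($child));
    }
    return $leaves;
}

// Simplified sample markup, assumed for illustration.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><div id="showContent"><table>'
    . '<tr><td>Crap</td></tr>'
    . '</table></div></body></html>'
);

$div = $dom->getElementById('showContent');

// Walk up: follow parentNode until we leave the element tree.
$path = [];
for ($node = $div; $node instanceof DOMElement; $node = $node->parentNode) {
    $path[] = $node->nodeName;
}
echo implode(' < ', $path), "\n"; // div < body < html

// Walk down: gather the text leaves underneath the DIV.
print_r(textLeaves($div));
```

The upward loop stops at the root element because its parent is the DOMDocument itself, which is a node but not a DOMElement.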

Whether you do that by iterating the childNodes of the DIV or use getElementsByTagName() or XPath is up to you. You just have to understand that you are not working with raw HTML, but with nodes representing that entire HTML document.
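Because each row is a node, its markup can be recovered by serializing the node instead of reading nodeValue. The sketch below is one possible approach, assuming a shortened version of the question's table and PHP 5.3.6 or later, where saveHTML() accepts a node argument.

```php
<?php
// Shortened sample markup, assumed for illustration.
$html = '<div id="showContent"><table>'
      . '<tr><td>Crap</td></tr>'
      . '<tr><td><a class="px11" href="link">title</a></td></tr>'
      . '</table></div>';

// Collect parser warnings instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select every row below the DIV. Note that //tr would also match
// rows of nested tables if the document contained any.
$xpath = new DOMXPath($dom);
$rows  = $xpath->query('//div[@id="showContent"]//tr');

// Serialize each row node back to markup, tags included.
$markup = [];
foreach ($rows as $row) {
    $markup[] = $dom->saveHTML($row);
}
echo implode("\n", $markup), "\n";
```

Calling saveXML() with the node would work in the same way but produces XML-style serialization, so the choice between the two depends on the output you need.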

If you need help with extracting specific information from the document, you need to clarify what information you want to fetch from it. For instance, you could ask how to fetch all the links from the table and then we could answer something like:

$div = $dom->getElementById('showContent');
foreach ($div->getElementsByTagName('a') as $link) 
{
    echo $dom->saveXML($link);
}

But unless you are more specific, we can only guess which nodes might be relevant.

If you need more examples and code snippets on how to work with DOM, browse through my previous answers to related questions.

By now, there should be a snippet for every basic to medium use case you might have with DOM.
