XPath to recursively remove empty DOM nodes?


Question

I am trying to find a way to clean up a bunch of empty DOM elements from an HTML source like this:

<div class="empty">
    <div>&nbsp;</div>
    <div></div>
</div>
<a href="http://example.com">good</a>
<div>
    <p></p>
</div>
<br>
<img src="http://example.com/logo.png" />
<div></div>

However, I don't want to harm valid elements or line breaks, so the result should be something like this:

<a href="http://example.com">good</a>
<br>
<img src="http://example.com/logo.png" />

So far I have tried some XPaths like this:

$xpath = new DOMXPath($dom);

//$x = '//*[not(*) and not(normalize-space(.))]';
//$x = '//*[not(text() or node() or self::br)]';
//$x = 'not(normalize-space(.) or self::br)';
$x = '//*[not(text() or node() or self::br)]';

while(($nodeList = $xpath->query($x)) && $nodeList->length > 0) {
    foreach ($nodeList as $node) {
        $node->parentNode->removeChild($node);
    }
}

Can someone show me the correct XPath to remove empty DOM nodes that serve no purpose when empty? (img, br, and input serve a purpose even when empty.)

Current output:

<div>
    <div>&nbsp;</div>

</div>
<a href="http://example.com">good</a>
<div>

</div>
<br>



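Update

To clarify, I am looking for an XPath query that either:

  • Recursively matches empty nodes until all of them are found (including the parents of empty nodes)

  • Can be run successfully multiple times, once after each cleanup (as in my example)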

Recommended answer

I. Initial solution:

XPath is a query language for XML documents. As such, the evaluation of an XPath expression only selects nodes or extracts non-node information from the XML document, but never alters the XML document. Thus evaluating an XPath expression never deletes or inserts nodes: the XML document remains the same.

What you want is to clean up a bunch of empty DOM elements, which XPath alone cannot do.

This is confirmed by the most credible and the only official (we say normative) source on XPath, the W3C XPath 1.0 Recommendation:


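"The primary purpose of XPath is to address parts of an XML [XML] document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document."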

Therefore, some other language must be used in conjunction with XPath to implement the required functionality.
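As a minimal sketch of that division of labour, assuming PHP's DOM extension as the host language (the question's own code uses DOMXPath) and a made-up two-element sample document: evaluating the expression leaves the document unchanged, and the removal is a separate DOM call made by the host code.

```php
<?php
// Illustrative sample document; not taken from the question.
$dom = new DOMDocument();
$dom->loadXML('<root><p></p><a>good</a></root>');
$xpath = new DOMXPath($dom);

// Evaluating an XPath expression only selects nodes.
$before = $dom->saveXML($dom->documentElement);
$empty  = $xpath->query('//*[not(node())]'); // elements with no child nodes
$after  = $dom->saveXML($dom->documentElement);
var_dump($before === $after); // bool(true): the query changed nothing

// The deletion itself is done by the host language through the DOM API.
foreach (iterator_to_array($empty) as $node) {
    $node->parentNode->removeChild($node);
}
$cleaned = $dom->saveXML($dom->documentElement);
echo $cleaned, "\n"; // <root><a>good</a></root>
```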

XSLT is a language designed specifically for XML transformation.

Here is an XSLT-based example: a short and simple XSLT transformation that performs the requested cleanup:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
 "*[not(string(translate(., '&#xA0;', '')))
  and
    not(descendant-or-self::*
          [self::img or self::input or self::br])]"/>
</xsl:stylesheet>

When applied to the provided XML (corrected to become a well-formed XML document):

<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>

the wanted, correct result is produced:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>

Explanation:


  1. The identity rule copies "as-is" every node for which it is selected for execution.

  2. There is a single template that overrides the identity template for any element that is not, and does not contain, an img, input or br, and whose string value, after any &nbsp; characters have been removed, is the empty string. The body of this template is empty, which effectively "deletes" the matched element: the matched element isn't copied to the output.






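II. Update

The OP clarifies that he wants one or more XPath expressions that:


can be run successfully multiple times, once after each cleanup.

Interestingly enough, there is a single XPath expression that selects exactly the nodes that need to be deleted, so the "multiple cleanups" are avoided entirely: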

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
    ]
     [not(ancestor::*
             [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                         and
                           not(descendant-or-self::*
                                  [self::img or self::input or self::br])
                          ]
                    )
             =
              count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                      and
                        not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                        ]
                   )
              ]
          )
     ]

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
   "//*[not(normalize-space((translate(., '&#xA0;', ''))))
      and
        not(descendant-or-self::*[self::img or self::input or self::br])
       ]
        [not(ancestor::*
               [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                           and
                             not(descendant-or-self::*
                                    [self::img or self::input or self::br])
                             ]
                      )
               =
                count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                        and
                          not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                          ]
                      )
               ]
            )
        ]
 "/>
</xsl:stylesheet>

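When this transformation is applied to the provided (and made well-formed) XML document (above), all nodes are copied as-is except the nodes selected by the XPath expression: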

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>
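The same expression can also be evaluated directly from the host language. The following is a sketch under stated assumptions rather than part of the original answer: it uses PHP's DOMXPath, the variable names `$nbsp`, `$empty` and `$top` are illustrative, and the no-break space is written as a real character because `&#xA0;` is only decoded when the expression sits inside an XML file such as a stylesheet.

```php
<?php
// Sketch only: applies the "top empty nodes" expression with DOMXPath.
$nbsp = "\u{A0}"; // a real no-break space (PHP 7+ escape syntax)

// Every "empty" element that is not, and does not contain, img/input/br.
$empty = "//*[not(normalize-space(translate(., '{$nbsp}', '')))"
       . " and not(descendant-or-self::*[self::img or self::input or self::br])]";

// Keep only those that have no ancestor in the same set (XPath 1.0 idiom).
$top = $empty . "[not(ancestor::*[count(. | {$empty}) = count({$empty})])]";

$xml = <<<'XML'
<html>
    <div class="empty">
        <div>&#xA0;</div>
        <div></div>
    </div>
    <a href="http://example.com">good</a>
    <div>
        <p></p>
    </div>
    <br />
    <img src="http://example.com/logo.png" />
    <div></div>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Copy the result into an array first, then detach each selected element.
$selected = iterator_to_array($xpath->query($top));
foreach ($selected as $node) {
    $node->parentNode->removeChild($node);
}

// Element names that survive the single pass.
$remaining = [];
foreach ($xpath->query('//*') as $element) {
    $remaining[] = $element->nodeName;
}
echo implode(', ', $remaining), "\n"; // prints: html, a, br, img
```

Only three elements are selected here (the two outer div wrappers and the trailing div), so each removal acts on a node that is still attached to the document. Whitespace-only text nodes between the removed elements are left in place; the expression targets elements only.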

Explanation:

Let us denote with $vAllEmpty all the nodes that are "empty" according to the definition of "empty" in the question.

$vAllEmpty is expressed with the following XPath expression:

   //*[not(normalize-space((translate(., '&#xA0;', ''))))
     and
       not(descendant-or-self::*
             [self::img or self::input or self::br])

      ]

For all of these to be deleted, we need to delete just the "top nodes" from $vAllEmpty.

Let us denote the set of all such "top nodes" as $vTopEmpty.

$vTopEmpty can be expressed with the following XPath 2.0 expression:

$vAllEmpty[not(ancestor::* intersect $vAllEmpty)]

This selects those nodes from $vAllEmpty that don't have any ancestor element that is also in $vAllEmpty.

The last XPath expression has an equivalent XPath 1.0 expression:

$vAllEmpty[not(ancestor::*[count(.|$vAllEmpty) = count($vAllEmpty)])]

Now we replace $vAllEmpty in the last expression with its expanded XPath expression as defined above. This gives the final expression, which selects only the "top nodes to delete":

//*[not(normalize-space((translate(., '&#xA0;', ''))))
  and
    not(descendant-or-self::*[self::img or self::input or self::br])
    ]
     [not(ancestor::*
             [count(.| //*[not(normalize-space((translate(., '&#xA0;', ''))))
                         and
                           not(descendant-or-self::*
                                  [self::img or self::input or self::br])
                          ]
                    )
             =
              count(//*[not(normalize-space((translate(., '&#xA0;', ''))))
                      and
                        not(descendant-or-self::*
                                 [self::img or self::input or self::br])
                        ]
                   )
              ]
          )
     ]

A short XSLT 2.0-based verification using variables:

<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
     <xsl:strip-space elements="*"/>

     <xsl:variable name="vAllEmpty" select=
      "//*[not(normalize-space((translate(., '&#xA0;', ''))))
         and
           not(descendant-or-self::*
                 [self::img or self::input or self::br])

          ]"/>

     <xsl:variable name="vTopEmpty" select=
     "$vAllEmpty[not(ancestor::* intersect $vAllEmpty)]"/>

     <xsl:template match="node()|@*">
      <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
     </xsl:template>

     <xsl:template match="*[. intersect $vTopEmpty]"/>
</xsl:stylesheet>

This transformation copies every node "as-is" with the exception of any node that belongs to $vTopEmpty. The result is the correct and expected one:

<html>
   <a href="http://example.com">good</a>
   <br/>
   <img src="http://example.com/logo.png"/>
</html>






III. Alternative solution (may require "multiple cleanups")

An alternative approach is not to attempt to specify the nodes to delete, but to specify the nodes to keep. The nodes to delete are then the set difference between all nodes and the nodes to keep.

The nodes to keep are selected by this XPath expression:

  //node()
    [self::input or self::img or self::br
    or
     self::text()[normalize-space(translate(.,'&#xA0;',''))]
    ]
     /ancestor-or-self::node()

Then the nodes to delete are:

  //node()
     [not(count(.
              |
                //node() 
                   [self::input or self::img or self::br
                  or
                    self::text()[normalize-space(translate(.,'&#xA0;',''))]
                   ]
                    /ancestor-or-self::node()
                )
        =
         count(//node()
                  [self::input or self::img or self::br
                 or
                   self::text()[normalize-space(translate(.,'&#xA0;',''))]
                  ]
                   /ancestor-or-self::node()
               )
         )
     ]

However, do note that these are all the nodes to delete and not only the "top nodes to delete". It is possible to express only the "top nodes to delete", but the resulting expression is rather complicated. If one attempts to delete every node in this set, there will be errors, because the descendants of the "top nodes to delete" follow them in document order and so are processed after their ancestors have already been removed.
