Traverse the DOM tree

Question

Since most (all?) PHP libraries that do HTML sanitization, such as HTML Purifier, are heavily dependent on regex, I thought that writing an HTML sanitizer that uses DOMDocument and its related classes would be a worthwhile experiment. While I'm at a very early stage with this, the project so far shows some promise.

My idea revolves around a class that uses DOMDocument to traverse all nodes in the supplied markup, compare them to a whitelist, and remove anything not on the whitelist. (The first implementation is very basic and only removes nodes based on their type, but in the future I hope to get more sophisticated and analyse each node's attributes, whether links point to items on a different domain, etc.)

My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeLists have shown that you need to be very careful about the order in which you remove things, otherwise you might leave items behind or trigger exceptions.

If anyone has experience with manipulating a DOM tree in PHP I'd appreciate any feedback you may have on the topic.

EDIT: I've built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether the found elements are on the whitelist. If they aren't, they are removed.

The problem I was hitting was that if you delete a node, the indexes of all subsequent nodes in the DOMNodeList change. Simply working from bottom to top avoids this problem. It's still a very basic approach currently, but I think it shows promise. It certainly works a lot faster than HTML Purifier, though admittedly Purifier does a lot more stuff.

/**
 * Recursively remove elements from the DOM that aren't whitelisted
 * @param DOMNode $elem
 * @return array List of elements removed from the DOM
 * @throws Exception If removal of a node fails, an exception is thrown
 */
private function cleanNodes (DOMNode $elem)
{
    $removed    = array ();
    if (in_array ($elem -> nodeName, $this -> whiteList))
    {
        if ($elem -> hasChildNodes ())
        {
            /*
             * Iterate over the element's children. The reason we go backwards is because
             * going forwards will cause indexes to change when elements get removed
             */
            $children   = $elem -> childNodes;
            $index      = $children -> length;
            while (--$index >= 0)
            {
                $removed = array_merge ($removed, $this -> cleanNodes ($children -> item ($index)));
            }
        }
    }
    else
    {
        // The element is not on the whitelist, so remove it
        if ($elem -> parentNode -> removeChild ($elem))
        {
            $removed [] = $elem;
        }
        else
        {
            throw new Exception ('Failed to remove node from DOM');
        }
    }
    return ($removed);
}

Solution

For a start, you can have a look at this custom RecursiveDOMIterator:

Code:

class RecursiveDOMIterator implements RecursiveIterator
{
    /**
     * Current Position in DOMNodeList
     * @var Integer
     */
    protected $_position;

    /**
     * The DOMNodeList with all children to iterate over
     * @var DOMNodeList
     */
    protected $_nodeList;

    /**
     * @param DOMNode $domNode
     * @return void
     */
    public function __construct(DOMNode $domNode)
    {
        $this->_position = 0;
        $this->_nodeList = $domNode->childNodes;
    }

    /**
     * Returns the current DOMNode
     * @return DOMNode
     */
    public function current()
    {
        return $this->_nodeList->item($this->_position);
    }

    /**
     * Returns an iterator for the current iterator entry
     * @return RecursiveDOMIterator
     */
    public function getChildren()
    {
        return new self($this->current());
    }

    /**
     * Returns if an iterator can be created for the current entry.
     * @return Boolean
     */
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    /**
     * Returns the current position
     * @return Integer
     */
    public function key()
    {
        return $this->_position;
    }

    /**
     * Moves the current position to the next element.
     * @return void
     */
    public function next()
    {
        $this->_position++;
    }

    /**
     * Rewind the Iterator to the first element
     * @return void
     */
    public function rewind()
    {
        $this->_position = 0;
    }

    /**
     * Checks if current position is valid
     * @return Boolean
     */
    public function valid()
    {
        return $this->_position < $this->_nodeList->length;
    }
}

You can use that in combination with a RecursiveIteratorIterator. Usage examples are on the page.
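
As a rough illustration of that combination, the sketch below walks a small document and collects the element names in document order. The class is a condensed copy of the one above, repeated only so the snippet runs on its own; the PHP 8 return types, the sample markup and the `$names` variable are assumptions made for this example, not part of the original answer.

```php
<?php
// Condensed copy of RecursiveDOMIterator (return types added as an assumption
// to keep newer PHP versions quiet); behaviour is the same as the class above.
class RecursiveDOMIterator implements RecursiveIterator
{
    protected int $_position = 0;
    protected DOMNodeList $_nodeList;

    public function __construct(DOMNode $domNode)
    {
        $this->_nodeList = $domNode->childNodes;
    }

    public function current(): ?DOMNode { return $this->_nodeList->item($this->_position); }
    public function getChildren(): self { return new self($this->current()); }
    public function hasChildren(): bool { return $this->current()->hasChildNodes(); }
    public function key(): int { return $this->_position; }
    public function next(): void { $this->_position++; }
    public function rewind(): void { $this->_position = 0; }
    public function valid(): bool { return $this->_position < $this->_nodeList->length; }
}

// Sample markup, made up for this example.
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello <b>world</b></p><script>alert(1)</script></div>');

// SELF_FIRST visits each parent before its children (pre-order traversal).
$iterator = new RecursiveIteratorIterator(
    new RecursiveDOMIterator($dom),
    RecursiveIteratorIterator::SELF_FIRST
);

$names = [];
foreach ($iterator as $node) {
    // Skip text, doctype and other non-element nodes.
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $names[] = $node->nodeName;
    }
}

echo implode(' > ', $names), PHP_EOL;
```

This loop only reads the tree. Removing nodes inside it would change the underlying childNodes lists while they are being iterated, so a sanitizer would probably collect the offending nodes first and remove them afterwards.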

In general, though, it would be easier to use XPath to search for blacklisted nodes instead of traversing the DOM tree. Also keep in mind that DOM is already quite good at preventing XSS by automatically escaping XML entities in nodeValues.
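
A minimal sketch of the XPath approach might look like the following. The `$blacklist` array and the sample markup are hypothetical, and three tag names are nowhere near a complete sanitizing policy; the point is only to show the query-then-remove pattern.

```php
<?php
// Hypothetical blacklist, for illustration only.
$blacklist = ['script', 'iframe', 'object'];

// Sample markup, made up for this example.
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello</p><script>alert(1)</script><iframe src="x"></iframe></div>');

$xpath = new DOMXPath($dom);

// Builds the expression "//script | //iframe | //object".
$query = implode(' | ', array_map(fn ($tag) => "//$tag", $blacklist));

// Copy the result into a plain array before changing the document,
// so the removals cannot disturb the iteration.
$matches = iterator_to_array($xpath->query($query));

$removed = [];
foreach ($matches as $node) {
    $removed[] = $node->nodeName;
    $node->parentNode->removeChild($node);
}

echo $dom->saveHTML();
```

The same pattern could presumably be turned into a whitelist check by selecting every element whose name is not in the allowed set, for example with a `//*[not(self::p or self::div)]` style expression.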

The other thing you have to be aware of is that any manipulation of a DOMDocument will immediately affect any DOMNodeList you might have from XPath queries and that might lead to skipped nodes when manipulating them. See DOMNode replacement with PHP's DOM classes for an example.
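
The skipped-node effect is easiest to see with a list that is certainly live, such as the one returned by getElementsByTagName. The sketch below is an illustration under that assumption, with made-up markup and variable names: it first removes nodes while iterating the live list, then repeats the removal on a second document using a snapshot taken with iterator_to_array.

```php
<?php
$xml = '<root><b>1</b><b>2</b><b>3</b><b>4</b></root>';

// Naive version: the live list shrinks while we iterate over it,
// so the loop typically ends before every node has been visited.
$naive = new DOMDocument();
$naive->loadXML($xml);
$live = $naive->getElementsByTagName('b');
foreach ($live as $node) {
    $node->parentNode->removeChild($node);
}
$afterNaive = $live->length; // usually greater than zero: some nodes were skipped

// Safer version: snapshot the list into a plain array first,
// then remove the nodes from the document.
$safe = new DOMDocument();
$safe->loadXML($xml);
$snapshot = iterator_to_array($safe->getElementsByTagName('b'));
foreach ($snapshot as $node) {
    $node->parentNode->removeChild($node);
}
$afterSnapshot = $safe->getElementsByTagName('b')->length;

echo "left after naive loop: $afterNaive", PHP_EOL;
echo "left after snapshot loop: $afterSnapshot", PHP_EOL;
```

Iterating backwards by index, as the cleanNodes method in the question does, is another way to get the same effect without copying the list.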
