Remove all empty/unnecessary nodes from HTML


Question

What would be the preferred way to remove all empty and unnecessary nodes? For example:

<p></p> should be removed, and <font><p><span><br></span></p></font> should also be removed (so the br tag is considered unnecessary in this case).

Will I have to use some sort of recursive function for this? I'm thinking of something along these lines:

void RemoveEmptyNodes(HtmlNode containerNode)
{
    var nodes = containerNode.DescendantsAndSelf().ToList();

    foreach (HtmlNode node in nodes)
    {
        if (node.InnerText == null || node.InnerText == "")
        {
            // Recursing into the parent before removing this node means the
            // parent's descendants still include it, so the call never ends.
            RemoveEmptyNodes(node.ParentNode);
            node.Remove();
        }
    }
}



But that obviously doesn't work (it throws a StackOverflowException).

Recommended answer

Add the names of any tags that should never be removed to the _notToRemove list. Nodes that have attributes (e.g. images) are also kept, because a node is only removed when containerNode.Attributes.Count == 0.

using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    // Tag names that must never be removed, even when they contain no text.
    static List<string> _notToRemove;

    static void Main(string[] args)
    {
        _notToRemove = new List<string>();
        _notToRemove.Add("br");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml("<html><head></head><body><p>test</p><br><font><p><span></span></p></font></body></html>");
        RemoveEmptyNodes(doc.DocumentNode);
    }

    static void RemoveEmptyNodes(HtmlNode containerNode)
    {
        // Remove a node only if it has no attributes, is not on the keep-list, and has no text.
        if (containerNode.Attributes.Count == 0 && !_notToRemove.Contains(containerNode.Name) && string.IsNullOrEmpty(containerNode.InnerText))
        {
            containerNode.Remove();
        }
        else
        {
            // Walk backwards so that removing a child does not shift the indexes still to be visited.
            for (int i = containerNode.ChildNodes.Count - 1; i >= 0; i--)
            {
                RemoveEmptyNodes(containerNode.ChildNodes[i]);
            }
        }
    }
}
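Because the answer's check runs top-down on InnerText, a container with no attributes whose only content is an attribute-bearing child (such as a div wrapping an image) appears to be removed together with that child, and whitespace-only text counts as content. The following is a minimal sketch of a bottom-up variant, not the answer's method: children are pruned first, and an element is kept only if something survives inside it. The names HtmlCleaner, Prune, and KeepTags are hypothetical, and the contents of the keep-list are an assumption; br is deliberately left off it so that the example from the question is removed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HtmlCleaner
{
    // Hypothetical keep-list: tags treated as meaningful even with no content.
    static readonly HashSet<string> KeepTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "img", "hr", "input" };

    // Prunes the subtree under 'node' and returns true if 'node' itself should stay.
    public static bool Prune(HtmlNode node)
    {
        // Children first; iterate backwards because Remove() shrinks ChildNodes.
        for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
        {
            HtmlNode child = node.ChildNodes[i];
            if (!Prune(child))
            {
                child.Remove();
            }
        }

        if (node.NodeType == HtmlNodeType.Text)
        {
            // Assumption: whitespace-only text is unnecessary.
            return !string.IsNullOrWhiteSpace(node.InnerText);
        }

        if (node.NodeType != HtmlNodeType.Element)
        {
            // The document root and comments are left alone in this sketch.
            return true;
        }

        // An element stays if it is on the keep-list, has attributes,
        // or still has children after pruning.
        return KeepTags.Contains(node.Name) || node.HasAttributes || node.HasChildNodes;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<body><p>test</p><p> </p><font><p><span><br></span></p></font><div><img src=\"a.png\"></div></body>");
        Prune(doc.DocumentNode);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

The caller never removes the node it passes in, so the document root is safe, and each node is visited once, which avoids the unbounded recursion in the question's attempt.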
