How to clean up poorly formed HTML using HTML Agility Pack
Question
I am attempting to replace a god-awful collection of regular expressions that is currently used to clean up blocks of poorly formed HTML, and I stumbled upon the HTML Agility Pack for C#. It looks very powerful, yet I couldn't find an example of the one thing I want to do with it, which I would have expected to be built in. I'm probably just missing the right method in the documentation.
Let me explain. Say I have the following HTML:
<p class="someclass">
    <font size="3">
        <font face="Times New Roman">
            this is some text
            <a href="somepage.html">Some link</a>
        </font>
    </font>
</p>
...that I want to look like:
<p>
    this is some text
    <a href="somepage.html">Some link</a>
</p>
When I use the HtmlNode.Remove() method, it removes the node plus all its children. Is there a way to remove the node while preserving its children?
Recommended answer
On HtmlNode, the RemoveChild method has this overload:
public HtmlNode RemoveChild(HtmlNode oldChild, bool keepGrandChildren);
So this is how you would do it:
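HtmlDocument doc = new HtmlDocument();
doc.Load("yourfile.htm");
// SelectNodes returns null when the XPath matches nothing, so guard before iterating.
HtmlNodeCollection fonts = doc.DocumentNode.SelectNodes("//font");
if (fonts != null)
{
    foreach (HtmlNode font in fonts)
    {
        // Remove the <font> element but keep its children in place.
        font.ParentNode.RemoveChild(font, true);
    }
}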
Edit: It looks like RemoveChild with the keepGrandChildren option is not working as expected, so here is an alternate implementation:
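// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
// Both members below are meant to live inside one of your own classes.
public static HtmlNode RemoveChild(HtmlNode parent, HtmlNode oldChild, bool keepGrandChildren)
{
    if (oldChild == null)
        throw new ArgumentNullException("oldChild");

    if (oldChild.HasChildNodes && keepGrandChildren)
    {
        HtmlNode prev = oldChild.PreviousSibling;

        // Copy the children first, because inserting them elsewhere changes the tree.
        List<HtmlNode> nodes = new List<HtmlNode>(oldChild.ChildNodes.Cast<HtmlNode>());

        // Sort last-to-first: each node is inserted right after the same 'prev',
        // so inserting in reverse order restores the original order.
        nodes.Sort(new StreamPositionComparer());
        foreach (HtmlNode grandchild in nodes)
        {
            parent.InsertAfter(grandchild, prev);
        }
    }

    parent.RemoveChild(oldChild);
    return oldChild;
}

// This helper class sorts nodes by their position in the file, in descending order.
private class StreamPositionComparer : IComparer<HtmlNode>
{
    int IComparer<HtmlNode>.Compare(HtmlNode x, HtmlNode y)
    {
        return y.StreamPosition.CompareTo(x.StreamPosition);
    }
}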
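For completeness, here is a minimal self-contained sketch that applies the same idea to the HTML from the question. It is not the answer's implementation: instead of sorting and calling InsertAfter, it moves each child in front of the <font> element with InsertBefore, which keeps the original order without a comparer. The names FontStripper, Clean, and Unwrap are hypothetical, and the step that strips the class attribute from <p> is an assumption based on the desired output shown in the question. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HTML Agility Pack.
public static class FontStripper
{
    // Unwraps every <font> element and drops the class attribute on <p> elements.
    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection fonts = doc.DocumentNode.SelectNodes("//font");
        if (fonts != null)
        {
            foreach (HtmlNode font in fonts)
            {
                Unwrap(font);
            }
        }

        // Assumption: the question's target markup has no class on <p>.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p[@class]");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                p.Attributes.Remove("class");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    // Replaces a node with its own children, preserving their order.
    private static void Unwrap(HtmlNode node)
    {
        HtmlNode parent = node.ParentNode;

        // Snapshot the children, since moving them modifies the tree.
        foreach (HtmlNode child in node.ChildNodes.ToList())
        {
            parent.InsertBefore(child, node);
        }

        parent.RemoveChild(node);
    }
}
```

Calling FontStripper.Clean on the question's markup should return the paragraph with both <font> wrappers gone and the text and link left in place. Whitespace-only text nodes from the original indentation are kept as-is, so the result may contain extra blank lines.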