How to fix ill-formed HTML with HTML Agility Pack?

Question

I have this ill-formed HTML with overlapping tags:

<p>word1<b>word2</p>
<p>word3</b>word4</p>

The overlaps can also be nested.

How can I convert this to well-formed HTML using HTML Agility Pack (HAP)?

I am looking for this output:

<p>word1<b>word2</b></p>
<p><b>word3</b>word4</p>

I tried HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap, but it does not work as expected.

Recommended answer

It is in fact working as designed, though perhaps not as you expected. Anyway, here is a sample Console application that demonstrates how you can achieve some HTML fixing using the library.

The library exposes a ParseErrors collection on HtmlDocument that you can use to determine which errors were detected during markup parsing.
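
As a minimal sketch of inspecting that collection, the following lists every error reported for the markup from the question. It assumes the HtmlAgilityPack NuGet package is referenced; the class name ParseErrorsDemo is only for illustration, and the exact errors reported may differ between library versions.

```csharp
using System;
using HtmlAgilityPack;

class ParseErrorsDemo
{
    static void Main()
    {
        // The overlapping markup from the question.
        const string html = "<p>word1<b>word2</p>\n<p>word3</b>word4</p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Each HtmlParseError records what went wrong and where in the stream.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine(
                $"{error.Code} at line {error.Line}, position {error.StreamPosition}: " +
                $"'{error.SourceText}' ({error.Reason})");
        }
    }
}
```

Printing the errors first is a cheap way to see which codes (such as TagNotOpened) you actually need to handle before writing any repair logic.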

There are really two types of problems here:

1) Unclosed elements. The library fixes these by default, but an option set on the P element prevents that in this case.

2) Unopened elements. This one is more complex, because it depends on how you want to fix it: where do you want the tag to be opened? In the following sample, I've used the nearest preceding text node (by stream position) to open the element.
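
Before the full sample, the option mentioned in item 1 can be inspected on its own. This is a small sketch, assuming the HtmlAgilityPack package is referenced; whether an entry for p is registered by default depends on the library version, so the code checks rather than assumes. The class name ElementFlagsDemo is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class ElementFlagsDemo
{
    static void Main()
    {
        // HtmlNode.ElementsFlags is a static dictionary keyed by element name.
        if (HtmlNode.ElementsFlags.TryGetValue("p", out HtmlElementFlag flags))
        {
            Console.WriteLine($"p flags: {flags}");

            // Removing the entry makes the parser treat <p> like an ordinary element.
            HtmlNode.ElementsFlags.Remove("p");
        }
        else
        {
            Console.WriteLine("No special flags registered for p.");
        }
    }
}
```

Because the dictionary is static, removing the entry affects every document parsed afterwards in the same process, not just the next one.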

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static void Main(string[] args)
{
    // clear the flags on P so unclosed elements in P will be auto closed.
    HtmlNode.ElementsFlags.Remove("p");

    // load the document
    HtmlDocument doc = new HtmlDocument();
    doc.Load("yourTestFile.htm");

    // build a list of nodes ordered by stream position
    NodePositions pos = new NodePositions(doc);

    // browse all tags detected as not opened
    foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
    {
        // find the text node just before this error
        HtmlTextNode last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
        if (last != null)
        {
            // fix the text; reintroduce the broken tag
            last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
        }
    }

    doc.Save(Console.Out);
}

public class NodePositions
{
    public NodePositions(HtmlDocument doc)
    {
        AddNode(doc.DocumentNode);
        Nodes.Sort(new NodePositionComparer());
    }

    private void AddNode(HtmlNode node)
    {
        Nodes.Add(node);
        foreach (HtmlNode child in node.ChildNodes)
        {
            AddNode(child);
        }
    }

    private class NodePositionComparer : IComparer<HtmlNode>
    {
        public int Compare(HtmlNode x, HtmlNode y)
        {
            return x.StreamPosition.CompareTo(y.StreamPosition);
        }
    }

    public List<HtmlNode> Nodes = new List<HtmlNode>();
}
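
The NodePositions class above only collects every node and sorts by stream position. As an alternative sketch, the same list could be built with LINQ, assuming a version of the library that provides DescendantsAndSelf(); the helper name NodeOrdering.InStreamOrder is hypothetical and not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class NodeOrdering
{
    // Hypothetical helper: returns the document node and all of its
    // descendants (elements, text, comments), ordered by stream position.
    public static List<HtmlNode> InStreamOrder(HtmlDocument doc)
    {
        return doc.DocumentNode
            .DescendantsAndSelf()
            .OrderBy(n => n.StreamPosition)
            .ToList();
    }
}
```

With this helper, pos.Nodes in the sample would become NodeOrdering.InStreamOrder(doc). One difference worth noting is that OrderBy is a stable sort while List<T>.Sort is not, which only matters if two nodes report the same stream position.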
