How to fix ill-formed HTML with HTML Agility Pack?

Question

I have this ill-formed HTML with overlapping tags:

<p>word1<b>word2</p>
<p>word3</b>word4</p>

The overlapping can be nested, too.

How can I convert this into well-formed HTML with HTML Agility Pack (HAP)?

I am looking for this output:

<p>word1<b>word2</b></p>
<p><b>word3</b>word4</p>

I tried HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap, but it does not work as expected.

Recommended answer

It is in fact working as expected, but maybe not working as you expected. Anyway, here is a sample piece of code (a Console application) that demonstrates how you can achieve some HTML fixing using the library.

The library has a ParseErrors collection that you can use to determine which errors were detected during markup parsing.
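
To see what the parser reports before attempting any fix, the errors can simply be listed. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name ParseErrorLister and the method Describe are hypothetical names chosen for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParseErrorLister
{
    // Hypothetical helper: parses the markup and returns one line per parse error.
    public static List<string> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            // Code, Line, LinePosition, Reason and SourceText are properties of HtmlParseError.
            lines.Add($"{error.Code} at line {error.Line}, position {error.LinePosition}: {error.Reason} [{error.SourceText}]");
        }
        return lines;
    }

    public static void Main()
    {
        // The overlapping markup from the question.
        string html = "<p>word1<b>word2</p>\n<p>word3</b>word4</p>";
        foreach (string line in Describe(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Which errors appear depends on the flags currently set in HtmlNode.ElementsFlags, so it is worth running this both before and after changing them.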

There are really two types of problems here:

1) Unclosed elements. The library fixes these by default, but there is an option set on the P element that prevents that in this case.

2) Unopened elements. This one is more complex, because the fix depends on where you want the tag to be opened. In the following sample, I've used the nearest previous text sibling node to open the element.

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        // clear the flags on P so unclosed elements in P will be auto closed.
        HtmlNode.ElementsFlags.Remove("p");

        // load the document
        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourTestFile.htm");

        // build a list of nodes ordered by stream position
        NodePositions pos = new NodePositions(doc);

        // browse all tags detected as not opened
        foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
        {
            // find the text node just before this error
            HtmlTextNode last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
            if (last != null)
            {
                // fix the text: prepend the opening tag and reintroduce the broken closing tag
                last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
            }
        }

        // write the repaired document to the console
        doc.Save(Console.Out);
    }
}

public class NodePositions
{
    public NodePositions(HtmlDocument doc)
    {
        AddNode(doc.DocumentNode);
        Nodes.Sort(new NodePositionComparer());
    }

    private void AddNode(HtmlNode node)
    {
        Nodes.Add(node);
        foreach (HtmlNode child in node.ChildNodes)
        {
            AddNode(child);
        }
    }

    private class NodePositionComparer : IComparer<HtmlNode>
    {
        public int Compare(HtmlNode x, HtmlNode y)
        {
            return x.StreamPosition.CompareTo(y.StreamPosition);
        }
    }

    public List<HtmlNode> Nodes = new List<HtmlNode>();
}
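
The string manipulation that rewrites the text node can be checked on its own, without the library. This sketch assumes that error.SourceText holds the stray closing tag exactly as it appears in the markup (for example </b>); the class name TagRewriteDemo and the method WrapWithTag are hypothetical and exist only for illustration.

```csharp
using System;

public static class TagRewriteDemo
{
    // Mirrors the assignment to last.Text in the sample:
    // derive the opening tag from the closing tag, then wrap the text with both.
    public static string WrapWithTag(string text, string closingTag)
    {
        string openingTag = closingTag.Replace("/", "");
        return openingTag + text + closingTag;
    }

    public static void Main()
    {
        // "word3" is the text that precedes the stray </b> in the question.
        Console.WriteLine(WrapWithTag("word3", "</b>")); // prints <b>word3</b>
    }
}
```

Note that Replace("/", "") removes every slash in the string, which is fine for a bare closing tag such as </b> but would need more care if the source text could contain other slashes.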
