使用HtmlAgilityPack将HTML字符串拆分为两部分 [英] Splitting HTML string into two parts with HtmlAgilityPack

查看:98
本文介绍了使用HtmlAgilityPack将HTML字符串拆分为两部分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找使用HtmlAgilityPack在C#中将HTML文档分成一些标签的最佳方法。我想保留预期的标记,因为我正在进行拆分。这里是一个例子。

I'm looking for the best way to split an HTML document over some tag in C# using HtmlAgilityPack. I want to preserve the intended markup as I'm doing the split. Here is an example.

如果文档是这样的:

If the document is like this:

<p>
<div>
    <p>
        Stuff
    </p>
    <p>
        <ul>
            <li>Bullet 1</li>
            <li><a href="#">link</a></li>
            <li>Bullet 3</li>
        </ul>
    </p>
            <span>Footer</span>
</div>
</p>

分解后,它应该如下所示:

Once it's split, it should look like this:

第一部分

<p>
<div>
    <p>
        Stuff
    </p>
    <p>
        <ul>
            <li>Bullet 1</li>
        </ul>
    </p>
</div>
</p>

第2部分

Part 2

<p>
<div>
    <p>
        <ul>
            <li>Bullet 3</li>
        </ul>
    </p>
            <span>Footer</span>
</div>

</p>

做这种事情的最佳方式是什么?

What would be the best way of doing something like that?

推荐答案

这是我想出来的。这会进行拆分并移除发生拆分的元素的空元素。

Here is what I came up with. This does the split and removes the "empty" elements of the element where the split happens.

    private static void SplitDocument()
    {
        var doc = new HtmlDocument();
        doc.Load("HtmlDoc.html");
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        var firstPart = GetFirstPart(doc.DocumentNode, links[0]).DocumentNode.InnerHtml;
        var secondPart = GetSecondPart(links[0]).DocumentNode.InnerHtml;
    }

    private static HtmlDocument GetFirstPart(HtmlNode currNode, HtmlNode link)
    {
        var nodeStack = new Stack<Tuple<HtmlNode, HtmlNode>>();        
        var newDoc = new HtmlDocument();
        var parent = newDoc.DocumentNode;

        nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(currNode, parent));

        while (nodeStack.Count > 0)
        {
            var curr = nodeStack.Pop();
            var copyNode = curr.Item1.CloneNode(false);
            curr.Item2.AppendChild(copyNode);

            if (curr.Item1 == link)
            {
                var nodeToRemove = NodeAndEmptyAncestors(copyNode);
                nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
                break;
            }

            for (var i = curr.Item1.ChildNodes.Count - 1; i >= 0; i--)
            {
                nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(curr.Item1.ChildNodes[i], copyNode));
            }
        }

        return newDoc;
    }

    private static HtmlDocument GetSecondPart(HtmlNode link)
    {
        var nodeStack = new Stack<HtmlNode>();
        var newDoc = new HtmlDocument();

        var currNode = link;
        while (currNode.ParentNode != null)
        {
            currNode = currNode.ParentNode;
            nodeStack.Push(currNode.CloneNode(false));
        }

        var parent = newDoc.DocumentNode;
        while (nodeStack.Count > 0)
        {
            var node = nodeStack.Pop();
            parent.AppendChild(node);
            parent = node;
        }

        var newLink = link.CloneNode(false);
        parent.AppendChild(newLink);

        currNode = link;
        var newParent = newLink.ParentNode;

        while (currNode.ParentNode != null)
        {
            var foundNode = false;
            foreach (var child in currNode.ParentNode.ChildNodes)
            {
                if (foundNode) newParent.AppendChild(child.Clone());
                if (child == currNode) foundNode = true;
            }

            currNode = currNode.ParentNode;
            newParent = newParent.ParentNode;
        }

        var nodeToRemove = NodeAndEmptyAncestors(newLink);
        nodeToRemove.ParentNode.RemoveChild(nodeToRemove);

        return newDoc;
    }

    private static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
    {
        var currNode = node;
        while (currNode.ParentNode != null && currNode.ParentNode.ChildNodes.Count == 1)
        {
            currNode = currNode.ParentNode;
        }

        return currNode;
    }

这篇关于使用HtmlAgilityPack将HTML字符串拆分为两部分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆