HtmlAgilityPack substring of all by length


Question

I have HTML with nested elements (mostly just div and p elements). I need to return the same HTML, truncated to a given number of letters. The letter count should skip the HTML tags and include only the InnerText of each element. The result should keep a proper structure, including any closing tags, so that it remains valid HTML.

Sample input:

<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
    <div>
        <p>some more text some more text some more text some more text some more text</p>
        <p>some more text some more text some more text some more text some more text</p>
    </div>
</div>

Given int length = 16, the output should look like this:

<div>
    <p>some text</p> // 9 characters in the InnerText here
    <p>some mo</p> // 7 characters in the InnerText here; 9 + 7 = 16;
</div>

Notice that the number of letters (including spaces) is 16. The subsequent <div> is dropped because the letter count has already reached length. Notice that the output HTML is still valid.

I've tried the following, but it does not really work. The output is not as expected: some HTML elements get repeated.

public static string SubstringHtml(this string html, int length)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    int totalLength = 0;
    StringBuilder output = new StringBuilder();
    foreach (var node in doc.DocumentNode.Descendants())
    {
        totalLength += node.InnerText.Length;
        if(totalLength >= length)
        {
            int difference = totalLength - length;
            string lastPiece = node.InnerText.ToString().Substring(0, difference);
            output.Append(lastPiece);
            break;
        }
        else
        {
            output.Append(node.InnerHtml);
        }
    }
    return output.ToString();
}
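
One likely reason for the repetition, as far as can be told from the code: Descendants() yields every node in the tree, and the InnerText of an element already includes the text of everything nested inside it, so the same letters are counted and appended once per nesting level. Separately, difference holds the overshoot rather than the number of letters still allowed. A minimal sketch that makes the double counting visible; the class name, helper name, and one-line input are made up for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class DescendantsDemo
{
    // Illustrative helper: the InnerText length reported by every descendant node.
    public static int[] InnerTextLengths(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants()
            .Select(n => n.InnerText.Length)
            .ToArray();
    }

    static void Main()
    {
        // The <div>, the <p>, and the text node each report the same nine
        // letters, so summing over Descendants() counts them three times.
        foreach (var length in InnerTextLengths("<div><p>some text</p></div>"))
            Console.WriteLine(length);
    }
}
```

Counting only nodes whose NodeType is HtmlNodeType.Text avoids this, which is what the answer below does.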


Update

@SergeBelov provided a solution that works for the first sample input; however, further testing revealed a problem with input like the one below.

Sample input 2:

some more text some more text 
<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
</div>

Given int maxLength = 7; the output should be some mo. It does not work like that because of this code, where ParentNode is null:

lastNode
    .Node
    .ParentNode
    .ReplaceChild(HtmlNode.CreateNode(lastNodeText.InnerText.Substring(0, lastNode.NodeLength - lastNode.TotalLength + maxLength)), lastNode.Node);

Creating a new HtmlNode does not seem to help because its InnerText property is read-only.
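
InnerText has no setter, but a text node is an HtmlTextNode, and its Text property can, as far as I know, be assigned directly, which avoids going through ParentNode at all. A minimal sketch under that assumption; the class name, helper name, and shortened input are illustrative and not from the original post, and the sketch only shortens the first text node without removing what follows it:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class TextNodeDemo
{
    // Illustrative helper: shortens the first text node in place.
    public static string ShortenFirstText(string html, int maxLength)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Text nodes are HtmlTextNode instances; Text is writable on them.
        var first = doc.DocumentNode.Descendants().OfType<HtmlTextNode>().First();
        if (first.Text.Length > maxLength)
            first.Text = first.Text.Substring(0, maxLength);

        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        Console.WriteLine(
            ShortenFirstText("some more text <div><p>some text</p></div>", 7));
    }
}
```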

Recommended answer

The small console program below illustrates one possible approach, which is to:

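  1. Select the relevant text nodes and calculate the running total of their lengths;
  2. Take as many nodes as required for the running total to reach or exceed the maximum length;
  3. Remove all element nodes from the document except those that are ancestors of the nodes selected in steps 1 and 2;
  4. Cut the text in the last node of the list to fit the maximum length.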
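
To make the arithmetic in steps 1, 2 and 4 concrete, here is a small sketch that runs the same running-total logic on plain numbers, with no HTML involved. The lengths 9, 74, 74, 74 are the text lengths of the four p elements in the first sample input; the class and method names are hypothetical, and the Math.Min guard is an addition for inputs shorter than the limit:

```csharp
using System;
using System.Linq;

static class RunningTotalDemo
{
    // Returns how many text nodes are kept and how many letters of the last one survive.
    public static (int Kept, int Cut) Plan(int[] textLengths, int maxLength)
    {
        int acc = 0;
        var kept = textLengths
            .Select(len => { acc += len; return new { NodeLength = len, TotalLength = acc }; })
            // Keep a node as long as the total before it is still under the limit.
            .TakeWhile(n => n.TotalLength - n.NodeLength < maxLength)
            .ToList();

        if (kept.Count == 0) return (0, 0);

        var last = kept[kept.Count - 1];
        // Letters still allowed in the last node, never more than it contains.
        int cut = Math.Min(last.NodeLength, last.NodeLength - last.TotalLength + maxLength);
        return (kept.Count, cut);
    }

    static void Main()
    {
        var plan = Plan(new[] { 9, 74, 74, 74 }, 16);
        // prints "2 nodes kept, last one cut to 7 letters"
        Console.WriteLine($"{plan.Kept} nodes kept, last one cut to {plan.Cut} letters");
    }
}
```

The second node starts at a running total of 9, which is below 16, so it is kept; it ends at 83, so 74 - 83 + 16 = 7 letters of it remain, matching the expected "some mo".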

UPDATE: This should still work when a text node comes first; a Trim() is probably required to remove the surrounding whitespace from it, as below.

    using System;
    using System.Linq;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            int maxLength = 9;
            string input = @"
                some more text some more text 
                <div>
                    <p>some text</p>
                    <p>some more text some more text some more text some more text some more text</p>
                </div>";

            var doc = new HtmlDocument();
            doc.LoadHtml(input);

            // Get non-empty text nodes together with the running total of their lengths
            var acc = 0;
            var nodes = doc.DocumentNode
                .Descendants()
                .Where(n => n.NodeType == HtmlNodeType.Text && n.InnerText.Trim().Length > 0)
                .Select(n =>
                {
                    var length = n.InnerText.Trim().Length;
                    acc += length;
                    return new { Node = n, TotalLength = acc, NodeLength = length };
                })
                // Keep a node while the total before it is still below the limit
                .TakeWhile(n => (n.TotalLength - n.NodeLength) < maxLength)
                .ToList();

            // Select element nodes we intend to keep
            var nodesToKeep = nodes
                .SelectMany(n => n.Node.AncestorsAndSelf()
                    .Where(m => m.NodeType == HtmlNodeType.Element));

            // Select and remove element nodes we don't need
            var nodesToDrop = doc.DocumentNode
                .Descendants()
                .Where(m => m.NodeType == HtmlNodeType.Element)
                .Except(nodesToKeep)
                .ToList();

            foreach (var r in nodesToDrop)
                r.Remove();

            // Nothing to shorten if no text was selected (empty input or maxLength <= 0)
            if (nodes.Count == 0)
            {
                doc.Save(Console.Out);
                return;
            }

            // Shorten the last node as required; Math.Min guards against input
            // whose total text is shorter than maxLength
            var lastNode = nodes.Last();
            var lastNodeText = lastNode.Node;
            var keep = Math.Min(lastNode.NodeLength,
                    lastNode.NodeLength - lastNode.TotalLength + maxLength);
            var text = lastNodeText.InnerText.Trim().Substring(0, keep);
            lastNodeText
                .ParentNode
                .ReplaceChild(HtmlNode.CreateNode(text), lastNodeText);

            doc.Save(Console.Out);
        }
    }
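
The program above only removes element nodes, so a text node that sits directly inside a kept element after the cut point would, as far as I can tell, survive. A possible alternative is a recursive walk that spends a letter budget in document order. This is a sketch under stated assumptions, not the answer's method: HtmlTruncator and its methods are hypothetical names, it relies on HtmlTextNode.Text being writable, and unlike the Trim()-based version it counts leading and trailing whitespace inside a non-blank text node toward the limit:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class HtmlTruncator
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    public static string Truncate(string html, int maxLength)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        int remaining = maxLength;
        Trim(doc.DocumentNode, ref remaining);
        return doc.DocumentNode.OuterHtml;
    }

    static void Trim(HtmlNode node, ref int remaining)
    {
        // Iterate over a copy because children may be removed along the way.
        foreach (var child in node.ChildNodes.ToList())
        {
            if (child is HtmlTextNode textNode)
            {
                // Whitespace-only nodes are layout; keep them and do not count them.
                if (string.IsNullOrWhiteSpace(textNode.Text)) continue;

                if (remaining <= 0) { child.Remove(); continue; }

                if (textNode.Text.Length > remaining)
                    textNode.Text = textNode.Text.Substring(0, remaining);
                remaining -= textNode.Text.Length;
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                // Once the budget is spent, drop every following element.
                if (remaining <= 0) { child.Remove(); continue; }
                Trim(child, ref remaining);
            }
        }
    }

    static void Main()
    {
        Console.WriteLine(
            Truncate("<div><p>some text</p><p>some more text</p><div><p>x</p></div></div>", 16));
    }
}
```

Because closing tags are regenerated from the tree, the result stays well formed whichever node the cut lands in.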
