HtmlAgilityPack substring of all by length
Question
I have HTML with nested elements (mostly just div and p elements). I need to return the same HTML, but truncated to a given number of letters. The letter count should ignore the HTML tags and count only the letters in the InnerText of each element. The result should preserve proper structure, including any closing tags, so that it remains valid HTML.
Sample input:
<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
    <div>
        <p>some more text some more text some more text some more text some more text</p>
        <p>some more text some more text some more text some more text some more text</p>
    </div>
</div>
Given int length = 16, the output should look like this:
<div>
    <p>some text</p> // 9 characters in the InnerText here
    <p>some mo</p> // 7 characters in the InnerText here; 9 + 7 = 16
</div>
Notice that the number of letters (including spaces) is 16. The subsequent <div> is eliminated because the letter count has reached the value of length. Notice that the output HTML is still valid.
I've tried the following, but it does not really work. The output is not as expected: some HTML elements get repeated.
public static string SubstringHtml(this string html, int length)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    int totalLength = 0;
    StringBuilder output = new StringBuilder();
    foreach (var node in doc.DocumentNode.Descendants())
    {
        totalLength += node.InnerText.Length;
        if (totalLength >= length)
        {
            int difference = totalLength - length;
            string lastPiece = node.InnerText.ToString().Substring(0, difference);
            output.Append(lastPiece);
            break;
        }
        else
        {
            output.Append(node.InnerHtml);
        }
    }
    return output.ToString();
}
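One likely reason for the repeated output is that Descendants() yields every node below the root, elements and text nodes alike, so the same text is seen once for each level of nesting. The sketch below shows that effect without HtmlAgilityPack: Node, DoubleCountDemo and their members are hypothetical stand-ins invented for illustration, and the tree omits the whitespace-only text nodes a real parser would also produce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a parsed node; this is not an HtmlAgilityPack type.
class Node
{
    public string Text = "";                       // set on text nodes only
    public List<Node> Children = new List<Node>(); // set on elements only
    public bool IsText => Children.Count == 0;

    // Concatenated text of this node and everything below it.
    public string InnerText =>
        IsText ? Text : string.Concat(Children.Select(c => c.InnerText));

    // Pre-order walk over every node below this one.
    public IEnumerable<Node> Descendants()
    {
        foreach (var child in Children)
        {
            yield return child;
            foreach (var d in child.Descendants())
                yield return d;
        }
    }
}

static class DoubleCountDemo
{
    static Node TextNode(string s) => new Node { Text = s };
    static Node Element(params Node[] kids) => new Node { Children = kids.ToList() };

    // root > div > (p > "some text", p > "some mo")
    public static Node Build() =>
        Element(Element(Element(TextNode("some text")), Element(TextNode("some mo"))));

    // Sums InnerText over all descendants, as the loop in the question does.
    public static int NaiveTotal(Node root) =>
        root.Descendants().Sum(n => n.InnerText.Length);

    // Sums text nodes only, so each character is counted once.
    public static int TextOnlyTotal(Node root) =>
        root.Descendants().Where(n => n.IsText).Sum(n => n.Text.Length);

    static void Main()
    {
        var root = Build();
        Console.WriteLine(NaiveTotal(root));    // 48
        Console.WriteLine(TextOnlyTotal(root)); // 16
    }
}
```

The div contributes 16, each p contributes its own text again, and each text node a third time, which gives 48 for 16 actual characters. Counting only text nodes avoids this.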
Update
@SergeBelov provided a solution that works for the first sample input; however, further testing revealed an issue with input like the one below.
Sample input 2:
some more text some more text
<div>
    <p>some text</p>
    <p>some more text some more text some more text some more text some more text</p>
</div>
Given int maxLength = 7; the output should be equal to some mo. It does not work like that because of this code, where ParentNode is null:
lastNode
    .Node
    .ParentNode
    .ReplaceChild(HtmlNode.CreateNode(lastNodeText.InnerText.Substring(0, lastNode.NodeLength - lastNode.TotalLength + maxLength)), lastNode.Node);
Creating a new HtmlNode does not seem to help because its InnerText property is read-only.
Recommended answer
The small console program below illustrates one possible approach, which is:
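- Select the relevant text nodes and compute a running total of their lengths;
- Take as many nodes as are needed for the running total to reach the maximum length;
- Remove all element nodes from the document except those that are ancestors of the nodes selected in steps 1 and 2;
- Cut the text in the last node of the list to fit the maximum length.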
UPDATE: This should still work when a text node comes first; a Trim() is probably required to remove the whitespace from it, as below.
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        int maxLength = 9;
        string input = @"
some more text some more text
<div>
<p>some text</p>
<p>some more text some more text some more text some more text some more text</p>
</div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(input);

        // Get the non-blank text nodes together with a running total of their trimmed lengths.
        // A node is taken while the total before it is still below maxLength.
        var acc = 0;
        var nodes = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text && n.InnerText.Trim().Length > 0)
            .Select(n =>
            {
                var length = n.InnerText.Trim().Length;
                acc += length;
                return new { Node = n, TotalLength = acc, NodeLength = length };
            })
            .TakeWhile(n => (n.TotalLength - n.NodeLength) < maxLength)
            .ToList();

        // Select element nodes we intend to keep: the ancestors of the selected text nodes
        var nodesToKeep = nodes
            .SelectMany(n => n.Node.AncestorsAndSelf()
                .Where(m => m.NodeType == HtmlNodeType.Element));

        // Select and remove element nodes we don't need
        var nodesToDrop = doc.DocumentNode
            .Descendants()
            .Where(m => m.NodeType == HtmlNodeType.Element)
            .Except(nodesToKeep)
            .ToList();
        foreach (var r in nodesToDrop)
            r.Remove();

        // Shorten the last node as required
        var lastNode = nodes.Last();
        var lastNodeText = lastNode.Node;
        var text = lastNodeText.InnerText.Trim().Substring(0,
            lastNode.NodeLength - lastNode.TotalLength + maxLength);
        lastNodeText
            .ParentNode
            .ReplaceChild(HtmlNode.CreateNode(text), lastNodeText);

        doc.Save(Console.Out);
    }
}
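To check the arithmetic of the last step in isolation, here is a minimal sketch that applies the same running-total and cut logic to plain strings, with no HtmlAgilityPack dependency. TruncateDemo and Truncate are hypothetical names made up for this illustration. The empty-list guard and the Math.Min clamp are additions that are not in the program above: without them, nodes.Last() throws when nothing is selected, and Substring throws when the total text is shorter than maxLength.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TruncateDemo
{
    // Keeps whole texts while the running total before them is below maxLength,
    // then cuts the last kept text so that the overall total does not exceed maxLength.
    public static List<string> Truncate(IEnumerable<string> texts, int maxLength)
    {
        var acc = 0;
        var kept = texts
            .Select(t =>
            {
                acc += t.Length;
                return new { Text = t, TotalLength = acc, NodeLength = t.Length };
            })
            .TakeWhile(n => n.TotalLength - n.NodeLength < maxLength)
            .ToList();

        // Added guard (assumption): nothing to cut when no text was selected.
        if (kept.Count == 0)
            return new List<string>();

        var last = kept[kept.Count - 1];
        // Same expression as in the program above, clamped (added) so that
        // input shorter than maxLength is returned unchanged.
        var cut = Math.Min(last.NodeLength,
                           last.NodeLength - last.TotalLength + maxLength);

        var result = kept.Take(kept.Count - 1).Select(n => n.Text).ToList();
        result.Add(last.Text.Substring(0, cut));
        return result;
    }

    static void Main()
    {
        var texts = new[] { "some text", "some more text some more text", "never reached" };
        // 9 characters are kept whole, then 29 - 38 + 16 = 7 characters of the second text.
        Console.WriteLine(string.Join("|", Truncate(texts, 16))); // some text|some mo
    }
}
```

With the first sample input this reproduces the expected split of 9 + 7 = 16 characters. Note that the program above measures trimmed lengths, so leading and trailing whitespace in a text node is not counted there.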