How do I remove whitespace in HTML source with Html Agility Pack and C#?
Question
Before posting I tried the solution from this thread:
Here is a snippet of the HTML I'm working with:
<p>This is my text</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>This is next text</p>
I'm using HTML Agility Pack to clean up the HTML:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlLocation);
foreach (var item in doc.DocumentNode.Descendants("p").ToList())
{
    if (item.InnerHtml == " ")
    {
        item.Remove();
    }
}
The output of the code above is:
<p>This is my text</p>
<p>This is next text</p>
So my question is: how do I remove the extra whitespace between the two paragraphs in the HTML source?
Recommended answer
Remove the text nodes between the first and last paragraphs:
HTML:
var html = @"
<p>This is my text</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>This is next text</p>";
Parse it:
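// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// ToList() takes a snapshot so nodes can be removed while iterating.
var paragraphs = doc.DocumentNode.Descendants("p").ToList();
foreach (var item in paragraphs)
{
    if (item.InnerHtml == " ") item.Remove();
}
// Text nodes (here, the leftover line breaks) that follow the first paragraph.
// "following-sibling::text()" avoids also matching text inside nested elements,
// which the ".//" prefix would. SelectNodes returns null when nothing matches.
var followingText = paragraphs[0]
    .SelectNodes("following-sibling::text()")
    ?.ToList() ?? new List<HtmlNode>();
foreach (var text in followingText)
{
    text.Remove();
}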
Result:
<p>This is my text</p><p>This is next text</p>
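To see why whitespace survives after the empty paragraphs are removed, it can help to list the node types the parser produced. The sketch below is only an illustration: `NodeDump` and `Describe` are hypothetical names, not part of Html Agility Pack, and it assumes the paragraphs sit at the top level of the parsed fragment. Each line break between two `<p>` elements shows up as its own text node, which is what the answer removes.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class NodeDump
{
    // Hypothetical helper: returns one label per top-level node,
    // e.g. "Element:p" for a paragraph or "Text" for a text node.
    public static string[] Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.ChildNodes
            .Select(n => n.NodeType == HtmlNodeType.Element
                ? "Element:" + n.Name
                : n.NodeType.ToString())
            .ToArray();
    }
}
```

Calling `NodeDump.Describe("<p>A</p>\n<p> </p>\n<p>B</p>")` should alternate between paragraph elements and text nodes, so removing a `<p>` leaves the text nodes on either side of it in place.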
If you want to keep the line break between the paragraphs, use a for loop and call Remove() on all except the last text node.
for (int i = 0; i < followingText.Count - 1; ++i)
{
    followingText[i].Remove();
}
Result:
<p>This is my text</p>
<p>This is next text</p>
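The two variants above can be folded into one helper. This is a minimal sketch rather than the answer's own code: `HtmlCleanup`, `RemoveBlankParagraphs`, and `keepLineBreak` are hypothetical names. It differs from the answer in two ways that are assumptions on my part: it treats a paragraph as blank when it holds only whitespace text (including `&nbsp;`, decoded with `HtmlEntity.DeEntitize`), and it only touches whitespace text nodes that share a parent with a removed paragraph, so spacing elsewhere in the document is left alone.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HtmlCleanup
{
    // Hypothetical helper, not part of Html Agility Pack.
    public static string RemoveBlankParagraphs(string html, bool keepLineBreak)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parents that lost at least one blank paragraph.
        var parents = new HashSet<HtmlNode>();

        // Snapshot with ToList() because the tree is modified while looping.
        foreach (var p in doc.DocumentNode.Descendants("p").ToList())
        {
            // Blank means: only text children, and that text is whitespace
            // once entities such as &nbsp; are decoded.
            bool onlyText = p.ChildNodes.All(c => c.NodeType == HtmlNodeType.Text);
            if (onlyText && string.IsNullOrWhiteSpace(HtmlEntity.DeEntitize(p.InnerText)))
            {
                parents.Add(p.ParentNode);
                p.Remove();
            }
        }

        // Choose the nodes to drop before modifying the tree. With keepLineBreak,
        // a blank text node is dropped only if another blank text node follows it,
        // so the last node of each run survives.
        var toRemove = parents
            .SelectMany(parent => parent.ChildNodes.Where(IsBlankText))
            .Where(n => !keepLineBreak || IsBlankText(n.NextSibling))
            .ToList();
        foreach (var n in toRemove)
        {
            n.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }

    // True for a text node that contains only whitespace characters.
    private static bool IsBlankText(HtmlNode node) =>
        node != null
        && node.NodeType == HtmlNodeType.Text
        && string.IsNullOrWhiteSpace(node.InnerText);
}
```

Passing `keepLineBreak: false` corresponds to the first result above (paragraphs joined on one line), and `keepLineBreak: true` to the second (one line break kept between them). If the blank paragraphs sit next to inline content where a single space matters, this sketch may remove more than intended.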