htmlagilitypack - remove script and style?
Problem description
I'm using the following method to extract text from HTML:
public string getAllText(string _html)
{
    string _allText = "";
    try
    {
        HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(_html);
        var root = document.DocumentNode;
        var sb = new StringBuilder();
        foreach (var node in root.DescendantNodesAndSelf())
        {
            if (!node.HasChildNodes)
            {
                string text = node.InnerText;
                if (!string.IsNullOrEmpty(text))
                    sb.AppendLine(text.Trim());
            }
        }
        _allText = sb.ToString();
    }
    catch (Exception)
    {
    }
    _allText = System.Web.HttpUtility.HtmlDecode(_allText);
    return _allText;
}
The problem is that the output also includes the contents of script and style tags.
How can I exclude them?
Solution
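// Requires "using System.Linq;" for Where and ToList.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Find every <script> and <style> element and remove it from the document.
// ToList() snapshots the matches so removal does not modify the sequence being enumerated.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style")
    .ToList()
    .ForEach(n => n.Remove());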
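For context, here is a minimal sketch of how the removal step above might be combined with the text extraction from the question. The class name HtmlTextExtractor and the method name GetVisibleText are hypothetical, chosen only for illustration. Two deliberate differences from the question's code are assumptions on my part: the sketch keeps only text nodes (so HTML comments are skipped as well), and it decodes entities with System.Net.WebUtility.HtmlDecode instead of System.Web.HttpUtility.HtmlDecode so that it does not depend on System.Web.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class HtmlTextExtractor
{
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the <script> and <style> elements first, then remove them.
        // Removing while enumerating Descendants() directly would modify the
        // tree during iteration.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // Collect the remaining text nodes, one per line.
        // Assumption: only text nodes are wanted, so comments are ignored.
        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.DescendantsAndSelf())
        {
            if (node.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            string text = node.InnerText.Trim();
            if (text.Length > 0)
            {
                // Decode entities such as &amp; into plain characters.
                sb.AppendLine(WebUtility.HtmlDecode(text));
            }
        }

        return sb.ToString();
    }
}
```

It could be called as string text = HtmlTextExtractor.GetVisibleText(html);. Removing the elements before walking the tree keeps the extraction loop simple, at the cost of mutating the loaded document. If the document is needed intact afterwards, skipping text nodes whose parent is a script or style element would be an alternative.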