.NET Remove/Strip JavaScript and CSS code blocks from HTML page
Question
I have an HTML string that contains JavaScript and CSS code blocks:
<script type="text/javascript">
alert('hello world');
</script>
<style type="text/css">
A:link {text-decoration: none}
A:visited {text-decoration: none}
A:active {text-decoration: none}
A:hover {text-decoration: underline; color: red;}
</style>
How can I strip those blocks? Are there any regular expressions that could be used to remove them?
Recommended answer
The quick 'n' dirty method would be a regex like this:
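using System.Text.RegularExpressions;

// Singleline lets '.' match newlines, so multi-line blocks are removed too.
var regex = new Regex(
    "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
    RegexOptions.Singleline | RegexOptions.IgnoreCase
);
string output = regex.Replace(input, "");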
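As a rough, self-contained sketch, the regex approach can be tried against a sample like the one in the question. The class name `RegexStripDemo` and the helper `StripScriptAndStyle` are hypothetical names introduced for illustration. The pattern is the same one, written as a verbatim string without the redundant backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as in the answer. Singleline makes '.' match newlines;
    // IgnoreCase also catches <SCRIPT> and <Style>.
    private static readonly Regex BlockPattern = new Regex(
        @"(<script(.+?)</script>)|(<style(.+?)</style>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper name, introduced for illustration only.
    public static string StripScriptAndStyle(string html)
    {
        return BlockPattern.Replace(html, "");
    }

    public static void Main()
    {
        string input =
            "<p>before</p>\n" +
            "<script type=\"text/javascript\">\nalert('hello world');\n</script>\n" +
            "<style type=\"text/css\">\nA:link {text-decoration: none}\n</style>\n" +
            "<p>after</p>";

        // Only the two blocks are removed; the newlines between them remain.
        Console.WriteLine(StripScriptAndStyle(input));
    }
}
```

Keeping the `Regex` in a static field avoids re-parsing the pattern on every call, which may matter if many pages are processed.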
The better* (but possibly slower) option would be to use HtmlAgilityPack:
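using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlInput);
// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//script|//style");
if (nodes != null)
{
    foreach (var node in nodes)
        node.ParentNode.RemoveChild(node);
}
string htmlOutput = doc.DocumentNode.OuterHtml;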
*) For a discussion about why it's better, see this thread.
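One way to see the difference, offered as an illustrative sketch rather than a summary of the linked thread: the regex works on raw text, so it can both miss valid markup and remove too much. The inputs below, the `<styled-box>` element, and the class name `RegexPitfallDemo` are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo
{
    // The same quick 'n' dirty pattern, written as a verbatim string.
    private static readonly Regex BlockPattern = new Regex(
        @"(<script(.+?)</script>)|(<style(.+?)</style>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Strip(string html) => BlockPattern.Replace(html, "");

    public static void Main()
    {
        // Whitespace before '>' in an end tag is valid HTML, but the pattern
        // expects "</script>" exactly, so this script block is left in place.
        string missed = "<p>a</p><script>alert(1)</script ><p>b</p>";
        Console.WriteLine(Strip(missed) == missed); // True

        // "<styled-box" begins with "<style", so the match starts there and
        // runs to the real </style>, removing the element's text as well.
        string overMatched = "<styled-box>keep me</styled-box><style>p{}</style>";
        Console.WriteLine(Strip(overMatched).Contains("keep me")); // False
    }
}
```

An HTML parser selects elements by tag name rather than by text prefix, so neither of these two cases should trip it up.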