.NET Remove/Strip JavaScript and CSS code blocks from HTML page

Question

I have an HTML string containing JavaScript and CSS code blocks:

<script type="text/javascript">

  alert('hello world');

</script>

<style type="text/css">
  A:link {text-decoration: none}
  A:visited {text-decoration: none}
  A:active {text-decoration: none}
  A:hover {text-decoration: underline; color: red;}
</style>

How can I strip those blocks? Any suggestions for regular expressions that could be used to remove them?

Recommended answer

The quick 'n' dirty method would be a regex like this:

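// Matches a whole <script>...</script> or <style>...</style> block.
// Singleline lets '.' span newlines; IgnoreCase also catches <SCRIPT>, <Style>, etc.
var regex = new Regex(
   "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)", 
   RegexOptions.Singleline | RegexOptions.IgnoreCase
);

string output = regex.Replace(input, "");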
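To make the snippet above easier to try, here is a minimal self-contained sketch that applies the same idea to input shaped like the question's example. The names `HtmlStripper` and `StripScriptAndStyle` are hypothetical, chosen only for illustration, and the pattern is written as a verbatim string without the capture groups, which should match the same blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Hypothetical helper (not from the original answer).
    // Same approach as above: lazily match from the opening tag to the
    // first matching closing tag, across newlines, ignoring case.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<script.+?</script>|<style.+?</style>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string StripScriptAndStyle(string html)
    {
        return ScriptOrStyle.Replace(html, "");
    }
}

public static class Program
{
    public static void Main()
    {
        string input =
            "<p>before</p>" +
            "<script type=\"text/javascript\">alert('hello world');</script>" +
            "<style type=\"text/css\">A:link {text-decoration: none}</style>" +
            "<p>after</p>";

        // Prints: <p>before</p><p>after</p>
        Console.WriteLine(HtmlStripper.StripScriptAndStyle(input));
    }
}
```

Keep in mind that this only matches a closing tag written exactly as `</script>` or `</style>`; a variant such as `</script >` would not be matched, which is one reason the approach counts as quick 'n' dirty.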

The better* (but possibly slower) option would be to use HtmlAgilityPack:

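using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlInput);

// SelectNodes returns null (not an empty collection) when nothing matches,
// so guard before iterating.
var nodes = doc.DocumentNode.SelectNodes("//script|//style");

if (nodes != null)
{
    foreach (var node in nodes)
        node.ParentNode.RemoveChild(node);
}

string htmlOutput = doc.DocumentNode.OuterHtml;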

*) For a discussion about why it's better, see this thread.
