Remove JavaScript with Regex
Question
I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but also miss a lot. Parsing the JavaScript with the MSHTML DOM parser causes the JavaScript to actually run, which is what I am trying to avoid by using regex.
"<script.*/>"
"<script[^>]*>.*</script>"
"<script.*?>[\\s\\S]*?</.*?script>"
Does anyone know what I am missing that causes these three regular expressions to miss blocks of JavaScript?
An example of what I want to remove:
<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
<script type="text/javascript">
<!--
var Time=new Application('Time')
//-->
</script>
<script type="text/javascript">
if(window['com.actions']) {
window['com.actions'].approvalStatement = "",
window['com.actions'].hasApprovalStatement = false
}
</script>
Recommended answer
I assume you are simply trying to sanitize the input by stripping out JavaScript. Frankly, I'm worried that this solution is too simple, because it seems so incredibly simple. The reasoning follows the expression (written as a C# verbatim string):
@"(?s)<script.*?(/>|</script>)"
That's it, I hope! (It certainly works for your examples!)
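As a rough illustration of how the expression above might be applied, here is a minimal C# sketch. The class and method names (`ScriptStripper`, `Strip`) are hypothetical and not part of the original answer; only the pattern itself comes from it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Pattern taken from the answer: (?s) lets '.' match newlines, and the
    // lazy '.*?' stops at the first "/>" or "</script>" after "<script".
    private static readonly Regex ScriptBlock =
        new Regex(@"(?s)<script.*?(/>|</script>)");

    // Hypothetical helper: returns the input with every match removed.
    public static string Strip(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

Calling `ScriptStripper.Strip(html)` on a string containing the blocks from the question should return the surrounding markup with those blocks removed. Note that the match is case-sensitive as written.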
My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags. It's not so much the nesting of DIFFERENT tags as the nesting of IDENTICALLY NAMED tags.
For example,
<b> bold <i> AND italic </i></b>
...is not so bad, but
<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>
would be much harder to parse, because the ending tags are IDENTICAL.
However, since it is invalid to nest script tags, the next instance of /> (<- is this valid?) or </script> is the end of this script block.
There's always the possibility of HTML comments or CDATA sections inside the script tag, but those should be fine as long as they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so you never know.
To handle a little extra whitespace, you could use:
@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"
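To make the difference between the two expressions concrete, the following sketch applies the whitespace-tolerant pattern to tags that contain stray spaces. The names `LenientScriptStripper` and `Strip` are assumptions made for illustration; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LenientScriptStripper
{
    // Whitespace-tolerant pattern from the answer. Each '\s?' permits at most
    // one whitespace character; '\s*' would be needed to allow any amount.
    private static readonly Regex ScriptBlock =
        new Regex(@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)");

    // Hypothetical helper: returns the input with every match removed.
    public static string Strip(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

With this variant, input such as `< script>x< / script >` is matched, whereas the stricter expression requires the literal text `<script` and would leave it untouched.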
Please let me know if you can figure out a way to break it that lets through VALID HTML with runnable JavaScript. (I know there are a few ways to get some stuff through, but anything that does get through should be broken in one of many different ways and should not be runnable JavaScript.)
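One case worth probing is letter case: .NET regular expressions are case-sensitive by default, while HTML tag names are not. The sketch below compares the pattern with and without `RegexOptions.IgnoreCase`. The class and method names are hypothetical, and this is only a probe of one input, not a claim that ignoring case makes the filter complete.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseProbe
{
    // Whitespace-tolerant pattern from the answer, unchanged.
    private const string Pattern =
        @"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)";

    // Applies the pattern exactly as written (case-sensitive).
    public static string StripCaseSensitive(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty);
    }

    // Same pattern, but also matching <SCRIPT>, <Script>, and so on.
    public static string StripIgnoringCase(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

For the input `<SCRIPT>alert(1)</SCRIPT>`, `StripCaseSensitive` returns the string unchanged, while `StripIgnoringCase` returns an empty string.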