Remove JavaScript with Regex


Problem description

I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but also miss a lot. Parsing the JavaScript with the MSHTML DOM parser causes the JavaScript to actually run, which is exactly what I am trying to avoid by using regex.

    "<script.*/>"

    "<script[^>]*>.*</script>"

    "<script.*?>[\\s\\S]*?</.*?script>"

Does anyone know what I am missing that causes these three regular expressions to miss blocks of JavaScript?

An example of what I want to remove:

<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
    <script type="text/javascript">
    <!--
        var Time=new Application('Time')
    //-->
    </script>
    <script type="text/javascript">
        if(window['com.actions']) {
            window['com.actions'].approvalStatement =  "",
            window['com.actions'].hasApprovalStatement = false
        }
    </script>


Recommended answer

I assume you are simply trying to sanitize the input by stripping out JavaScript. Frankly, I'm worried that this solution is too simple, because it seems so incredibly simple. See below for the reasoning, after the expression (as a C# string):

@"(?s)<script.*?(/>|</script>)"

That's it, I hope! (It certainly works for your examples!)
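As a minimal sketch of how the expression might be applied in C#, the snippet below wraps it in Regex.Replace. The class and method names (ScriptStripper, Strip) are hypothetical and exist only for illustration; the pattern itself is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class ScriptStripper
{
    // (?s) enables single-line mode, so '.' also matches newlines and the
    // lazy .*? can span a multi-line script body up to the first
    // "/>" or "</script>".
    private static readonly Regex ScriptBlock =
        new Regex(@"(?s)<script.*?(/>|</script>)");

    // Returns the input with every matched script block removed.
    public static string Strip(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

For instance, ScriptStripper.Strip("<p>a</p><script type=\"text/javascript\">\nvar x = 1;\n</script><p>b</p>") returns "<p>a</p><p>b</p>". Without (?s), '.' stops at line breaks, so a multi-line block like this one would not match.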

My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags. It's not so much the nesting of DIFFERENT tags as the nesting of tags with the SAME name.

For example,

<b> bold <i> AND italic </i></b>

...is not so bad, but

<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>

would be much harder to parse, because the ending tags are IDENTICAL.

However, since it is invalid to nest script tags, the next instance of /> (<- is this valid?) or </script> is the end of the script block.

There's always the possibility of HTML comments or CDATA sections inside the script tag, but those should be fine as long as they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so you never know. To handle a little extra possible whitespace, you could use:

@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"
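A sketch of the whitespace-tolerant variant in use follows. The class name LenientScriptStripper is hypothetical, and RegexOptions.IgnoreCase is an addition that is not in the answer: .NET patterns are case-sensitive by default, so without it an uppercase SCRIPT tag would not be matched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class LenientScriptStripper
{
    // Same pattern as above, allowing one optional whitespace character
    // around '<', '/' and before '>'.
    // Assumption: IgnoreCase is added here and is not part of the answer.
    private static readonly Regex ScriptBlock = new Regex(
        @"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)",
        RegexOptions.IgnoreCase);

    // Returns the input with every matched script block removed.
    public static string Strip(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

With this version, both "a< script >alert(1)< / script >b" and "a<SCRIPT>alert(1)</SCRIPT>b" come back as "ab".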

Please let me know if you can figure out a way to break it that lets through VALID HTML with runnable JavaScript. (I know there are a few ways to get some stuff through, but anything that does get through should be broken in one of many different ways and should not be runnable JavaScript.)
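To illustrate the kind of input that gets "some stuff through", here is a small sketch (the class and method names are hypothetical). Because the lazy match stops at the first /> it meets, a script body that itself contains /> is cut short and the tail of the script is left behind as plain text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for illustration only.
public static class LeftoverDemo
{
    // Applies the basic expression from this answer.
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"(?s)<script.*?(/>|</script>)", string.Empty);
    }

    // A script body containing "/>" ends the match early.
    public static string LeftoverExample()
    {
        string html = "<script>document.write(\"<br />\"); alert(1);</script>";
        return Strip(html);
    }
}
```

LeftoverExample() returns "); alert(1);</script> (including the leading quote character). The opening script tag is gone, so what remains is a text fragment followed by a stray closing tag, not a script block. It may be worth checking the output for leftovers like this before trusting it.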
