Remove JavaScript with Regex

Views: 73 Published: 2019/5/27 16:38:28 Tags: c# javascript regex


Question

I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but also miss a lot. Parsing the JavaScript with the MSHTML DOM parser causes the JavaScript to actually run, which is exactly what I am trying to avoid by using regex.

    "<script.*/>"

    "<script[^>]*>.*</script>"

    "<script.*?>[\\s\\S]*?</.*?script>"

Does anyone know what I am missing that causes these three regular expressions to miss blocks of JavaScript?

An example of what I want to remove:

<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
    <script type="text/javascript">
    <!--
        var Time=new Application('Time')
    //-->
    </script>
    <script type="text/javascript">
        if(window['com.actions']) {
            window['com.actions'].approvalStatement =  "",
            window['com.actions'].hasApprovalStatement = false
        }
    </script>

Recommended answer

I assume you are simply trying to sanitize the input by stripping out JavaScript. Frankly, I'm worried that this solution is too simple, because it seems so incredibly simple. The reasoning follows the expression (written as a C# string):

@"(?s)<script.*?(/>|</script>)"

That's it, I hope! (It certainly works for your examples.)
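As a minimal sketch of how the expression above might be applied, the class below wraps it in a call to `Regex.Replace`. The class name `ScriptStripper` and the method `Strip` are hypothetical names chosen for illustration; only the pattern itself comes from the answer. The `(?s)` prefix is the inline form of `RegexOptions.Singleline`, which lets `.` match newlines so that multi-line script blocks are covered.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; only the pattern comes from the answer.
public static class ScriptStripper
{
    // (?s) enables single-line mode, so '.' also matches '\n'.
    // The lazy '.*?' stops at the first "/>" or "</script>" after "<script".
    public const string Pattern = @"(?s)<script.*?(/>|</script>)";

    // Replaces every match of the pattern with an empty string.
    public static string Strip(string html)
    {
        return Regex.Replace(html, Pattern, "");
    }
}
```

For example, `ScriptStripper.Strip("<p>a</p><script>\nvar x = 1;\n</script><p>b</p>")` returns `<p>a</p><p>b</p>`. Without `(?s)`, the same input would be left untouched, because `.` would stop at the first newline.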

My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags. It's not so much the nesting of DIFFERENT tags as the nesting of tags with the SAME name.

For example,

<b> bold <i> AND italic </i></b>

...is not so bad, but

<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>

would be much harder to parse, because the ending tags are IDENTICAL.

However, since it is invalid to nest script tags, the next instance of /> (<- is this valid?) or </script> is the end of this script block.

There's always the possibility of HTML comments or CDATA sections inside the script tag, but those should be fine as long as they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so you never know. To handle a little extra possible whitespace, you could use:

@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"
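The whitespace-tolerant expression can be exercised the same way. This is a sketch under the same assumptions as before: `LenientScriptStripper` and `Strip` are hypothetical names, and only the pattern is taken from the answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; only the pattern comes from the answer.
public static class LenientScriptStripper
{
    // Each \s? allows at most ONE whitespace character at that position.
    public const string Pattern = @"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)";

    // Replaces every match of the pattern with an empty string.
    public static string Strip(string html)
    {
        return Regex.Replace(html, Pattern, "");
    }
}
```

With this version, `LenientScriptStripper.Strip("a<script>x()</script >b")` returns `ab`. Because `\s?` matches a single character at most, a closing tag with two or more spaces before the `>` would not be recognized as the end of the block; replacing `\s?` with `\s*` is one possible way to loosen that, though the answer does not say so.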

Please let me know if you can figure out a way to break it that lets through VALID HTML containing runnable JavaScript. (I know there are a few ways to get some stuff through, but anything that does get through should be broken in one of many different ways and should not be runnable JavaScript.)
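One input that appears to get through as written: HTML tag names are case-insensitive, while .NET regular expressions match case-sensitively by default, so an uppercase `<SCRIPT>` block is not matched. The sketch below shows the difference. The class and method names are hypothetical, and adding `RegexOptions.IgnoreCase` is an assumption on top of the answer, not something it proposes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; only the pattern comes from the answer.
public static class CaseInsensitiveScriptStripper
{
    public const string Pattern = @"(?s)<script.*?(/>|</script>)";

    // As written in the answer: matching is case-sensitive by default.
    public static string StripCaseSensitive(string html)
    {
        return Regex.Replace(html, Pattern, "");
    }

    // Assumption, not from the answer: also pass RegexOptions.IgnoreCase.
    public static string StripIgnoreCase(string html)
    {
        return Regex.Replace(html, Pattern, "", RegexOptions.IgnoreCase);
    }
}
```

Given `"<SCRIPT>alert(1)</SCRIPT>"`, `StripCaseSensitive` returns the input unchanged, while `StripIgnoreCase` returns an empty string. Script can also run from places this pattern never looks at, such as event-handler attributes, so this closes one gap rather than all of them.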
