How do I filter all HTML tags except a certain whitelist?


Question

This is for .NET. IgnoreCase is set and MultiLine is NOT set.

Usually I'm decent at regex, maybe I'm running low on caffeine...

Users are allowed to enter HTML-encoded entities (&lt;, &amp;, etc.), and to use the following HTML tags:

u, i, b, h3, h4, br, a, img

Self-closing <br/> and <img/> are allowed, with or without the extra space, but are not required.

I want to:

  1. Strip all starting and ending HTML tags other than those listed above.
  2. Remove attributes from the remaining tags, except anchors can have an href.

My search pattern (replaced with an empty string) so far:

<(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>

This seems to be stripping all but the start and end tags I want, but there are three problems:


  1. Having to include the end tag version of each allowed tag is ugly.
  2. The attributes survive. Can this happen in a single replacement?
  3. Tags starting with the allowed tag names slip through. E.g., "<abbrev>" and "<iframe>".

The following suggested pattern does not strip out tags that have no attributes.

</?(?!i|b|h3|h4|a|img)\b[^>]*>

As mentioned below, ">" is legal in an attribute value, but it's safe to say I won't support that. Also, there will be no CDATA blocks, etc. to worry about. Just a little HTML.
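Problems 1 and 3 above can be sketched with a single negative lookahead. This is a minimal sketch, not a confirmed fix from the thread: the class and method names are hypothetical, and it assumes, as stated above, that ">" never appears inside an attribute value. The optional slash sits inside the lookahead so the engine cannot backtrack past it and strip an allowed end tag, and the \b after the name list keeps "<abbrev>" and "<iframe>" from passing as "a" and "i". Attributes on allowed tags still survive.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the original post.
public static class TagStripSketch
{
    // Allowed tag names, joined with | for use inside the lookahead.
    private const string Allowed = "u|i|b|h3|h4|br|a|img";

    // <                  opening bracket
    // (?!/?(?:...)\b)    fail if an optional slash plus a whole allowed name comes next
    // [^>]*>             otherwise consume everything through the closing bracket
    private static readonly Regex Disallowed = new Regex(
        @"<(?!/?(?:" + Allowed + @")\b)[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag whose name is not in the allowed list.
    public static string StripDisallowed(string html)
    {
        return Disallowed.Replace(html, "");
    }
}
```

One known gap in this sketch: \b also sees a boundary before a hyphen, so a tag such as "<a-b>" would be kept.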

Loophole's answer is the best one so far, thanks! Here's his pattern (hoping the PRE works better for me):

static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

Some small tweaks I think could still be made to this answer:


  1. I think this could be modified to capture simple HTML comments (those that do not themselves contain tags) by adding "!--" to the "acceptable" variable and making a small change to the end of the expression to allow for an optional trailing "\s--".
  2. I think this would break if there are multiple whitespace characters between attributes (example: heavily-formatted HTML with line breaks and tabs between attributes).
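The whitespace concern can be checked with a small comparison. The sketch below places the posted pattern next to an assumed tweak that replaces the single \s before each attribute name with \s+; the class and field names are hypothetical, and the tweak is one possible adjustment rather than something confirmed in the thread.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comparison harness; names are illustrative only.
public static class WhitespaceTweakSketch
{
    private const string Acceptable = "script|link|title";

    // Pattern as posted: exactly one whitespace character before each attribute name.
    public static readonly Regex Posted = new Regex(
        @"</?(?(?=" + Acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>");

    // Assumed tweak: \s+ accepts runs of spaces, tabs, and line breaks.
    public static readonly Regex Tweaked = new Regex(
        @"</?(?(?=" + Acceptable + @")notag|[a-zA-Z0-9]+)(?:\s+[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>");
}
```

With these definitions, Posted.IsMatch("<div  class=\"a\">") is false because of the double space, while Tweaked.IsMatch on the same input is true. The same holds when a line break followed by a space separates two attributes.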

Edit 2009-07-23: Here's the final solution I went with (in VB.NET):

 Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
 Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
      ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
 html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)

The caveat is that the HREF attribute of A tags still gets scrubbed, which is not ideal.
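One possible way to handle the attribute requirement is a second pass over the tags that survive the whitelist, rebuilding each opening tag from its name and keeping only a quoted href on anchors. This is a sketch under several assumptions, not the solution used above: the class, method, and group names are hypothetical, unquoted href values are dropped, ">" is assumed never to appear inside an attribute value, and the href value itself is not validated.

```csharp
using System.Text.RegularExpressions;

// Hypothetical second pass; not part of the original solution.
public static class AttributeScrubSketch
{
    // Opening tag: name, raw attribute text, optional self-closing slash.
    private static readonly Regex OpenTag = new Regex(
        @"<(?<name>[a-z][a-z0-9]*)\b(?<attrs>[^>]*?)(?<close>\s*/?)>",
        RegexOptions.IgnoreCase);

    // A quoted href preceded by whitespace; \k<q> requires the same closing quote.
    private static readonly Regex Href = new Regex(
        @"(?<=\s)href\s*=\s*(?<q>[""']).*?\k<q>",
        RegexOptions.IgnoreCase);

    // Rebuilds each opening tag without attributes, except href on <a>.
    public static string ScrubAttributes(string html)
    {
        return OpenTag.Replace(html, m =>
        {
            string name = m.Groups["name"].Value;
            string kept = "";
            if (name.ToLowerInvariant() == "a")
            {
                Match href = Href.Match(m.Groups["attrs"].Value);
                if (href.Success) kept = " " + href.Value;
            }
            return "<" + name + kept + m.Groups["close"].Value + ">";
        });
    }
}
```

End tags are left untouched because the pattern requires a tag name directly after the opening bracket.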

Recommended answer

Here's a function I wrote for this task:

static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

For some reason I posted a correction to my previous answer as a separate answer, so I am consolidating them here.

I will explain the regex a bit, because it is a little long.

The first part matches an open bracket and 0 or 1 slashes (in case it's a close tag).

Next you see an if-then construct with a lookahead: (?(?=SomeTag)then|else). I am checking to see if the next part of the string is one of the acceptable tags. You can see that I concatenate the regex string with the acceptable variable, which holds the acceptable tag names separated by vertical bars so that any of the terms will match. If it is a match, I put in the word "notag": no tag would match that, and if the tag is acceptable I want to leave it alone. Otherwise I move on to the else part, where I match any tag name: [a-zA-Z0-9]+
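The conditional can be tried in isolation. The following is a minimal sketch with hypothetical names; the ^ and $ anchors are added only so that a single tag can be tested, and attributes are left out. It returns true when the tag would be matched and therefore replaced.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of the (?(?=...)then|else) construct; names are illustrative.
public static class ConditionalSketch
{
    // True when the tag would be matched (and so stripped) by the conditional.
    public static bool WouldStrip(string tag, string acceptable)
    {
        // If an acceptable name comes next, demand the literal "notag" (which fails);
        // otherwise accept any alphanumeric tag name.
        string pattern = @"^</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)>$";
        return Regex.IsMatch(tag, pattern);
    }
}
```

With acceptable set to "script|link|title", "<div>" is matched, while "<title>" and "</title>" are not. Because the lookahead tests only a prefix, "<titlebar>" is also left alone; appending \b after the alternation, in a non-capturing group so the \1 backreference in the full pattern keeps its meaning, is one way to tighten that.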

Next, I want to match 0 or more attributes, which I assume are in the form attribute="value". So now I group the part representing an attribute, but I use ?: to prevent this group from being captured, for speed: (?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)

Here I begin with the whitespace character that would sit between the tag name and the attribute name, then match an attribute name: [a-zA-Z0-9\-]+

Next I match an optional equals sign, and then either kind of quote. I group the quote so that it is captured, and I can use the backreference \1 later to match the same type of quote. Between these two quotes, you can see I use the period to match anything, but I use the lazy version .*? instead of the greedy version .* so that it only matches up to the next quote that would end this value.
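The quote capture and the lazy quantifier can be seen in a reduced form. This sketch uses hypothetical names and differs from the pattern above in one respect: the quote is required here rather than optional, which keeps the demonstration unambiguous.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of a captured quote plus backreference; names are illustrative.
public static class QuoteBackrefSketch
{
    // (["'])  capture the opening quote
    // (.*?)   lazily take as little as possible
    // \1      require the same quote character to close the value
    private static readonly Regex QuotedValue = new Regex(@"=([""'])(.*?)\1");

    // Returns the first quoted attribute value, or an empty string if none is found.
    public static string FirstValue(string tag)
    {
        Match m = QuotedValue.Match(tag);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }
}
```

For "<a href='p' rel='q'>" the lazy version stops at the first closing quote and yields "p", where a greedy .* would run on to the last quote. For "<p title=\"it's\">" the backreference skips the apostrophe and yields "it's", because only a double quote can close a value opened with one.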

Next we put a * after closing the groups with parentheses, so that it will match multiple attribute/value combinations (or none). Last we match optional whitespace with \s*, and 0 or 1 trailing slashes for XML-style self-closing tags.

You can see I'm replacing the tags with "sausage", because I'm hungry, but you could replace them with an empty string instead to just clear them out.
