How do I filter all HTML tags except a certain whitelist?

Question

This is for .NET. IgnoreCase is set and MultiLine is NOT set.

Usually I'm decent at regex; maybe I'm running low on caffeine...

Users are allowed to enter HTML-encoded entities (&lt;, &amp;, etc.), and to use the following HTML tags:

u, i, b, h3, h4, br, a, img

Self-closing <br/> and <img/> are allowed, with or without the extra space, but are not required.

I want to:

  1. Strip all starting and ending HTML tags other than those listed above.
  2. Remove attributes from the remaining tags, except that anchors can have an href.

My search pattern (replaced with an empty string) so far:

<(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>
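To see what the pattern above does in practice, here is a minimal reproduction. It is written for Python's `re` module instead of the .NET `Regex` class, on the assumption that the lookahead syntax behaves the same way in both; `strip_tags` is a hypothetical helper name, and `re.IGNORECASE` stands in for the IgnoreCase option mentioned above.

```python
import re

# The pattern from the question, unchanged. IGNORECASE mirrors the
# IgnoreCase option; no multi-line flag is set.
PATTERN = re.compile(
    r"<(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>",
    re.IGNORECASE,
)

def strip_tags(html: str) -> str:
    """Replace every match of the pattern with an empty string."""
    return PATTERN.sub("", html)

# A tag that is not on the list is removed, an allowed one is kept.
print(strip_tags("<p>Hello <b>world</b></p>"))
# -> Hello <b>world</b>

# Attributes on an allowed tag are left exactly as they were.
print(strip_tags('<a href="x" onclick="evil()">link</a>'))
# -> <a href="x" onclick="evil()">link</a>

# A tag whose name merely starts with an allowed name is also kept.
print(strip_tags("<abbrev>text</abbrev>"))
# -> <abbrev>text</abbrev>
```

The negative lookahead only inspects the first characters after `<`, which is why anything beginning with `a`, `b`, or `i` passes through untouched.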

This seems to be stripping all but the start and end tags I want, but there are three problems:

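  1. Having to include the closing-tag version of each allowed tag is ugly.
  2. The attributes survive. Can this happen in a single replacement?
  3. Tags that begin with an allowed tag name slip through. E.g., "<abbrev>" and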
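
One possible way to tighten the pattern for the first and third problems is sketched below, again in Python's `re` under the assumption that .NET treats these constructs the same way. The `ALLOWED` tuple is copied from the tag list in the question; `strip_disallowed` is a hypothetical name. Attribute removal (the second problem) is deliberately not attempted here.

```python
import re

# Whitelist copied from the question's list of allowed tags.
ALLOWED = ("u", "i", "b", "h3", "h4", "br", "a", "img")

# "/?" covers opening and closing tags, so each name is listed only once.
# The inner lookahead requires whitespace, "/" or ">" right after the name,
# so "<abbrev>" is no longer mistaken for "<a>".
DISALLOWED_TAG = re.compile(
    r"<(?!/?(?:" + "|".join(ALLOWED) + r")(?=[\s/>]))[^>]+>",
    re.IGNORECASE,
)

def strip_disallowed(html: str) -> str:
    """Remove every tag whose name is not in ALLOWED; attributes are untouched."""
    return DISALLOWED_TAG.sub("", html)

print(strip_disallowed("<abbrev>x</abbrev> <b>y</b><br/>"))
# -> x <b>y</b><br/>
```

Because attributes are left in place, something like `<a href="u" onclick="e()">` still passes through unchanged, so a separate step would be needed for that requirement.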